Induction is not optional (if you’re using inferential statistics): reply to Lakens

A few months ago, I posted an online preprint titled The Generalizability Crisis. Here’s the abstract:

Most theories and hypotheses in psychology are verbal in nature, yet their evaluation overwhelmingly relies on inferential statistical procedures. The validity of the move from qualitative to quantitative analysis depends on the verbal and statistical expressions of a hypothesis being closely aligned—that is, that the two must refer to roughly the same set of hypothetical observations. Here I argue that most inferential statistical tests in psychology fail to meet this basic condition. I demonstrate how foundational assumptions of the “random effects” model used pervasively in psychology impose far stronger constraints on the generalizability of results than most researchers appreciate. Ignoring these constraints dramatically inflates false positive rates and routinely leads researchers to draw sweeping verbal generalizations that lack any meaningful connection to the statistical quantities they’re putatively based on. I argue that failure to consider generalizability from a statistical perspective lies at the root of many of psychology’s ongoing problems (e.g., the replication crisis), and conclude with a discussion of several potential avenues for improvement.

I submitted the paper to Behavioral and Brain Sciences, and recently received 6 (!) generally positive reviews. I’m currently in the process of revising the manuscript in response to a lot of helpful feedback (both from the BBS reviewers and a number of other people). In the interim, however, I’ve decided to post a response to one of the reviews that I felt was not helpful, and instead has had the rather unfortunate effect of derailing some of the conversation surrounding my paper.

The review in question is by Daniel Lakens, who, in addition to being one of the BBS reviewers, also posted his review publicly on his blog. While I take issue with the content of Lakens’s review, I’m a fan of open, unfiltered commentary, so I appreciate Daniel taking the time to share his thoughts, and I’ve done the same here. In the rather long piece that follows, I argue that Lakens’s criticisms of my paper stem from an incoherent philosophy of science, and that once we amend that view to achieve coherence, it becomes very clear that his position doesn’t contradict the argument laid out in my paper in any meaningful way—in fact, if anything, the former is readily seen to depend on the latter.

Lakens makes five main points in his review. My response also has five sections, but I’ve moved some arguments around to give the post a better flow. I’ve divided things up into two main criticisms (mapping roughly onto Lakens’s points 1, 4, and 5), followed by three smaller ones you should probably read only if you’re entertained by petty, small-stakes academic arguments.

Bad philosophy

Lakens’s first and probably most central point can be summarized as a concern with (what he sees as) a lack of philosophical grounding, resulting in some problematic assumptions. Lakens argues that my paper fails to respect a critical distinction between deduction and induction, and consequently runs aground by assuming that scientists (or at least, psychologists) are doing induction when (according to Lakens) they’re doing deduction. He suggests that my core argument—namely, that verbal and statistical hypotheses have to closely align in order to support sensible inference—assumes a scientific project quite different from what most psychologists take themselves to be engaged in.

In particular, Lakens doesn’t think that scientists are really in the business of deriving general statements about the world on the basis of specific observations (i.e., induction). He thinks science is better characterized as a deductive enterprise, where scientists start by positing a particular theory, and then attempt to test the predictions they wring out of that theory. This view, according to Lakens, does not require one to care about statistical arguments of the kind laid out in my paper. He writes:

Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments'”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they are positing that “cleanliness reduces the severity of moral judgments”, they collected data and performed and empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.

Lakens’s position is that theoretical hypotheses are not inferred from the data in a bottom-up, post-hoc way—i.e., by generalizing from finite observations to a general regularity—rather, they’re formulated in advance of the data, which is then only used to evaluate the tenability of the theoretical hypothesis. This, in his view, is how we should think about what psychologists are doing—and he credits this supposedly deductivist view to philosophers of science like Popper and Lakatos:

Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that “cleanliness reduces the severity of moral judgments””. This approach to science flies right in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.”

Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a-priori I doubt it will be very successful.

For reasons that will become clear shortly, I think Lakens’s appeal to Popper and Lakatos here is misguided—those philosophers’ views actually have very little resemblance to the position Lakens stakes out for himself. But let’s start with the distinction Lakens draws between induction and deduction, and the claim that the latter provides an alternative to the former—i.e., that psychologists can avoid making inductive claims if they simply construe what they’re doing as a form of deduction. While this may seem like an intuitive claim at first blush, closer inspection quickly reveals that, far from psychologists having a choice between construing the world in deductive versus inductive terms, they’re actually forced to embrace both forms of reasoning, working in tandem.

There are several ways to demonstrate this, but since Lakens holds deductivism in high esteem, we’ll start out from a strictly deductive position, and then show why our putatively deductive argument eventually requires us to introduce a critical inductive step in order to make any sense out of how contemporary psychology operates.

Let’s start with the following premise:

P1: If theory T is true, we should confirm prediction P

Suppose we want to build a deductively valid argument that starts from the above premise, which seems pretty foundational to hypothesis-testing in psychology. How can we embed P1 into a valid syllogism, so that we can make empirical observations (by testing P) and then update our belief in theory T? Here’s the most obvious deductively valid way to complete the syllogism:

P1: If theory T is true, we should confirm prediction P
P2: We fail to confirm prediction P
C: Theory T is false

So stated, this modus tollens captures the essence of “naive” Popperian falsificationism: what scientists do (or ought to do) is attempt to disprove their hypotheses. On this view, if a theory T legitimately entails P, then disconfirming P is sufficient to falsify T. Once that’s done, a scientist can just pack it up and happily move on to the next theory.

Unfortunately, this account, while intuitive and elegant, fails miserably on the reality front. It simply isn’t how scientists actually operate. The problem, as Lakatos famously pointed out, is that the “core” of a theory T never strictly entails a prediction P by itself. There are invariably other auxiliary assumptions and theories that need to hold true in order for the T → P conditional to apply. For example, observing that people walk more slowly out of a testing room after being primed with old age-related words than with youth-related words doesn’t provide any meaningful support for a theory of social priming unless one is willing to make a large number of auxiliary assumptions—for example, that experimenter knowledge doesn’t inadvertently bias participants; that researcher degrees of freedom have been fully controlled in the analysis; that the stimuli used in the two conditions don’t differ in some irrelevant dimension that can explain the subsequent behavioral change; and so on.

This “sophisticated falsificationism”, as Lakatos dubbed it, is the viewpoint that I gather Lakens thinks most psychologists implicitly subscribe to. And Lakens believes that the deductive nature of the reasoning articulated above is what saves psychologists from having to worry about statistical notions of generalizability.

Unfortunately, this is wrong. To see why, we need only observe that the Popperian and Lakatosian views frame their central deductive argument in terms of falsificationism: researchers can disprove scientific theories by failing to confirm predictions, but—as the Popper statement Lakens approvingly quotes suggests—they can’t affirmatively prove them. This constraint isn’t terribly problematic in heavily quantitative scientific disciplines where theories often generate extremely specific quantitative predictions whose failure would be difficult to reconcile with those theories’ core postulates. For example, Einstein predicted the gravitational redshift of light in 1907 on the basis of his equivalence principle, yet it took nearly 50 years to definitively confirm that prediction via experiment. At the time it was formulated, Einstein’s prediction would have made no sense except in light of the equivalence principle—so the later confirmation of the prediction provided very strong corroboration of the theory (and, by the same token, a failure to experimentally confirm the existence of redshift would have dealt general relativity a very serious blow). Thus, at least in those areas of science where it’s possible to extract extremely “risky” predictions from one’s theories (more on that later), it seems perfectly reasonable to proceed as if critical experiments can indeed affirmatively corroborate theories—even if such a conclusion isn’t strictly deductively valid.

This, however, is not how almost any psychologists actually operate. As Paul Meehl pointed out in his seminal contrast of standard operating procedures in physics and psychology (Meehl, 1967), psychologists almost never make predictions whose disconfirmation would plausibly invalidate theories. Rather, they typically behave like confirmationists, concluding, on the basis of empirical confirmation of predictions, that their theories are supported (or corroborated). But this latter approach has a logic quite different from the (valid) falsificationist syllogism we saw above. The confirmationist logic that pervades psychology is better represented as follows:

P1: If theory T is true, we should confirm prediction P
P2: We confirm prediction P
C: Theory T is true

C would be a really nice conclusion to draw, if we were entitled to it, because, just as Lakens suggests, we would then have arrived at a way to deduce general theoretical statements from finite observations. Quite a trick indeed. But it doesn’t work; the argument is deductively invalid. If it’s not immediately clear to you why, consider the following argument, which has exactly the same logical structure:

Argument 1
P1: If God loves us all, the sky should be blue
P2: The sky is blue
C: God loves us all

We are not concerned here with the truth of the two premises, but only with the validity of the argument as a whole. And the argument is clearly invalid. Even if we were to assume P1 and P2, C still wouldn’t follow. Observing that the sky is blue (clearly true) doesn’t entail that God loves us all, even if P1 happens to be true, because there could be many other reasons the sky is blue that don’t involve God in any capacity (including, say, differential atmospheric scattering of different wavelengths of light), none of which are precluded by the stated premises.

Now you might want to say, well, sure, but Argument 1 is patently absurd, whereas the arguments Lakens attributes to psychologists are not nearly so silly. But from a strictly deductive standpoint, the typical logic of hypothesis testing in psychology is exactly as silly. Compare the above argument with a running example Lakens (following my paper) uses in his review:

Argument 2
P1: If the theory that cleanliness reduces the severity of moral judgments is true, we should observe condition A > condition B, p < .05
P2: We observe condition A > condition B, p < .05
C: Cleanliness reduces the severity of moral judgments

Subjectively, you probably find this argument much more compelling than the God-makes-the-sky-blue version in Argument 1. But that’s because you’re thinking about the relative plausibility of P1 in the two cases, rather than about the logical structure of the argument. As a purportedly deductive argument, Argument 2 is exactly as bad as Argument 1, and for exactly the same reason: it affirms the consequent. C doesn’t logically follow from P1 and P2, because there could be any number of other potential premises (P3…Pk) that reflect completely different theories yet allow us to derive exactly the same prediction P.
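
If you’d rather see this mechanically than take my word for it, here’s a tiny brute-force check (purely illustrative) that enumerates every possible assignment of truth values to T (“the theory is true”) and P (“the prediction is confirmed”), and asks whether any assignment makes all the premises true while the conclusion is false:

```python
from itertools import product

def valid(premises, conclusion):
    # An argument form is valid iff no assignment of truth values makes
    # every premise true while the conclusion comes out false.
    return all(conclusion(T, P)
               for T, P in product([True, False], repeat=2)
               if all(prem(T, P) for prem in premises))

implies = lambda a, b: (not a) or b

# Modus tollens (the falsificationist syllogism): T -> P, not-P, therefore not-T
print(valid([lambda T, P: implies(T, P), lambda T, P: not P],
            lambda T, P: not T))   # True: the form is valid

# Affirming the consequent (Arguments 1 and 2): T -> P, P, therefore T
print(valid([lambda T, P: implies(T, P), lambda T, P: P],
            lambda T, P: T))       # False: the form is invalid
```

Modus tollens survives the search; the confirmationist form doesn’t, because the assignment “theory false, prediction confirmed anyway” satisfies both premises while falsifying the conclusion, which is precisely the scenario at issue in what follows.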

This propensity to pass off deductively nonsensical reasoning as good science is endemic to psychology (and, to be fair, many other sciences). The fact that the confirmation of most empirical predictions in psychology typically provides almost no support for the theories those predictions are meant to test does not seem to deter researchers from behaving as if affirmation of the consequent is a deductively sound move. As Meehl rather colorfully wrote all the way back in 1967:

In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network.

Meehl was hardly alone in taking a dim view of the kind of argument we find in Argument 2, which Lakens defends as a perfectly respectable “deductive” way to do psychology. Lakatos—the very same Lakatos that Lakens claims he “is on the side of”—was no fan of it either. Lakatos generally had very little to say about psychology, and it seems pretty clear (at least to me) that his views about how science works were rooted primarily in consideration of natural sciences like physics. But on the few occasions that he did venture an opinion about the “soft” sciences, he made it abundantly clear that he was not a fan. From Lakatos (1970):

This requirement of continuous growth … hits patched-up, unimaginative series of pedestrian ‘empirical’ adjustments which are so frequent, for instance, in modern social psychology. Such adjustments may, with the help of so-called ‘statistical techniques’, make some ‘novel’ predictions and may even conjure up some irrelevant grains of truth in them. But this theorizing has no unifying idea, no heuristic power, no continuity. They do not add up to a genuine research programme and are, on the whole, worthless.1

If we follow that footnote 1 after “worthless”, we find this:

After reading Meehl (1967) and Lykken (1968) one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and thereby a semblance of “scientific progress” where, in fact, there is nothing but an increase in pseudo-intellectual garbage. … It seems to me that most theorizing condemned by Meehl and Lykken may be ad hoc3. Thus the methodology of research programmes might help us in devising laws for stemming this intellectual pollution …

By ad hoc3, Lakatos means that social scientists regularly explain anomalous findings by concocting new post-hoc explanations that may generate novel empirical predictions, but don’t follow in any sensible way from the “positive heuristic” of a theory (i.e., the set of rules and practices that describe in advance how a researcher ought to interpret and respond to discrepancies). Again, here’s Lakatos:

In fact, I define a research programme as degenerating even if it anticipates novel facts but does so in a patched-up development rather than by a coherent, pre-planned positive heuristic. I distinguish three types of ad hoc auxiliary hypotheses: those which have no excess empirical content over their predecessor (‘ad hoc1’), those which do have such excess content but none of it is corroborated (‘ad hoc2’) and finally those which are not ad hoc in these two senses but do not form an integral part of the positive heuristic (‘ad hoc3’). … Some of the cancerous growth in contemporary social ‘sciences’ consists of a cobweb of such ad hoc3 hypotheses, as shown by Meehl and Lykken.

The above quotes are more or less the extent of what Lakatos had to say about psychology and the social sciences in his published work.

Now, I don’t claim to be able to read the minds of deceased philosophers, but in view of the above, I think it’s safe to say that Lakatos probably wouldn’t have appreciated Lakens claiming to be “on his side”. If Lakens wants to call the kind of view that considers Argument 2 a good way to do empirical science “deduction”, fine; but I’m going to refer to it as Lakensian deductivism from here on out, because it’s not deductivism in any sense that approximates the normal meaning of the word “deductive” (I mean, it’s actually deductively invalid!), and I suspect Popper, Lakatos, and Meehl might have politely (or maybe not so politely) asked Lakens to cease and desist from implying that they approve of, or share, his views.

Induction to the rescue

So far, things are not looking so good for a strictly deductive approach to psychology. If we follow Lakens in construing deduction and induction as competing philosophical worldviews, and insist on banishing any kind of inductive reasoning from our inferential procedures, then we’re stuck facing up to the fact that virtually all hypothesis testing done by psychologists is actually deductively invalid, because it almost invariably has the logical form captured in Argument 2. I think this is a rather unfortunate outcome, if you happen to be a proponent of a view that you’re trying to convince people merits the label “deduction”.

Fortunately, all is not lost. It turns out that there is a way to turn Argument 2 into a perfectly reasonable basis for doing empirical science of the psychological variety. Unfortunately for Lakens, it runs directly through the kinds of arguments laid out in my paper. To see that, let’s first observe that we can turn the logically invalid Argument 2 into a valid syllogism by slightly changing the wording of P1:

Argument 3
P1: If, and only if, cleanliness reduces the severity of moral judgments, we should find that condition A > condition B, p < .05
P2: We find that condition A > condition B, p < .05
C: Cleanliness reduces the severity of moral judgments

Notice the newly added words “and only if” in P1. They make all the difference! If we know that the prediction P can only be true if theory T is correct, then observing P does in fact allow us to deductively conclude that T is correct. Hooray!

Well, except that this little modification, which looks so lovely on paper, doesn’t survive contact with reality, because in psychology, it’s almost never the case that a given prediction could only have plausibly resulted from one’s favorite theory. Even if you think P1 is true in Argument 2 (i.e., the theory really does make that prediction), it’s clearly false in our updated Argument 3. There are lots of other reasons why we might observe the predicted result, p < .05, even if the theoretical hypothesis is false (i.e., if cleanliness doesn’t reduce the severity of moral judgment). For example, maybe the stimuli in condition A differ on some important but theoretically irrelevant dimension from those in B. Or maybe there are demand characteristics that seep through to the participants despite the investigators’ best efforts. Or maybe the participants interpret the instructions in some unexpected way, leading to strange results. And so on.

Still, we’re on the right track. And we can tighten things up even further by making one last modification: we replace our biconditional P1 above with the following probabilistic version:

Argument 4
P1: It’s unlikely that we would observe A > B, p < .05, unless cleanliness reduces the severity of moral judgments
P2: We observe A > B, p < .05
C1: It’s probably true that cleanliness reduces the severity of moral judgments

Some logicians might quibble with Argument 4, because replacing words like “all” and “only” with words like “probably” and “unlikely” requires some careful thinking about the relationship between logical and probabilistic inference. But we’ll ignore that here. Whatever modifications you need to make to enable your logic to handle probabilistic statements, I think the above is at least a sensible way for psychologists to proceed when testing hypotheses. If it’s true that the predicted result is unlikely unless the theory is true, and we confirm the prediction, then it seems reasonable to assert (with full recognition that one might be wrong) that the theory is probably true.
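
(For readers who do want the probabilistic reading spelled out: one standard way to cash it out is via Bayes’ rule. Writing E for the predicted result and T for the theory,

P(T | E) = P(E | T) · P(T) / [ P(E | T) · P(T) + P(E | ¬T) · P(¬T) ],

so observing E pushes the probability of T up only to the extent that P(E | ¬T)—the probability of getting the predicted result even though the theory is false—is genuinely small. That term is exactly what the rest of this section is about.)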

But now the other shoe drops. Because even if we accept that Argument 4 is (for at least some logical frameworks) valid, we still need to show that it’s sound. And soundness requires the updated P1 to be true. If P1 isn’t true, then the whole enterprise falls apart again; nobody is terribly interested in scientific arguments that are logically valid but built on false premises. We saw that P1 in Argument 2 was uncontroversial, but was embedded in a logically invalid argument. And conversely, P1 in Argument 3 was embedded in a logically valid argument, but was clearly indefensible. Now we’re suggesting that P1 in Argument 4, which sits somewhere in between Argument 2 and Argument 3, manages to capture the strengths of both of the previous arguments, while avoiding their weaknesses. But we can’t just assert this by fiat; it needs to be demonstrated somehow. So how do we do that?

The banal answer is that, at this point, we have to start thinking about the meanings of the words contained in P1, and not just about the logical form of the entire argument. Basically, we need to ask ourselves: is it really true that all other explanations for the predicted statistical result are, in the aggregate, unlikely?

Notice that, whether we like it or not, we are now compelled to think about the meaning of the statistical prediction itself. To evaluate the claim that the result A > B (p < .05) would be unlikely unless the theoretical hypothesis is true, we need to understand the statistical model that generated the p-values in question. And that, in turn, forces us to reason inductively, because inferential statistics is, by definition, about induction. The point of deploying inferential statistics, rather than confining oneself to describing the sampled measurements, is to generalize beyond the observed sample to a broader population. If you want to know whether the predicted p-value follows from your theory, you need to know whether the population your verbal hypothesis applies to is well approximated by the population your statistical model affords generalization to. If it isn’t, then there’s no basis for positing a premise like P1.

Once we’ve accepted this much—and to be perfectly blunt about it, if you don’t accept this much, you probably shouldn’t be using inferential statistics in the first place—then we have no choice but to think carefully about the alignment between our verbal and statistical hypotheses. Is P1 in Argument 4 true? Is it really the case that observing A > B, p < .05, would be unlikely unless cleanliness reduces the severity of moral judgments? Well that depends. What population of hypothetical observations does the model that generates the p-value refer to? Does it align with the population implied by the verbal hypothesis?
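
To make the stakes concrete, here’s a toy simulation of exactly this kind of mismatch (every number in it is invented for illustration; nothing hinges on the specifics). There is no true effect of condition, but each condition happens to use its own small sample of stimuli. An analysis that quietly treats the stimuli as fixed will “detect” a condition effect far more often than the nominal 5%; an analysis that treats stimuli as a sample from a population (done crudely here by using stimulus means as the units of analysis, as a stand-in for a proper mixed model) does not:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_experiment(n_subj=20, n_stim=10, sd_subj=0.5, sd_stim=0.5, sd_noise=1.0):
    # No true condition effect: any A vs. B difference reflects the particular
    # stimuli that happened to be sampled, plus trial noise.
    stim_a = rng.normal(0, sd_stim, n_stim)   # stimulus effects, condition A
    stim_b = rng.normal(0, sd_stim, n_stim)   # stimulus effects, condition B
    subj = rng.normal(0, sd_subj, n_subj)     # subject effects

    # every subject responds to every stimulus
    a = subj[:, None] + stim_a[None, :] + rng.normal(0, sd_noise, (n_subj, n_stim))
    b = subj[:, None] + stim_b[None, :] + rng.normal(0, sd_noise, (n_subj, n_stim))

    # "Stimuli as fixed": paired t-test on subject means; stimulus sampling is ignored
    p_fixed = stats.ttest_rel(a.mean(axis=1), b.mean(axis=1)).pvalue
    # "Stimuli as random": stimuli are the units of analysis
    p_random = stats.ttest_ind(a.mean(axis=0), b.mean(axis=0)).pvalue
    return p_fixed, p_random

p = np.array([one_experiment() for _ in range(2000)])
print("rejection rate, stimuli treated as fixed: ", (p[:, 0] < .05).mean())
print("rejection rate, stimuli treated as random:", (p[:, 1] < .05).mean())
```

The p-values in the first analysis aren’t “wrong”; they just answer a question about these particular twenty stimuli, not about cleanliness manipulations in general. Which of those two populations your verbal conclusion refers to is the alignment question at stake.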

This is the critical question one must answer, and there’s no way around it. One cannot claim, as Lakens tries to, that psychologists don’t need to worry about inductive inference, because they’re actually doing deduction. Induction and deduction are not in opposition here; they’re actually working in tandem! Even if you agree with Lakens and think that the overarching logic guiding psychological hypothesis testing is of the deductive form expressed in Argument 4 (as opposed to the logically invalid form in Argument 2, as Meehl suggested), you still can’t avoid the embedded inductive step captured by P1, unless you want to give up the use of inferential statistics entirely.

The bottom line is that Lakens—and anyone else who finds the flavor of so-called deductivism he advocates appealing—faces a dilemma. One way to deal with the fact that Lakensian deductivism is in fact deductively invalid is to lean into it and assert that, logic notwithstanding, this is just how psychologists operate, and the important thing is not whether or not the logic makes deductive sense if you scrutinize it closely, but whether it allows people to get on with their research in a way they’re satisfied with.

The upside of such a position is that it allows you to forever deflect just about any criticism of what you’re doing simply by saying “well, the theory seems to me to follow from the prediction I made”. The downside—and it’s a big one, in my opinion—is that science becomes a kind of rhetorical game, because at that point there’s pretty much nothing anybody else can say to disabuse you of the belief that you’ve confirmed your theory. The only thing that’s required is that the prediction make sense to you (or, if you prefer, to you plus two or three reviewers). A secondary consequence is that it also becomes impossible to distinguish the kind of allegedly scientific activity psychologists engage in from, say, postmodern scholarship, so a rather unwelcome conclusion of taking Lakens’s view seriously is that we may as well extend the label science to the kind of thing that goes on in journals like Social Text. Maybe Lakens is okay with this, but I very much doubt that this is the kind of worldview most psychologists want to commit themselves to.

The more sensible alternative is to accept that the words and statistics we use do actually need to make contact with a common understanding of reality if we’re to be able to make progress. This means that when we say things like “it’s unlikely that we would observe a statistically significant effect here unless our theory is true”, evaluation of such a statement requires that one be able to explain, and defend, the relationship between the verbal claims and the statistical quantities on which the empirical support is allegedly founded.

The latter, rather weak, assumption—essentially, that scientists should be able to justify the premises that underlie their conclusions—is all my paper depends on. Once you make that assumption, nothing more depends on your philosophy of science. You could be a Popperian, a Lakatosian, an inductivist, a Lakensian, or an anarchist… It really doesn’t matter, because, unless you want to embrace the collapse of science into postmodernism, there’s no viable philosophy of science under which scientists get to use words and statistics in whatever way they like, without having to worry about the connection between them. If you expect to be taken seriously as a scientist who uses inferential statistics to draw conclusions from empirical data, you’re committed to caring about the relationship between the statistical models that generate your p-values and the verbal hypotheses you claim to be testing. If you find that too difficult or unpleasant, that’s fine (I often do too!); you can just drop the statistics from your arguments, and then it’s at least clear to people that your argument is purely qualitative, and shouldn’t be accorded the kind of reception we normally reserve (fairly or unfairly) for quantitative science. But you don’t get to claim the prestige and precision that quantitation seems to confer on researchers while doing none of the associated work. And you certainly can’t avoid doing that work simply by insisting that you’re doing a weird, logically fallacious, kind of “deduction”.

Unfair to severity

Lakens’s second major criticism is that I’m too hard on the notion of severity. He argues that I don’t give the Popper/Meehl/Mayo risky prediction/severe testing school of thought sufficient credit, and that it provides a viable alternative to the kind of position he takes me to be arguing for. Lakens makes two main points, which I’ll dub Severity I and Severity II.

Severity I

First, Lakens argues that my dismissal of risky or severe tests as a viable approach in most of psychology is unwarranted. I’ll quote him at length here, because the core of his argument is embedded in some other stuff, and I don’t want to be accused of quoting out of context (note that I did excise one part of the quote, because I deal with it separately below):

Yarkoni’s criticism on the possibility of severe tests is regrettably weak. Yarkoni says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets can not be generalized to all studies in pychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an “axiom that needs no testing” (Mulaik, Raju, & Harshman, 1997).”. Altogether, I am very unconvinced by this cursory reference to crud makes a convincing point that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.

When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” … It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force between many of the more successful theories in social psychology (e.g., social identity theory), and it works fine.

There are several relatively superficial claims Lakens makes in these paragraphs that are either wrong or irrelevant. I’ll take them up below, but let me first address the central claim, which is that, contrary to the argument I make in my paper, risky prediction in the Popper/Meehl/Mayo sense is actually a viable strategy in psychology.

It’s instructive to note that Lakens doesn’t actually provide any support for this assertion; his argument is entirely negative. That is, he argues that I haven’t shown severity to be impossible. This is a puzzling way to proceed, because the most obvious way to refute an argument of the form “it’s almost impossible to do X“ is to just point to a few garden variety examples where people have, in fact, successfully done X. Yet at no point in Lakens’s lengthy review does he provide any actual examples of severe tests in psychology—i.e., of cases where the observed result would be extremely implausible if the favored theory were false. This omission is hard to square with his insistence that severe testing is a perfectly sensible approach that many psychologists already use successfully. Hundreds of thousands of papers have been published in psychology over the past century; if an advocate of a particular methodological approach can’t identify even a tiny fraction of the literature that has successfully applied that approach, how seriously should that view be taken by other people?

As background, I should note that Lakens’s inability to give concrete examples of severe testing isn’t peculiar to his review of my paper; in various interactions we’ve had over the last few years, I’ve repeatedly asked him to provide such examples. He’s obliged exactly once, suggesting this paper, titled Ego Depletion Is Not Just Fatigue: Evidence From a Total Sleep Deprivation Experiment by Vohs and colleagues.

In the sole experiment Vohs et al. report, they purport to test the hypothesis that ego depletion is not just fatigue (one might reasonably question whether there’s any non-vacuous content to this hypothesis to begin with, but that’s a separate issue). They proceed by directing participants who either have or have not been deprived of sleep to suppress their emotions while viewing disgusting video clips. In a subsequent game, they then ask the same participants to decide (seemingly incidentally) how loud a noise to blast an opponent with—a putative measure of aggression. The results show that participants who suppressed emotion selected louder volumes than those who did not, whereas the sleep deprivation manipulation had no effect.

I leave it as an exercise to the reader to decide for themselves whether the above example is a severe test of the theoretical hypothesis. To my mind, at least, it clearly isn’t; it fits very comfortably into the category of things that Meehl and Lakatos had in mind when discussing the near-total disconnect between verbal theories and purported statistical evidence. There are dozens, if not hundreds, of ways one might obtain the predicted result even if the theoretical hypothesis Vohs et al. articulate were utterly false (starting from the trivial observation that one could obtain the pattern the authors reported even if the two manipulations tapped exactly the same construct but were measured with different amounts of error). There is nothing severe about the test, and to treat it as such is to realize Meehl and Lakatos’s worst fears about the quality of hypothesis-testing in much of psychology.

To be clear, I did not suggest in my paper (nor am I here) that severe tests are impossible to construct in psychology. I simply observed that they’re not a realistic goal in most domains, particularly in “soft” areas (e.g., social psychology). I think I make it abundantly clear in the paper that I don’t see this as a failing of psychologists, or of their favored philosophy of science; rather, it’s intrinsic to the domain itself. If you choose to study extremely complex phenomena, where any given behavior is liable to be a product of an enormous variety of causal factors interacting in complicated ways, you probably shouldn’t expect to be able to formulate clear law-like predictions capable of unambiguously elevating one explanation above others. Social psychology is not physics, and there’s no reason to think that methodological approaches that work well when one is studying electrons and quarks should also work well when one is studying ego depletion and cognitive dissonance.

As for the problematic minor claims in the paragraphs I quoted above (you can skip down to the “Severity II” section if you’re bored or short on time)… First, the citations to Cohen, Lykken, and Meehl contain well-developed arguments to the same effect as my claim that “there are pervasive and typically very plausible competing explanations for almost every finding”. These arguments do not depend on what one means by “crud”, which is the subject of Orben & Lakens (2019). The only point relevant to my argument is that outcomes in psychology are overwhelmingly determined by many factors, so that it’s rare for a hypothesized effect in psychology to have no plausible explanation other than the authors’ preferred theoretical hypothesis. I think this is self-evidently true, and needs no further justification. But if you think it does require justification, I invite you to convince yourself of it in the following easy steps: (1) Write down 10 or 20 effects, chosen at random, that you feel are a reasonably representative sample of your field. (2) For each one, spend 5 minutes trying to identify alternative explanations for the predicted result that would be plausible even if the researcher’s theoretical hypothesis were false. (3) Observe that you were able to identify plausible confounds for all of the effects you wrote down. There, that was easy, right?

Second, it isn’t true that I stick to risky quantitative predictions. I explicitly note that risky predictions can be non-quantitative:

The canonical way to accomplish this is to derive from one’s theory some series of predictions—typically, but not necessarily, quantitative in nature—sufficiently specific to that theory that they are inconsistent with, or at least extremely implausible under, other accounts.

I go on to describe several potential non-quantitative approaches (I even cite Lakens!):

This does not mean, however, that vague directional predictions are the best we can expect from psychologists. There are a number of strategies that researchers in such fields could adopt that would still represent at least a modest improvement over the status quo (for discussion, see Meehl, 1990). For example, researchers could use equivalence tests (Lakens, 2017); predict specific orderings of discrete observations; test against compound nulls that require the conjunctive rejection of many independent directional predictions; and develop formal mathematical models that posit non-trivial functional forms between the input and output (Marewski & Olsson, 2009; Smaldino, 2017).

Third, what Lakens refers to as “triangulation” is, as far as I can tell, conceptually akin to the logical conjunction of effects suggested above, so again, it’s unfair to say that I oppose this idea. I support it—in principle. However, two points are worth noting. First, the practical barrier to treating conjunctive rejections as severe tests is that it requires researchers to actually hold their own feet to the fire by committing ahead of time to the specific conjunction that they deem a severe test. It’s not good enough to state ahead of time that the theory makes 6 predictions, and then, when the results confirm only 4 of those predictions, to generate some post-hoc explanation for the 2 failed predictions while still claiming that the theory managed to survive a critical test.

Second, as we’ve already seen, the mere fact that a researcher believes a test is severe does not actually make it so, and there are good reasons to worry that many researchers grossly overestimate the degree of support a particular statistical procedure (or conjunction of procedures) actually confers on a theory. For example, you might naively suppose that if your theory makes 6 independent directional predictions—implying a probability of 1 in 2^6, or about 1.6%, of getting all 6 right purely by chance—then joint corroboration of all your predictions provides strong support for your theory. But this isn’t generally the case, because many plausible competing accounts in psychology will tend to generate similarly-signed predictions. As a trivial example, when demand characteristics are present, they will typically tend to push in the direction of the researcher’s favored hypotheses.
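
Here’s a toy version of that point (all numbers invented): six directional predictions whose true effects are all zero, but whose estimates share a modest common bias, standing in for demand characteristics or any other nuisance factor that pushes everything in the researcher’s favored direction. The joint “confirmation” rate ends up nowhere near 1 in 64:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_predictions, n_obs = 10_000, 6, 30

all_confirmed = 0
for _ in range(n_sims):
    # a shared nuisance effect (e.g., demand characteristics) nudging everything upward
    bias = rng.normal(0.3, 0.1)
    # six effect estimates whose true values are all zero, each based on n_obs observations
    estimates = bias + rng.normal(0, 1 / np.sqrt(n_obs), n_predictions)
    all_confirmed += np.all(estimates > 0)

print("chance of 6/6 correct signs if the predictions were coin flips:", 0.5 ** 6)
print("simulated rate of 6/6 'confirmations' with a shared bias:", all_confirmed / n_sims)
```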

The bottom line is that, while triangulation is a perfectly sensible strategy in principle, deploying it in a way that legitimately produces severe tests of psychological theories does not seem any easier than the other approaches I mention—nor, again, does Lakens seem able to provide any concrete examples.

Severity II

Lakens’s second argument regarding severity (or my alleged lack of respect for it) is that I put the cart before the horse: whereas I focus largely on the generalizability of claims made on the basis of statistical evidence, Lakens argues that generalizability is purely an instrumental goal, and that the overarching objective is severity. He writes:

I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with their desire to inductively build a theory from all these single observations. They show the findings generalize, because it increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but a it follows from the goal to perform severe tests.

And:

Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, not vice versa. But then what is missing from the paper is a solid argument why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between the deductive approaches and induction is also not explained in a satisfactory manner.

As a purported criticism of my paper, I find this an unusual line of argument, because not only does it not contradict anything I say in my paper, it actually directly affirms it. In effect, Lakens is saying yes, of course it matters whether the statistical model you use maps onto your verbal hypothesis; how else would you be able to formulate a severe test of the hypothesis using inferential statistics? Well, I agree with him! My only objection is that he doesn’t follow his own argument far enough. He writes that “generalization as a means to severely test a prediction is common”, but he’s being too modest. It isn’t just common; for studies that use inferential statistics, it’s universal. If you claim to be using statistical results to test your theoretical hypotheses, you’re obligated to care about the alignment between the universes of observations respectively defined by your verbal and statistical hypotheses. As I’ve pointed out at length above, this isn’t a matter of philosophical disagreement (i.e., of some imaginary “inherent conflict between the deductive approaches and induction”); it’s definitional. Inferential statistics is about generalizing from samples to populations. How could you possibly assert that a statistical test of a hypothesis is severe if you have no idea whether the population defined by your statistical model aligns with the one defined by your verbal hypothesis? Can Lakens provide an example of a severe statistical test that doesn’t require one to think about what population of observations a model applies to? I very much doubt it.

For what it’s worth, I don’t think the severity of hypothesis testing is the only reason to worry about the generalizability of one’s statistical results. We can see this trivially, inasmuch as severity only makes sense in a hypothesis testing context, whereas generalizability matters any time inferential statistics (which make reference to some idealized population) are invoked. If you report a p-value from a linear regression model, I don’t need to know what hypothesis motivated the analysis in order to interpret the results, but I do need to understand what universe of hypothetical observations the statistical model you specified refers to. If Lakens wants to argue that statistical results are uninterpretable unless they’re presented as confirmatory tests of an a priori hypothesis, that’s his prerogative (though I doubt he’ll find many takers for that view). At the very least, though, it should be clear that his own reasoning gives one more, and not less, reason to take the arguments in my paper seriously.

Hopelessly impractical

[Attention conservation notice: the above two criticisms are the big ones; you can safely stop reading here without missing much. The stuff below is frankly more a reflection of my irritation at some of Lakens’s rhetorical flourishes than about core conceptual issues.]

A third theme that shows up repeatedly in Lakens’s review is the idea that the arguments I make, while perhaps reasonable from a technical standpoint, are far too onerous to expect real researchers to implement. There are two main strands of argument here. Both of them, in my view, are quite wrong. But one of them is wrong and benign, whereas the other is wrong and possibly malignant.

Impractical I

The first (benign) strand is summarized by Lakens’s Point 3, which he titles “theories and tests are not perfectly aligned in deductive approaches”. As we’ll see momentarily, “perfectly” is a bit of a weasel word that’s doing a lot of work for Lakens here. But his general argument is that you only need to care about the alignment between statistical and verbal specifications of a hypothesis if you’re an inductivist:

To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni is arguing that there has to be a direct correspondence between the theoretical model, and the statistical test. This is true in induction.

I’ve already spent several thousand words above explaining why this is simply false. To recap (I know I keep repeating myself, but this really is the crux of the whole issue): if you’re going to report inferential statistics and claim that they provide support for your verbal hypotheses, then you’re obligated to care about the correspondence between the test and the theory. This doesn’t require some overarching inductivist philosophy of science (which is fortunate, because I don’t hold one myself); it only requires you to believe that when you make statements of the form “statistic X provides evidence for verbal claim Y”, you should be able to explain why that’s true. If you can’t explain why the p-value (or Bayes Factor, etc.) from that particular statistical specification supports your verbal hypothesis, but a different specification that produces a radically different p-value wouldn’t, it’s not clear why anybody else should take your claims seriously. After all, inferential statistics aren’t (or at least, shouldn’t be) just a kind of arbitrary numerical magic we sprinkle on top of our words to get people to respect us. They mean things. So the alternative to caring about the relationship between inferential statistics and verbal claims is not, as Lakens seems to think, deductivism—it’s ritualism.

The tacit recognition of this point is presumably why Lakens is careful to write that “theories and tests are not perfectly aligned in deductive approaches” (my emphasis). If he hadn’t included the word “perfectly”, the claim would seem patently silly, since theories and tests obviously need to be aligned to some degree no matter what philosophical view one adopts (save perhaps for outright postmodernism). Lakens’s argument here only makes any sense if the reader can be persuaded that my view, unlike Lakens’s, demands perfection. But it doesn’t (more on that below).

Lakens then goes on to address one of the central planks of my argument, namely, the distinction between fixed and random factors (which typically has massive implications for the p-values one observes). He suggests that while the distinction is real, it’s wildly unrealistic to expect anybody to actually be able to respect it:

If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7) “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them, you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned.

You don’t need to read Paul Meehl’s Big Book of Logical Fallacies to see that Lakens is equivocating. He equates wanting to generalize beyond one’s sample with wanting to generalize “to all possible random factors”—as if the only two possible interpretations of an effect are that it either generalizes to all conceivable scenarios, or that it can’t be generalized beyond the sample at all. But this just isn’t true; saying that researchers should build statistical models that reflect their generalization intentions is not the same as saying that every mixed-effects model needs to include all variance components that could conceivably have any influence, however tiny, on the measured outcomes. Lakens presents my argument as a statistically pedantic, technically-correct-but-hopelessly-ineffectual kind of view—at which point it’s supposed to become clear to the reader that it’s just crazy to expect psychologists to proceed in the way I recommend. And I agree that it would be crazy—if that was actually what I was arguing. But it isn’t. I make it abundantly clear in my paper that aligning verbal and statistical hypotheses needn’t entail massive expansion of the latter; it can also (and indeed, much more feasibly) entail contraction of the former. There’s an entire section in the paper titled “Draw more conservative inferences” that begins with this:

Perhaps the most obvious solution to the generalizability problem is for authors to draw much more conservative inferences in their manuscripts—and in particular, to replace the hasty generalizations pervasive in contemporary psychology with slower, more cautious conclusions that hew much more closely to the available data. Concretely, researchers should avoid extrapolating beyond the universe of observations implied by their experimental designs and statistical models. Potentially relevant design factors that are impractical to measure or manipulate, but that conceptual considerations suggest are likely to have non-trivial effects (e.g., effects of stimuli, experimenter, research site, culture, etc.), should be identified and disclosed to the best of authors’ ability.

Contra Lakens, this is hardly an impractical suggestion; if anything, it offers to reduce many authors’ workload, because Introduction and Discussion sections are typically full of theoretical speculations that go well beyond the actual support of the statistical results. My prescription, if taken seriously, would probably shorten the lengths of a good many psychology papers. That seems pretty practical to me.

Moreover—and again contrary to Lakens’s claim—following my prescription would also dramatically reduce uncertainty rather than increasing it. Uncertainty arises when one lacks data to inform one’s claims or beliefs. If maximal certainty is what researchers want, there are few better ways to achieve that than to make sure their verbal claims cleave as closely as possible to the boundaries implicitly defined by their experimental procedures and statistical models, and hence depend on fewer unmodeled (and possibly unknown) variables.

Impractical II

The other half of Lakens’s objection from impracticality is to suggest that, even if the arguments I lay out have some merit from a principled standpoint, they’re of little practical use to most researchers, because I don’t do enough work to show readers how they can actually use those principles in their own research. Lakens writes:

The issues about including random factors is discussed in a more complete, and importantly, applicable, manner in Barr et al (2013). Yarkoni remains vague on which random factors should be included and which not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactory. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al (2013), which includes trade-offs between converging models (which Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic.

And:

As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher. It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting “more expansive models” – and provides some indication where to stop, or at least suggestions what an empirical research program would look like that tells us where to stop, and why.

And:

Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013; Clark, 1974). Yarkoni fails to provide any insights into where the balance between generalizing to everything, and generalizing to factors that matter, should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice.

There are many statements in Lakens’s review that made me shake my head, but the argument advanced in the above quotes is the only one that filled me (briefly) with rage. In part that’s because parts of what Lakens says here blatantly misrepresent my paper. For example, he writes that “Yarkoni just recommends ‘more expansive models’”, which is frankly a bit insulting given that I spend a full third of my paper talking about various ways to address the problem (e.g., by designing studies that manipulate many factors at once; by conducting meta-analyses over variance components; etc.).

Similarly, Lakens implies that Barr et al. (2013) gives better versions of my arguments, when actually the two papers are doing completely different things. Barr et al. (2013) is a fantastic paper, but it focuses almost entirely on the question of how one should specify and estimate mixed-effects models, and says essentially nothing about why researchers should think more carefully about random factors, or about which ones they ought to include in their models. One way to think about it is that Barr et al. (2013) is the paper you should read after my paper has convinced you that it actually matters a lot how you specify your random-effects structure. Of course, if you’re already convinced of the latter (which many people are, though Lakens himself doesn’t seem to be), then yeah, you should maybe skip my paper: you’re not the intended audience.

In any case, the primary reason I found this part of Lakens’s review upsetting is that the above quotes capture a very damaging, but unfortunately also very common, sentiment in psychology, which is the apparent belief that somebody—and perhaps even nature itself—owes researchers easy solutions to extremely complex problems.

Lakens writes that “Yarkoni remains vague on which random factors should be included and which not”, and that “It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting ‘more expansive models’”. Well, on a superficial level, I agree with Lakens: I do remain vague on which factors should be included, and it would be lovely if I were able to say something like “here, Daniel, I’ve helpfully identified for you the five variance components that you need to care about in all your studies”. But I can’t say something like that, because it would be a lie. There isn’t any such one-size-fits-all prescription—and trying to pretend there is would, in my view, be deeply counterproductive. Psychology is an enormous field full of people trying to study a very wide range of complex phenomena. There is no good reason to suppose that the same sources of variance will assume even approximately the same degree of importance across broad domains, let alone individual research questions. Should psychophysicists studying low-level visual perception worry about the role of stimulus, experimenter, or site effects? What about developmental psychologists studying language acquisition? Or social psychologists studying cognitive dissonance? I simply don’t know.

One reason I don’t know, as I explain in my paper, is that the answer depends heavily on what conclusions one intends to draw from one’s analyses—i.e., on one’s generalization intentions. I hope Lakens would agree with me that it’s not my place to tell other people what their goal should be in doing their research. Whether or not a researcher needs to model stimuli, sites, tasks, etc. as random factors depends on what claim they intend to make. If a researcher intends to behave as if their results apply to a population of stimuli like the ones they used in their study, and not just to the exact sampled stimuli, then they should use a statistical model that reflects that intention. But if they don’t care to make that generalization, and are comfortable drawing no conclusions beyond the confines of the tested stimuli, then maybe they don’t need to worry about explicitly modeling stimulus effects at all. Either way, what determines whether or not a statistical model is or isn’t appropriate is whether or not that model adequately captures what a researcher claims it’s capturing—not whether Tal Yarkoni has data suggesting that, on average, site effects are large in one area of social psychology but not large in another area of psychophysics.
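
To make that contrast concrete, here is a minimal sketch (not from the paper; the variable names, numbers, and the use of Python’s statsmodels are all my own illustrative choices) of what the two intentions look like as model specifications. The first model treats only subjects as random; the second adds crossed stimulus effects, which is the specification that matches a claim about stimuli like these rather than about these particular stimuli.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical dataset: 30 subjects each respond to 20 stimuli, with
    # stimuli nested in two conditions. Everything here is simulated purely
    # for illustration.
    rng = np.random.default_rng(0)
    n_subj, n_stim = 30, 20
    df = pd.DataFrame([
        {"subject": s, "stimulus": i, "condition": i % 2}
        for s in range(n_subj) for i in range(n_stim)
    ])
    subj_fx = rng.normal(0, 0.5, n_subj)   # subject-level variability
    stim_fx = rng.normal(0, 0.5, n_stim)   # stimulus-level variability
    df["y"] = subj_fx[df["subject"]] + stim_fx[df["stimulus"]] + rng.normal(0, 1, len(df))

    # (1) Subjects-only random effects: licenses generalization to new
    # subjects, but only to these exact 20 stimuli.
    m_subjects_only = smf.mixedlm("y ~ condition", df, groups="subject").fit()

    # (2) Crossed subject and stimulus random effects: the specification that
    # matches a claim about a population of stimuli "like these". statsmodels
    # handles crossed factors via variance components within a single dummy
    # group; re_formula="0" drops the per-group random intercept.
    df["everyone"] = 1
    vc = {"subject": "0 + C(subject)", "stimulus": "0 + C(stimulus)"}
    m_crossed = smf.mixedlm("y ~ condition", df, groups="everyone",
                            re_formula="0", vc_formula=vc).fit()

    print(m_subjects_only.summary())
    print(m_crossed.summary())

Nothing in the code can tell you which of these is the “right” specification; that depends entirely on the generalization you intend to make when you write up the results.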

The other reason I can’t provide concrete guidance about what factors psychologists ought to model as random is that attempting to establish even very rough generalizations of this sort would involve an enormous amount of work—and the utility of that work would be quite unclear, given how contextually specific the answers are likely to be. Lakens himself seems to recognize this; at one point in his review, he suggests that the topic I address “probably needs a book length treatment to do it justice.” Well, that’s great, but what are working researchers supposed to do in the meantime? Is the implication that psychologists should feel free to include whatever random effects they do or don’t feel like in their models until such time as someone shows up with a compendium of variance component estimates that apply to different areas of psychology? Does Lakens also dismiss papers seeking to convince people that it’s important to consider statistical power when designing studies, unless those papers also happen to provide ready-baked recommendations for what an appropriate sample size is for different research areas within psychology? Would he also conclude that there’s no point in encouraging researchers to define “smallest effect sizes of interest”, as he himself has done in the past, unless one can provide concrete recommendations for what those numbers should be?

I hope not. Such a position would amount to shooting the messenger. The argument in my paper is that model specification matters, and that researchers need to think about that carefully. I think I make that argument reasonably clearly and carefully. Beyond that, I don’t think it’s my responsibility to spend the next N years of my own life trying to determine what factors matter most in social, developmental, or cognitive psychology, just so that researchers in those fields can say, “thanks, your crummy domain-general estimates are going to save me from having to think deeply about what influences matter in my own particular research domain”. I think it’s every individual researcher’s job to think that through for themselves, if they expect to be taken seriously.

Lastly, and at the risk of being a bit petty (sorry), I can’t resist pointing out what strikes me as a rather serious internal contradiction between Lakens’s claim that my arguments are unhelpful unless they come with pre-baked variance estimates, and his own stated views about severity. On the one hand, Lakens claims that psychologists ought to proceed by designing studies that subject their theoretical hypotheses to severe tests. On the other hand, he seems to have no problem with researchers mindlessly following field-wide norms when specifying their statistical models (e.g., modeling only subjects as random effects, because those are the current norms). I find these two strands of thought difficult to reconcile. As we’ve already seen, the severity of a statistical procedure as a test of a theoretical hypothesis depends on the relationship between the verbal hypothesis and the corresponding statistical specification. How, then, could a researcher possibly feel confident that their statistical procedure constitutes a severe test of their theoretical hypothesis, if they’re using an off-the-shelf model specification and have no idea whether they would have obtained radically different results if they had randomly sampled a different set of stimuli, participants, experimenters, or task operationalizations?

Obviously, they can’t. Having to think carefully about what the terms in one’s statistical model mean, how they relate to one’s theoretical hypothesis, and whether those assumptions are defensible, isn’t at all “impractical”; it’s necessary. If you can’t explain clearly why a model specification that includes only subjects as random effects constitutes a severe test of your hypothesis, why would you expect other people to take your conclusions at face value?
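
To see that this isn’t just rhetorical hand-wringing, here is an illustrative simulation (the variance components and sample sizes are arbitrary, chosen only to make the pattern visible): when stimuli genuinely vary but the analysis treats only subjects as random, a conventional by-subjects paired t-test rejects a true null hypothesis far more often than the nominal 5% of the time.

    import numpy as np
    from scipy.stats import ttest_rel

    rng = np.random.default_rng(1)
    n_sims, n_subj, n_stim = 2000, 30, 10          # 10 stimuli per condition
    subj_sd, stim_sd, resid_sd = 0.5, 0.5, 1.0     # hypothetical variance components
    false_positives = 0

    for _ in range(n_sims):
        # Each simulated "experiment" samples new stimuli; the true condition
        # effect is exactly zero.
        stim_a = rng.normal(0, stim_sd, n_stim)
        stim_b = rng.normal(0, stim_sd, n_stim)
        subj = rng.normal(0, subj_sd, n_subj)
        # Average each subject's trials within condition -- the standard
        # "by-subjects" aggregation that treats only subjects as random.
        mean_a = (subj[:, None] + stim_a[None, :] +
                  rng.normal(0, resid_sd, (n_subj, n_stim))).mean(axis=1)
        mean_b = (subj[:, None] + stim_b[None, :] +
                  rng.normal(0, resid_sd, (n_subj, n_stim))).mean(axis=1)
        if ttest_rel(mean_a, mean_b).pvalue < 0.05:
            false_positives += 1

    print(f"False positive rate: {false_positives / n_sims:.2f} (nominal rate: 0.05)")

Plug in different (equally hypothetical) variance components and the inflation shrinks or grows accordingly; the point is that you can’t know which situation you’re in without explicitly thinking about the stimulus (or task, or site) variance your claim implicitly averages over.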

Trouble with titles

There’s one last criticism Lakens raises in his review of my paper. It concerns claims I make about the titles of psychology papers:

This is a minor point, but I think a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed the discussion contains a ‘constraint on generalizability’ in the discussion, following (Simons et al., 2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.”. Is Yarkoni not happy with these two sentence clearly limiting the generalizability in the discussion?

I was initially going to respond to this in detail, but ultimately decided against it, because (a) by Lakens’s own admission, it’s a minor concern; (b) this is already very long as-is; and (c) while it’s a minor point in the context of my paper, I think this issue has some interesting and much more general implications for how we think about titles. So I’ve decided I won’t address it here, but will eventually take it up in a separate piece that gives it a more general treatment, and that includes a kind of litmus test one can use to draw reasonable conclusions about whether or not a title is appropriate. But, for what it’s worth, I did do a sweep through the paper in the process of revision, and have moderated some of the language.

Conclusion

Daniel Lakens argues that psychologists don’t need to care much, if at all, about the relationship between their statistical model specifications and their verbal hypotheses, because hypothesis testing in psychology proceeds deductively: researchers generate predictions from their theories, and then update their confidence in their theories on the basis of whether or not those predictions are confirmed. This all sounds great until you realize that those predictions are almost invariably evaluated using inferential statistical methods that are inductive by definition. So long as psychologists are relying on inferential statistics as decision aids, there can be no escape from induction. Deduction and induction are not competing philosophies or approaches; the standard operating procedure in psychology is essentially a hybrid of the two.

If you don’t like the idea that the ability to appraise a verbal hypothesis using statistics depends critically on the ability to understand and articulate how the statistical terms map onto the verbal ideas, that’s fine; an easy way to solve that problem is to just not use inferential statistics. That’s a perfectly reasonable position, in my view (and one I discuss at length in my paper). But once you commit yourself to relying on things like p-values and Bayes Factors to help you decide what you believe about the world, you’re obligated to think about, justify, and defend your statistical assumptions. They aren’t, or shouldn’t be, just a kind of pedantic technical magic you can push-button sprinkle on top of your favorite verbal hypotheses to make them really stick.

if natural selection goes, so does most everything else

Jerry Fodor and Massimo Piattelli-Palmarini have a new book out entitled What Darwin Got Wrong. The book has, to put it gently, not been very well received (well, the creationists love it). Its central thesis is that natural selection fails as a mechanism for explaining observable differences between species, because there’s ultimately no way to conclusively determine whether a given trait was actively selected for, or if it’s just a free-rider that happened to be correlated with another trait that truly was selected for. For example, we can’t really know why polar bears are white: it could be that natural selection favored white fur because it allows the bears to blend into their surroundings better (presumably improving their hunting success), or it could be that bears with sharper teeth happen to have white fur, or that smaller, less energetic bears who need to eat less often tend to have white fur, or that a mutant population of polar bears who happened to be white also happened to have a resistance to some deadly disease that wiped out all non-white polar bears, or… you get the idea.

If this sounds like pretty silly reasoning to you, you’re not alone. Virtually all of the reviews (or at least, those written by actual scientists) have resoundingly panned Fodor and Piattelli-Palmarini for writing a book about evolution with very little apparent understanding of evolution. Since I haven’t read the book, and can’t claim much knowledge of evolutionary biology, I’m not going to weigh in with a substantive opinion, except to say that, based on the reviews I’ve read, along with an older article of Fodor’s that makes much the same argument, I don’t see any reason to disagree with the critics. The most elegant critique I’ve come across is Block and Kitcher’s review of the book in the Boston Review:

The basic problem, according to Fodor and Piattelli-Palmarini, is that the distinction between free-riders and what they ride on is “invisible to natural selection.” Thus stated, their objection is obscure because it relies on an unfortunate metaphor, introduced by Darwin. In explaining natural selection, the Origin frequently resorts to personification: “natural selection is daily and hourly scrutinising, throughout the world, every variation, even the slightest” (emphasis added). When they talk of distinctions that are “invisible” to selection, they continue this personification, treating selection as if it were an observer able to choose among finely graded possibilities. Central to their case is the thesis that Darwinian evolutionary theory must suppose that natural selection can make the same finely graded discriminations available to a human (or divine?) observer.

Neither Darwin, nor any of his successors, believes in the literal scrutiny of variations. Natural selection, soberly presented, is about differential success in leaving descendants. If a variant trait (say, a long neck or reduced forelimbs) causes its bearer to have a greater number of offspring, and if the variant is heritable, then the proportion of organisms with the variant trait will increase in subsequent generations. To say that there is “selection for” a trait is thus to make a causal claim: having the trait causes greater reproductive success.

Causal claims are of course familiar in all sorts of fields. Doctors discover that obesity causes increased risk of cardiac disease; atmospheric scientists find out that various types of pollutants cause higher rates of global warming; political scientists argue that party identification is an important cause of voting behavior. In each of these fields, the causes have correlates: that is why causation is so hard to pin down. If Fodor and Piattelli-Palmarini believe that this sort of causal talk is “conceptually flawed” or “incoherent,” then they have a much larger opponent than Darwinism: their critique will sweep away much empirical inquiry.

This really seems to me to get at the essence of the claim, and why it’s silly. Fodor and Piattelli-Palmarini are essentially claiming that natural selection is bunk because you can never be absolutely sure that natural selection operated on the trait you think it operated on. But scientists don’t require absolute certainty to hold certain beliefs about the way the world works; we just require that those beliefs seem somewhat more plausible than other available alternatives. If you take absolute certainty as a necessary criterion for causal inference, you can’t do any kind of science, period.

It’s not just evolutionary biology that suffers; if you held psychologists to the same standards, for example, we’d be in just as much trouble, because there’s always some potential confound that might explain away a putative relation between an experimental manipulation and a behavioral difference. If nothing else, you can always blame sampling error: you might think that giving your subjects 200 mg of caffeine was what caused them to have to go to the bathroom every fifteen minutes (er, to report decreased levels of subjective fatigue), but maybe you just happened to pick a particularly sleep-deprived control group. That’s surely no less plausible an explanation than some of the alternative accounts for the whiteness of the polar bear suggested above. But if you take this type of argument seriously, you can pretty much throw any type of causal inference (and hence, most science) out the window. So it’s hardly surprising that Fodor and Piattelli-Palmarini’s new book hasn’t received a particularly warm reception. Most of the critics are under the impression that science is a pretty valuable enterprise, and seems to work reasonably well most of the time, despite the rampant uncertainty that surrounds most causal inferences.

Lest you think there must be some subtlety to Fodor’s argument the critics have missed, or that there’s some knee-jerk defensiveness going on on the part of, well, damned near every biologist who’s cared to comment, I leave you with this gem, from a Salon interview with Fodor (via Jerry Coyne):

Creationism isn’t the only doctrine that’s heavily into post-hoc explanation. Darwinism is too. If a creature develops the capacity to spin a web, you could tell a story of why spinning a web was good in the context of evolution. That is why you should be as suspicious of Darwinism as of creationism. They have spurious consequence in common. And that should be enough to make you worry about either account.

I guess if you really believed that every story you could come up with about web-spinning was just as good as any other, and that there was no way to discriminate between them empirically (a notion Coyne debunks), this might seem reasonable. But then, you can always make up just-so stories to fit any set of facts. If you don’t allow for the fact that some stories have better evidential support than others, you indeed have no way to discriminate creationism from science. But I think it’s a sad day if Jerry Fodor, who’s made several seminal contributions to cognitive science and the philosophy of science, really believes that.