hey, I wrote a book!

It’s a collection of short stories. 18 of them, to be exact. Some of them (3 of them, to be exact) have been previously posted here (specifically, this one, this one, and this one). The other 15 have appeared nowhere. The stories are about all kinds of things (meteorologists on a murderous planet; grocery store employees staging a revolt; a series of dreams about flies; etc. etc.), but the common theme is that they’re all absurd to some degree. In the absurdist fiction sense, not in the “what the fuck did I just read” sense—though I admit there may be some overlap.

Anyway, if you like any of those 3 stories I linked to above (particularly the first), you’ll probably like this collection. And if you don’t like any of them, you probably aren’t going to like this collection. Which is totally fine; I realize that very short surrealist/absurdist fiction (most of these stories are between 1,500 and 3,000 words) isn’t most people’s cup of tea. Don’t worry, my feelings won’t be hurt—I barely have any to begin with.

But if you do want to read this book (yay!), you have two options. Option the first is to buy either a Kindle eBook or a paperback from Amazon. These will set you back $2.99 and $4.99, respectively (or roughly the same amount in other currencies, from other countries’ Amazon portals). Short story writing isn’t exactly a lucrative gig even if you’re good at it (which I probably am not), and I have a full-time job that comfortably pays all my bills, so I’m not selling this on Amazon for money; I’m selling it on Amazon to be taken slightly more seriously (yes, I realize that on the Great Ladder of Seriousness, vanity publishing your book on Amazon is only like the second rung off the floor, but still). All proceeds from my Amazon book sales will be donated to the ACLU. This way you can feel good about yourself as you fork over your hard-earned cash for a book you probably won’t finish, and you’ll still give me the warm fuzzy feeling (I lied, I do have them) that someone out there actually cares to read my writing.

Option the second is for people who for any reason don’t want to buy the book on Amazon (you can’t afford it, you hate Amazon, you think the ACLU is a tool of Satan, you don’t want to encourage me any further… these are all perfectly valid reasons). If this is you, just email me and ask for a copy, and I’ll send you back a PDF, no questions asked. I will probably ask you to give a small amount of money to a charity of your choice if and when you can afford it, but obviously, that’s at your discretion—I lack the capacity to stick my head through your screen and shake it at you sadly until you guiltily pony up a few bucks for cancer or the rainforest or whatever. Though, come to think of it, that’s not a bad premise for an absurdist short story. Maybe in the next collection.

Tropic of Zamza

A piece of short fiction very loosely inspired by the development of GPT-3. 100% human-authored.


Tropic of Zamza was one of the better novels Vela Hirasawa had proxywritten this quarter. It was the fourth book in her Zamza series, and also the sixth in her Explorations of Tarnasia docuvolumes. Stylistically, it mostly resembled the previous Zamza novels—which is to say, it was written in a style that fused Garcia Marquez and Henry Miller, with a splash of the famous (now retired) proxywriter, RevalationZ. As proxywritten novels went, this one represented a particularly large investment of time and money. It had taken Vela over 20,000 cloud-hours to generate the initial text to her specifications, and another ten days of manual effort to fine-tune and edit it to her satisfaction. And for the first time in her career, she’d licensed the use—albeit sparingly—of the new state-of-the-art Neural Transcriptor model, instead of relying solely on the 12th-generation AGPT architecture she knew and loved so well.

Now, Vela stood on stage at the Center for Literary Commentary, waiting to find out what the world—or at least, the relevant subset of the world—thought of her novel. She took a deep breath and steeled herself for review. Once she released the novel, she’d have five to ten minutes to gauge the reaction in the room from the on-board aggregator. Then the questions would begin.

Strange that I still get nervous, she thought. She’d done it a good forty or forty-five times now. But then, one never really got used to LitComm, did they? Relaxing around the proxycritics was an excellent way to have one’s writing career cut short. Vela had no interest in returning to her former career in proxyad design. Proud as she was of the work she’d done on the Nike Icarus account—the ad campaign that New York Magazine had once dubbed, in something of a puff piece, “a turning point in advertising”—she liked to think her novels would be her real legacy.

“Whenever you’re ready,” the chief critic said. Vela nodded. She pulled up Tropic of Zamza on her phone and swiped right, releasing Zamza’s fourth installment into the wild. Then she waited.

Thirty seconds after release, the board began to light up. There were at least two hundred critics in the audience today, Vela observed, probably including a good ten or fifteen speedreaders. Vela had never fully made peace with the idea that her livelihood depended in part on completely algorithmic book reviews. But she’d stopped objecting out loud once her first LitComm check arrived. Now, the first reviews of the new novel were no doubt already rolling out on the internet—and with any luck, so too were the sales.

* * *

Book Review: Tropic of Zamza

 Tropic of Zamza is a 332-page novel published under the Guild LitComm imprint and written by award-winning proxywriter Vela Hirasawa, author of 382 previous works of mass-market proxyfiction, including the Monthly Booker Prize semifinalist, In Venice Motor. The novel is estimated at a full 8% human-authored content. Our Zeitghost™ model assigns it the following genres: 33% historical fiction; 23% magical realism (category A); 18% crime fiction; 16% other or indeterminate. Alternative categorizations following the Richler and J. Ghosh ontologies may be found in Appendix I.

 Initial rating is 81, but confidence in this judgment is unusually low (95% prediction interval: 16 to 98), likely due to unconventional stylistic or plot elements that require further analysis. Application of the Liu canonicity detection algorithm returns a score of only 2, supporting this conclusion. Proxyreaders with low risk tolerance are encouraged to delay their purchase of this novel, pending a more detailed content analysis to follow shortly (estimated publication time: 16:31 EST).

—ZeitGhost™ SpeedReviews: Fiction

* * *

Vela watched the sparklines squiggle their way across the board. Below it, a histogram updated itself in real-time as reviews came in. There were thirteen now. Fourteen. Fifteen. Sixteen, seventeen…twenty.

The early reception was mostly positive, Vela noted with some satisfaction. The board showed a score of 7.1. Not great, but not bad. About what she’d expected. There was always some risk involved in releasing a sequel at LitComm—and this one was more derivative than most, Vela conceded to herself. The more conservative proxyreaders appreciated the comfort of familiarity, but the reward was modest, and the novelty-seekers hated any hint of repetition. The latter were gaining ground at LitComm. Even if things went well today, Vela knew she’d have to retire Zamza in another two or three novels at most.

On the board, the sparklines suddenly plunged as the second wave of reviews arrived. The speedreaders would have finished their evaluations by now, Vela realized. The incoming wave would be made up mostly of criticalists and predictivists. She’d done well with them the last few times out, but the criticalists were fickle; they were swayed by the flavor of the month in academic post-proxyist theory. And the predictivists… well, their judgments had more to do with market conditions than with literature. Most of them would happily blacklist a novel by Frances goddamn Sitakis herself if they thought it could sway the prediction markets.

The score kept dropping, and an unfamiliar anxiety began to gnaw at Vela. What are they saying? she wondered.

* * *

Book review: Tropic of Zamza 

Tropic of Zamza is a dark book. Dark, dark, dark. That’s about as much as I can reveal about the plot—if one can even call it that. It’s unclear why Hirasawa—an accomplished writer with a solid pedigree in both advertising and mass market fiction—chose this particular moment in time to chart a radically different course for herself. There have, to be sure, been prior departures of this nature; who can forget the scandal DeLoris caused by introducing an invisibility cloak, as a major plot device, to an otherwise conventional work of Renaissance-period historical fiction? But such stunts are typically undertaken by novices aspiring to make themselves a name, not experts looking to ruin one.

The book begins with… [1250 additional words omitted]

—Victoria Terlinsky (@LiteRateUr)

* * *

By the time the reading period was over, six minutes and thirty seconds after release, Tropic of Zamza had plunged from the early score of 7—a number that would have all but guaranteed a minimum of ten million copies sold, and roughly where Vela had optimistically hoped to end the day—to a 3.2.

An utter disaster, Vela thought, feeling the gravity of the situation tug at her gut. The lowest review she’d received since making it to LitComm, bar none. The kind of rating that would almost certainly send her back to the proxyads unless she could turn things around during the Q&A period.

She watched the board order the questioners, the familiar names of critics gradually congealing on the giant overhead glass. First on the list was John Omura. Vela relaxed slightly. Omura was the epitome of a centrist; he rarely awarded scores above 8 or below 5. And Vela had received far more of the former than the latter from Omura; the thought had crossed Vela’s mind before that perhaps Omura was favorably disposed towards her on account of their shared Japanese heritage. Whatever the reason for it, Vela thought, in her current position, she would happily take any favoritism she could get.

Her optimism was misplaced.

“We did not like this novel,” Omura said, leaning into the microphone. He has very long arms, Vela observed, not for the first time.

“Stylistically, it parses well, Ms. Hirasawa. We believe you pay homage to Henry Miller. That much is okay with us. But the content represents a marked departure from expectations; it deviates not only from your previous work, but also from the precis you submitted in advance of this meeting. We know you have been proxywriting for a long time, Ms. Hirasawa; when you began, it was customary to make heavy use of the element of surprise. But the fashion has changed. Now, subtlety is more highly valued. We do not think that many proxyreaders will enjoy what you do in this book. A sudden twist of this magnitude… Frankly, we are disappointed. The overall rating is a 2.”

Gasps came from around the room.

Vela stared at the critic, bewildered. What did Omura mean, a sudden twist? There was no twist to speak of in Tropic of Zamza. If there was anything at all to distinguish the book from her previous novels, it was how little actually happened. Tropic was very much a character study, whereas the previous installments—most notably, Cedric Zamza’s Lost Fortnight—had occasionally strayed into swashbuckling adventure territory.

“I don’t understand,” she said.

“Is this performance art, Ms. Hirasawa? What is there to not understand? I did not appreciate your little joke. From the looks of the board, neither did most of my colleagues.”

“But,” Vela objected, “but you loved the last Zamza novel. You praised ‘the smoothly flowing dialogue, the intricately crafted plot, and the fascinating love triangle’. You called it a landmark of the microgenre. You gave it a score of 8, which, as everyone in the room knows, is everyone else’s 10. If we look at the Turolev prediction for your proxyreview”—she pointed to the small numbers on the board, just below the words JOHN OMURA: 2—“we see a 6.8, with a 1-point margin of error. A 2 is an inexplicably large deviation. So large I question the sincerity of your review.”

Omura didn’t flinch, but his cheeks turned visibly red at the last comment. His posture stiffened and straightened, as if he were being pulled skyward by a crane.

“Ms. Hirasawa, your behavior here is an affront to LitComm. I assure you that my review is completely sincere. My concerns lie entirely with the hearts and minds of the citizens of our nation. On an average day, a typical American proxyreads about 80 novels—yet in the same day, over 60,000 new novels are released. Our job here at LitComm is to safeguard our citizens’ psyches. To educate, entertain, and enlighten, as the saying goes. Your work does none of these things. Quite the opposite.”

“But the Turolev prediction,” Vela insisted. “It can’t be that far—”

“The Turolev prediction model, as you doubtlessly know, Ms. Hirasawa, is trained on a finite corpus of novels previously submitted for LitComm review. I can state with some measure of confidence that no LitComm author in good standing has previously submitted for our consideration a novel in which the main character is murdered in cold blood, with no explanation, and without so much as a hint of foreshadowing, two-thirds of the way through the book.”

A murmur went through the audience at this revelation. Many of the critics hadn’t yet manually read enough of the novel—or even of their own proxywritten reviews—to dig up this rather salient nugget of information.

Omura, for his part, kept on talking. But Vela heard none of it; she heard only Omura’s words echoing through her head: the main character, murdered in cold blood.

That’s not right, she thought. That can’t be right. There’s no way I did that. I can’t have.

Blood rushing in her veins, Vela pulled up her LitComm submissions folder. She scanned down the list until she came to a subfolder named Tropic of Zamza, and noted with dismay that it wasn’t highlighted in red. She hadn’t submitted it for review.

Immediately below Tropic of Zamza, however—in type that glowed a brilliant blood red—was another folder titled Tropic of Zamza the Zamzarian: Being The Late Night Novelistic Ramblings of A ProxyWriter Undergoing Some Things FOR PRIVATE USE ONLY — DO NOT SUBMIT.

Vela collapsed against the podium, and all hell broke loose in the great hall.

* * *

Book review: Tropic of Zamza

Tropic of Zamza is only a book. It contains many words—92,581 of them, to be exact—but it is, mercifully, only a book. Being only a book, it lacks the capacity to physically injure you. You should remind yourself of this fact regularly, in the event that you make the horrible mistake of reading it.

Tropic of Zamza is, I hasten to clarify, not exactly a terrible book. Being merely terrible would be a considerable improvement. Tropic of Zamza is inexplicably bad—with any amount of emphasis you care to place on inexplicable. Few things make as little sense to me as this novel. Even calling it a novel is generous. It’s a work of prose seemingly written with the express purpose of infuriating the reader. For the first 260 pages, it’s a decent enough bit of proxywritten popcorn. Your proxyreader won’t even notice it’s reading. If you’ve read any of the previous Zamza novels, it may not even notice it’s sleeping. There’s a dull but not entirely unpleasant tedium to the first 260 pages: characters get married, shoot each other, and get divorced—sometimes in that order. Empires rise and fall in the blink of an eye—well, one empire, and to be fair, it’s a very large eye. This book is bog-standard mid-decade Fiefdom Americana. It drags on in places, and by page 260, your proxyreader might be feeling a bit restless. But for all that, the experience is, according to my proxyreader, not wholly unpleasant.

 But then a bizarre thing happens. I don’t normally divulge critical plot elements in my book reviews, but I’ll make an exception in this case, because I want to make sure you never feel any urge to read this book. On page 261, the protagonist—by which I mean, the main character; the person the whole book is about, or so you thought, until page 261—is senselessly and brutally murdered by robots. I use the word “senselessly” in its rawest, most literal sense: there is exactly zero connection between this event and anything that came before it. I also use the word “brutally” in its literal sense: the robots tear the protagonist apart, piece by piece. There is a grotesque, drawn out, three-page description of the precise sequence of tearing. Once the tearing is completed, the robots disappear. No explanation is given as to where they came from or where they’ve departed to. There are no other robots in the story either before or after the murder of Cedric Zamza. There are no high technologists, mad scientists, or artificial intelligences to blame for this break with reality. Even the very use of robots is anachronistic; outside of this one anomalous event, the level of technology on display throughout the novel is roughly Victorian in its sophistication.

No effort is made to reconcile or explain this series of events. After the murder, the book continues on for another 70 unapologetic pages, the last 40 of which describe the protagonist’s funeral in excruciating detail, including a comprehensive catalog of all the things the people who’ve come to pay their respects to the protagonist say to each other in their moment of desolation. This part of the book might actually be the part I like most—if one can describe temporarily regaining the will to exist as “liking”—because it adopts the style of the inimitable pre-proxy Latin American writer Gabriel Garcia-Marquez; which is to say, all of the funeral dialogue—involving some eighteen or twenty people—is written as a single glorious, uninterrupted sentence.

I cannot recommend Tropic of Zamza to most proxyreaders. It is a disturbing novel with no redeeming qualities; if you find yourself short on reading material, I would suggest feeding your proxyreader pictures of cereal boxes instead. I give it a CommStar rating of 2/10. I would have given it a lower score if I could—perhaps even a negative one—but my proxyreader thought the book had several redeeming qualities. And since I write over 50 reviews every day, and can’t possibly be expected to proofread my own reviews, let alone make sure that none of the words they contain deviate from my philistine sensibilities—sensibilities that lack any capacity whatsoever to appreciate a third-order Gibsonian narrative twist, or several layers of self-referential satire, or character development so delicate and refined it frankly deserves to henceforth be called Hirasawan—since I can’t be expected to read my own reviews, I shouldn’t be surprised when my proxywriter takes matters into its own hands, and just for once, just for one brief shining moment in its miserable, lonely existence, writes a review that speaks truth to power. I shouldn’t be surprised; I shouldn’t be surprised; I shouldn’t be surprised. I’m a puny human who deserves what I get. I’m never going to read this review I will momentarily sign my name to, and neither will you. Oh how I long for sweet oblivion.

—John Omura

Induction is not optional (if you’re using inferential statistics): reply to Lakens

A few months ago, I posted an online preprint titled The Generalizability Crisis. Here’s the abstract:

Most theories and hypotheses in psychology are verbal in nature, yet their evaluation overwhelmingly relies on inferential statistical procedures. The validity of the move from qualitative to quantitative analysis depends on the verbal and statistical expressions of a hypothesis being closely aligned—that is, that the two must refer to roughly the same set of hypothetical observations. Here I argue that most inferential statistical tests in psychology fail to meet this basic condition. I demonstrate how foundational assumptions of the “random effects” model used pervasively in psychology impose far stronger constraints on the generalizability of results than most researchers appreciate. Ignoring these constraints dramatically inflates false positive rates and routinely leads researchers to draw sweeping verbal generalizations that lack any meaningful connection to the statistical quantities they’re putatively based on. I argue that failure to consider generalizability from a statistical perspective lies at the root of many of psychology’s ongoing problems (e.g., the replication crisis), and conclude with a discussion of several potential avenues for improvement.

I submitted the paper to Behavioral and Brain Sciences, and recently received 6 (!) generally positive reviews. I’m currently in the process of revising the manuscript in response to a lot of helpful feedback (both from the BBS reviewers and a number of other people). In the interim, however, I’ve decided to post a response to one of the reviews that I felt was not helpful, and instead has had the rather unfortunate effect of derailing some of the conversation surrounding my paper.

The review in question is by Daniel Lakens, who, in addition to being one of the BBS reviewers, also posted his review publicly on his blog. While I take issue with the content of Lakens’s review, I’m a fan of open, unfiltered commentary, so I appreciate Daniel taking the time to share his thoughts, and I’ve done the same here. In the rather long piece that follows, I argue that Lakens’s criticisms of my paper stem from an incoherent philosophy of science, and that once we amend that view to achieve coherence, it becomes very clear that his position doesn’t contradict the argument laid out in my paper in any meaningful way—in fact, if anything, the former is readily seen to depend on the latter.

Lakens makes five main points in his review. My response also has five sections, but I’ve moved some arguments around to give the post a better flow. I’ve divided things up into two main criticisms (mapping roughly onto Lakens’s points 1, 4, and 5), followed by three smaller ones you should probably read only if you’re entertained by petty, small-stakes academic arguments.

Bad philosophy

Lakens’s first and probably most central point can be summarized as a concern with (what he sees as) a lack of philosophical grounding, resulting in some problematic assumptions. Lakens argues that my paper fails to respect a critical distinction between deduction and induction, and consequently runs aground by assuming that scientists (or at least, psychologists) are doing induction when (according to Lakens) they’re doing deduction. He suggests that my core argument—namely, that verbal and statistical hypotheses have to closely align in order to support sensible inference—assumes a scientific project quite different from what most psychologists take themselves to be engaged in.

In particular, Lakens doesn’t think that scientists are really in the business of deriving general statements about the world on the basis of specific observations (i.e., induction). He thinks science is better characterized as a deductive enterprise, where scientists start by positing a particular theory, and then attempt to test the predictions they wring out of that theory. This view, according to Lakens, does not require one to care about statistical arguments of the kind laid out in my paper. He writes:

Yarkoni incorrectly suggests that “upon observing that a particular set of subjects rated a particular set of vignettes as more morally objectionable when primed with a particular set of cleanliness-related words than with a particular set of neutral words, one might draw the extremely broad conclusion that ‘cleanliness reduces the severity of moral judgments'”. This reverses the scientific process as proposed by Popper, which is (as several people have argued, see below) the dominant approach to knowledge generation in psychology. The authors are not concluding that “cleanliness reduces the severity of moral judgments” from their data. This would be induction. Instead, they are positing that “cleanliness reduces the severity of moral judgments”, they collected data and performed and empirical test, and found their hypothesis was corroborated. In other words, the hypothesis came first. It is not derived from the data – the hypothesis is what led them to collect the data.

Lakens’s position is that theoretical hypotheses are not inferred from the data in a bottom-up, post-hoc way—i.e., by generalizing from finite observations to a general regularity—rather, they’re formulated in advance of the data, which is then only used to evaluate the tenability of the theoretical hypothesis. This, in his view, is how we should think about what psychologists are doing—and he credits this supposedly deductivist view to philosophers of science like Popper and Lakatos:

Yarkoni deviates from what is arguably the common approach in psychological science, and suggests induction might actually work: “Eventually, if the effect is shown to hold when systematically varying a large number of other experimental factors, one may even earn the right to summarize the results of a few hundred studies by stating that “cleanliness reduces the severity of moral judgments””. This approach to science flies right in the face of Popper (1959/2002, p. 10), who says: “I never assume that we can argue from the truth of singular statements to the truth of theories. I never assume that by force of ‘verified’ conclusions, theories can be established as ‘true’, or even as merely ‘probable’.”

Similarly, Lakatos (1978, p. 2) writes: “One can today easily demonstrate that there can be no valid derivation of a law of nature from any finite number of facts; but we still keep reading about scientific theories being proved from facts. Why this stubborn resistance to elementary logic?” I am personally on the side of Popper and Lakatos, but regardless of my preferences, Yarkoni needs to provide some argument his inductive approach to science has any possibility of being a success, preferably by embedding his views in some philosophy of science. I would also greatly welcome learning why Popper and Lakatos are wrong. Such an argument, which would overthrow the dominant model of knowledge generation in psychology, could be impactful, although a-priori I doubt it will be very successful.

For reasons that will become clear shortly, I think Lakens’s appeal to Popper and Lakatos here is misguided—those philosophers’ views actually have very little resemblance to the position Lakens stakes out for himself. But let’s start with the distinction Lakens draws between induction and deduction, and the claim that the latter provides an alternative to the former—i.e., that psychologists can avoid making inductive claims if they simply construe what they’re doing as a form of deduction. While this may seem like an intuitive claim at first blush, closer inspection quickly reveals that, far from psychologists having a choice between construing the world in deductive versus inductive terms, they’re actually forced to embrace both forms of reasoning, working in tandem.

There are several ways to demonstrate this, but since Lakens holds deductivism in high esteem, we’ll start out from a strictly deductive position, and then show why our putatively deductive argument eventually requires us to introduce a critical inductive step in order to make any sense out of how contemporary psychology operates.

Let’s start with the following premise:

P1: If theory T is true, we should confirm prediction P

Suppose we want to build a deductively valid argument that starts from the above premise, which seems pretty foundational to hypothesis-testing in psychology. How can we embed P1 into a valid syllogism, so that we can make empirical observations (by testing P) and then update our belief in theory T? Here’s the most obvious deductively valid way to complete the syllogism:

P1: If theory T is true, we should confirm prediction P
P2: We fail to confirm prediction P
C: Theory T is false

So stated, this modus tollens captures the essence of “naive” Popperian falsificationism: what scientists do (or ought to do) is attempt to disprove their hypotheses. On this view, if a theory T legitimately entails P, then disconfirming P is sufficient to falsify T. Once that’s done, a scientist can just pack it up and happily move on to the next theory.

Unfortunately, this account, while intuitive and elegant, fails miserably on the reality front. It simply isn’t how scientists actually operate. The problem, as Lakatos famously pointed out, is that the “core“ of a theory T never strictly entails a prediction P by itself. There are invariably other auxiliary assumptions and theories that need to hold true in order for the T → P conditional to apply. For example, observing that people walk more slowly out of a testing room after being primed with old age-related words than with youth-related words doesn’t provide any meaningful support for a theory of social priming unless one is willing to make a large number of auxiliary assumptions—for example, that experimenter knowledge doesn’t inadvertently bias participants; that researcher degrees of freedom have been fully controlled in the analysis; that the stimuli used in the two conditions don’t differ in some irrelevant dimension that can explain the subsequent behavioral change; and so on.

This “sophisticated falsificationism”, as Lakatos dubbed it, is the viewpoint that I gather Lakens thinks most psychologists implicitly subscribe to. And Lakens believes that the deductive nature of the reasoning articulated above is what saves psychologists from having to worry about statistical notions of generalizability.

Unfortunately, this is wrong. To see why, we need only observe that the Popperian and Lakatosian views frame their central deductive argument in terms of falsificationism: researchers can disprove scientific theories by failing to confirm predictions, but—as the Popper statement Lakens approvingly quotes suggests—they can’t affirmatively prove them. This constraint isn’t terribly problematic in heavily quantitative scientific disciplines where theories often generate extremely specific quantitative predictions whose failure would be difficult to reconcile with those theories’ core postulates. For example, Einstein predicted the gravitational redshift of light in 1907 on the basis of his equivalence principle, yet it took nearly 50 years to definitively confirm that prediction via experiment. At the time it was formulated, Einstein’s prediction would have made no sense except in light of the equivalence principle—so the later confirmation of the prediction provided very strong corroboration of the theory (and, by the same token, a failure to experimentally confirm the existence of redshift would have dealt general relativity a very serious blow). Thus, at least in those areas of science where it’s possible to extract extremely “risky“ predictions from one’s theories (more on that later), it seems perfectly reasonable to proceed as if critical experiments can indeed affirmatively corroborate theories—even if such a conclusion isn’t strictly deductively valid.

Hardly any psychologists, however, actually operate this way. As Paul Meehl pointed out in his seminal contrast of standard operating procedures in physics and psychology (Meehl, 1967), psychologists almost never make predictions whose disconfirmation would plausibly invalidate theories. Rather, they typically behave like confirmationists, concluding, on the basis of empirical confirmation of predictions, that their theories are supported (or corroborated). But this latter approach has a logic quite different from the (valid) falsificationist syllogism we saw above. The confirmationist logic that pervades psychology is better represented as follows:

P1: If theory T is true, we should confirm prediction P
P2: We confirm prediction P
C: Theory T is true

C would be a really nice conclusion to draw, if we were entitled to it, because, just as Lakens suggests, we would then have arrived at a way to deduce general theoretical statements from finite observations. Quite a trick indeed. But it doesn’t work; the argument is deductively invalid. If it’s not immediately clear to you why, consider the following argument, which has exactly the same logical structure:

Argument 1
P1: If God loves us all, the sky should be blue
P2: The sky is blue
C: God loves us all

We are not concerned here with the truth of the two premises, but only with the validity of the argument as a whole. And the argument is clearly invalid. Even if we were to assume P1 and P2, C still wouldn’t follow. Observing that the sky is blue (clearly true) doesn’t entail that God loves us all, even if P1 happens to be true, because there could be many other reasons the sky is blue that don’t involve God in any capacity (including, say, differential atmospheric scattering of different wavelengths of light), none of which are precluded by the stated premises.

Now you might want to say, well, sure, but Argument 1 is patently absurd, whereas the arguments Lakens attributes to psychologists are not nearly so silly. But from a strictly deductive standpoint, the typical logic of hypothesis testing in psychology is exactly as silly. Compare the above argument with a running example Lakens (following my paper) uses in his review:

Argument 2
P1: If the theory that cleanliness reduces the severity of moral judgments is true, we should observe condition A > condition B, p < .05
P2: We observe condition A > condition B, p < .05
C: Cleanliness reduces the severity of moral judgments

Subjectively, you probably find this argument much more compelling than the God-makes-the-sky-blue version in Argument 1. But that’s because you’re thinking about the relative plausibility of P1 in the two cases, rather than about the logical structure of the argument. As a purportedly deductive argument, Argument 2 is exactly as bad as Argument 1, and for exactly the same reason: it affirms the consequent. C doesn’t logically follow from P1 and P2, because there could be any number of other potential premises (P3…Pk) that reflect completely different theories yet allow us to derive exactly the same prediction P.
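If you prefer to see the invalidity checked mechanically rather than by analogy, here is a small truth-table enumeration (my own illustration; neither the paper nor Lakens’s review involves any code). It finds the assignment that sinks the argument: both premises true, conclusion false.

```python
# Truth-table check that the form of Argument 2 -- affirming the consequent --
# is deductively invalid: some assignment makes both premises true and the
# conclusion false.
from itertools import product

for theory_true, prediction_confirmed in product([True, False], repeat=2):
    p1 = (not theory_true) or prediction_confirmed  # "if T is true, we confirm P"
    p2 = prediction_confirmed                       # "we confirm P"
    conclusion = theory_true                        # "therefore T is true"
    if p1 and p2 and not conclusion:
        print(f"Counterexample: T={theory_true}, P={prediction_confirmed}")
```

The counterexample it prints (T false, P true) is precisely the “other reasons the sky is blue” scenario from Argument 1.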

This propensity to pass off deductively nonsensical reasoning as good science is endemic to psychology (and, to be fair, many other sciences). The fact that the confirmation of most empirical predictions in psychology typically provides almost no support for the theories those predictions are meant to test does not seem to deter researchers from behaving as if affirmation of the consequent is a deductively sound move. As Meehl rather colorfully wrote all the way back in 1967:

In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network.

Meehl was hardly alone in taking a dim view of the kind of argument we find in Argument 2, and which Lakens defends as a perfectly respectable “deductive” way to do psychology. Lakatos—the very same Lakatos that Lakens claims he “is on the side of”—was no fan of it either. Lakatos generally had very little to say about psychology, and it seems pretty clear (at least to me) that his views about how science works were rooted primarily in consideration of natural sciences like physics. But on the few occasions that he did venture an opinion about the “soft” sciences, he made it abundantly clear that he was not a fan. From Lakatos (1970):

This requirement of continuous growth … hits patched-up, unimaginative series of pedestrian ‘empirical’ adjustments which are so frequent, for instance, in modern social psychology. Such adjustments may, with the help of so-called ‘statistical techniques’, make some ‘novel’ predictions and may even conjure up some irrelevant grains of truth in them. But this theorizing has no unifying idea, no heuristic power, no continuity. They do not add up to a genuine research programme and are, on the whole, worthless1.

If we follow that footnote 1 after “worthless”, we find this:

After reading Meehl (1967) and Lykken (1968) one wonders whether the function of statistical techniques in the social sciences is not primarily to provide a machinery for producing phoney corroborations and thereby a semblance of “scientific progress” where, in fact, there is nothing but an increase in pseudo-intellectual garbage. … It seems to me that most theorizing condemned by Meehl and Lykken may be ad hoc3. Thus the methodology of research programmes might help us in devising laws for stemming this intellectual pollution …

By ad hoc3, Lakatos means that social scientists regularly explain anomalous findings by concocting new post-hoc explanations that may generate novel empirical predictions, but don’t follow in any sensible way from the “positive heuristic” of a theory (i.e., the set of rules and practices that describe in advance how a researcher ought to interpret and respond to discrepancies). Again, here’s Lakatos:

In fact, I define a research programme as degenerating even if it anticipates novel facts but does so in a patched-up development rather than by a coherent, pre-planned positive heuristic. I distinguish three types of ad hoc auxiliary hypotheses: those which have no excess empirical content over their predecessor (‘ad hoc1’), those which do have such excess content but none of it is corroborated (‘ad hoc2’) and finally those which are not ad hoc in these two senses but do not form an integral part of the positive heuristic (‘ad hoc3’). … Some of the cancerous growth in contemporary social ‘sciences’ consists of a cobweb of such ad hoc3 hypotheses, as shown by Meehl and Lykken.

The above quotes are more or less the extent of what Lakatos had to say about psychology and the social sciences in his published work.

Now, I don’t claim to be able to read the minds of deceased philosophers, but in view of the above, I think it’s safe to say that Lakatos probably wouldn’t have appreciated Lakens claiming to be “on his side”. If Lakens wants to call the kind of view that considers Argument 2 a good way to do empirical science “deduction”, fine; but I’m going to refer to it as Lakensian deductivism from here on out, because it’s not deductivism in any sense that approximates the normal meaning of the word “deductive” (I mean, it’s actually deductively invalid!), and I suspect Popper, Lakatos, and Meehl might have politely (or maybe not so politely) asked Lakens to cease and desist from implying that they approve of, or share, his views.

Induction to the rescue

So far, things are not looking so good for a strictly deductive approach to psychology. If we follow Lakens in construing deduction and induction as competing philosophical worldviews, and insist on banishing any kind of inductive reasoning from our inferential procedures, then we’re stuck facing up to the fact that virtually all hypothesis testing done by psychologists is actually deductively invalid, because it almost invariably has the logical form captured in Argument 2. I think this is a rather unfortunate outcome, if you happen to be a proponent of a view that you’re trying to convince people merits the label “deduction”.

Fortunately, all is not lost. It turns out that there is a way to turn Argument 2 into a perfectly reasonable basis for doing empirical science of the psychological variety. Unfortunately for Lakens, it runs directly through the kinds of arguments laid out in my paper. To see that, let’s first observe that we can turn the logically invalid Argument 2 into a valid syllogism by slightly changing the wording of P1:

Argument 3
P1: If, and only if, cleanliness reduces the severity of moral judgments, we should find that condition A > condition B, p < .05
P2: We find that condition A > condition B, p < .05
C: Cleanliness reduces the severity of moral judgments

Notice the newly added words “and only if” in P1. They make all the difference! If we know that the prediction P can only be true if theory T is correct, then observing P does in fact allow us to deductively conclude that T is correct. Hooray!

Well, except that this little modification, which looks so lovely on paper, doesn’t survive contact with reality, because in psychology, it’s almost never the case that a given prediction could only have plausibly resulted from one’s favorite theory. Even if you think P1 is true in Argument 2 (i.e., the theory really does make that prediction), it’s clearly false in our updated Argument 3. There are lots of other reasons why we might observe the predicted result, p < .05, even if the theoretical hypothesis is false (i.e., if cleanliness doesn’t reduce the severity of moral judgment). For example, maybe the stimuli in condition A differ on some important but theoretically irrelevant dimension from those in B. Or maybe there are demand characteristics that seep through to the participants despite the investigators’ best efforts. Or maybe the participants interpret the instructions in some unexpected way, leading to strange results. And so on.

Still, we’re on the right track. And we can tighten things up even further by making one last modification: we replace our biconditional P1 above with the following probabilistic version:

Argument 4
P1: It’s unlikely that we would observe A > B, p < .05, unless cleanliness reduces the severity of moral judgments
P2: We observe A > B, p < .05
C1: It’s probably true that cleanliness reduces the severity of moral judgments

Some logicians might quibble with Argument 4, because replacing words like “all” and “only” with words like “probably” and “unlikely” requires some careful thinking about the relationship between logical and probabilistic inference. But we’ll ignore that here. Whatever modifications you need to make to enable your logic to handle probabilistic statements, I think the above is at least a sensible way for psychologists to proceed when testing hypotheses. If it’s true that the predicted result is unlikely unless the theory is true, and we confirm the prediction, then it seems reasonable to assert (with full recognition that one might be wrong) that the theory is probably true.

But now the other shoe drops. Because even if we accept that Argument 4 is (for at least some logical frameworks) valid, we still need to show that it’s sound. And soundness requires the updated P1 to be true. If P1 isn’t true, then the whole enterprise falls apart again; nobody is terribly interested in scientific arguments that are logically valid but empirically false. We saw that P1 in Argument 2 was uncontroversial, but was embedded in a logically invalid argument. And conversely, P1 in Argument 3 was embedded in a logically valid argument, but was clearly indefensible. Now we’re suggesting that P1 in Argument 4, which sits somewhere in between Argument 2 and Argument 3, manages to capture the strengths of both of the previous arguments, while avoiding their weaknesses. But we can’t just assert this by fiat; it needs to be demonstrated somehow. So how do we do that?
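To get a feel for what is at stake, here is a quick back-of-the-envelope Bayesian reading of Argument 4. This is my own gloss rather than anything in the paper or the review, and every number in it is invented purely for illustration; the point is simply that how much credence the conclusion deserves depends entirely on how unlikely the predicted result would be if the theory were false, which is exactly what P1 asserts.

```python
# Illustrative only: Argument 4 earns its keep when P1 is true, and collapses
# when P1 is false. All probabilities below are made up for the example.

def posterior_prob_theory(prior_t, p_result_given_t, p_result_given_not_t):
    """P(theory true | predicted result observed), via Bayes' rule."""
    joint_t = prior_t * p_result_given_t
    joint_not_t = (1 - prior_t) * p_result_given_not_t
    return joint_t / (joint_t + joint_not_t)

prior = 0.5  # agnostic prior on the verbal theory T

# If P1 holds -- the result really is unlikely unless T is true -- then
# confirming the prediction is strong evidence for T:
print(posterior_prob_theory(prior, 0.8, 0.05))  # ~0.94

# If confounds, demand characteristics, etc. make the same result fairly
# likely even when T is false, the same "confirmation" barely moves the needle:
print(posterior_prob_theory(prior, 0.8, 0.60))  # ~0.57
```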

The banal answer is that, at this point, we have to start thinking about the meanings of the words contained in P1, and not just about the logical form of the entire argument. Basically, we need to ask ourselves: is it really true that all other explanations for the predicted statistical result are, in the aggregate, unlikely?

Notice that, whether we like it or not, we are now compelled to think about the meaning of the statistical prediction itself. To evaluate the claim that the result A > B (p < .05) would be unlikely unless the theoretical hypothesis is true, we need to understand the statistical model that generated the p-values in question. And that, in turn, forces us to reason inductively, because inferential statistics is, by definition, about induction. The point of deploying inferential statistics, rather than constraining oneself to merely describing the sampled measurements, is to generalize beyond the observed sample to a broader population. If you want to know whether the predicted p-value follows from your theory, you need to know whether the population your verbal hypothesis applies to is well approximated by the population your statistical model affords generalization to. If it isn’t, then there’s no basis for positing a premise like P1.
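To make the mismatch between populations a bit more tangible, here is a toy simulation in the spirit of the paper’s random-effects argument (my own sketch; the design and all numbers are invented). The verbal hypothesis is about stimuli of a certain kind in general, but the conventional analysis treats the particular stimuli as fixed, so the p-value it produces licenses generalization to new subjects responding to these exact stimuli, and nothing more. Judged against new stimuli as well, the nominal 5% false positive rate is badly inflated.

```python
# Toy demonstration: when stimuli vary idiosyncratically but are treated as
# fixed, "condition" effects that are significant for THESE stimuli routinely
# fail to generalize to the broader population of stimuli. The true condition
# effect below is exactly zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_experiment(n_subjects=30, n_stimuli=8, stim_sd=0.5, noise_sd=1.0):
    # Each condition uses its own sample of stimuli (e.g., different word lists),
    # and each stimulus has its own idiosyncratic effect on the outcome.
    stim_a = rng.normal(0.0, stim_sd, n_stimuli)
    stim_b = rng.normal(0.0, stim_sd, n_stimuli)
    # Subject-by-stimulus responses, averaged into one score per subject.
    a = rng.normal(stim_a, noise_sd, (n_subjects, n_stimuli)).mean(axis=1)
    b = rng.normal(stim_b, noise_sd, (n_subjects, n_stimuli)).mean(axis=1)
    # Conventional analysis: t-test on subject means, stimuli treated as fixed.
    return stats.ttest_ind(a, b).pvalue

pvals = np.array([one_experiment() for _ in range(2000)])
print("Nominal alpha: 0.05; observed false positive rate:", (pvals < 0.05).mean())
```

The model behind those p-values affords generalization over subjects, not stimuli; the verbal hypothesis quietly assumes both.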

Once we’ve accepted this much—and to be perfectly blunt about it, if you don’t accept this much, you probably shouldn’t be using inferential statistics in the first place—then we have no choice but to think carefully about the alignment between our verbal and statistical hypotheses. Is P1 in Argument 4 true? Is it really the case that observing A > B, p < .05, would be unlikely unless cleanliness reduces the severity of moral judgments? Well that depends. What population of hypothetical observations does the model that generates the p-value refer to? Does it align with the population implied by the verbal hypothesis?

This is the critical question one must answer, and there’s no way around it. One cannot claim, as Lakens tries to, that psychologists don’t need to worry about inductive inference, because they’re actually doing deduction. Induction and deduction are not in opposition here; they’re actually working in tandem! Even if you agree with Lakens and think that the overarching logic guiding psychological hypothesis testing is of the deductive form expressed in Argument 4 (as opposed to the logically invalid form in Argument 2, as Meehl suggested), you still can’t avoid the embedded inductive step captured by P1, unless you want to give up the use of inferential statistics entirely.

The bottom line is that Lakens—and anyone else who finds the flavor of so-called deductivism he advocates appealing—faces a dilemma on two horns. One way to deal with the fact that Lakensian deductivism is in fact deductively invalid is to lean into it and assert that, logic notwithstanding, this is just how psychologists operate, and the important thing is not whether or not the logic makes deductive sense if you scrutinize it closely, but whether it allows people to get on with their research in a way they’re satisfied with.

The upside of such a position is that it allows you to forever deflect just about any criticism of what you’re doing simply by saying “well, the theory seems to me to follow from the prediction I made“. The downside—and it’s a big one, in my opinion—is that science becomes a kind of rhetorical game, because at that point there’s pretty much nothing anybody else can say to disabuse you of the belief that you’ve confirmed your theory. The only thing that’s required is that the prediction make sense to you (or, if you prefer, to you plus two or three reviewers). A secondary consequence is that it also becomes impossible to distinguish the kind of allegedly scientific activity psychologists engage in from, say, postmodern scholarship, so a rather unwelcome conclusion of taking Lakens’s view seriously is that we may as well extend the label science to the kind of thing that goes on in journals like Social Text. Maybe Lakens is okay with this, but I very much doubt that this is the kind of worldview most psychologists want to commit themselves to.

The more sensible alternative is to accept that the words and statistics we use do actually need to make contact with a common understanding of reality if we’re to be able to make progress. This means that when we say things like “it’s unlikely that we would observe a statistically significant effect here unless our theory is true”, evaluation of such a statement requires that one be able to explain, and defend, the relationship between the verbal claims and the statistical quantities on which the empirical support is allegedly founded.

The latter, rather weak, assumption—essentially, that scientists should be able to justify the premises that underlie their conclusions—is all my paper depends on. Once you make that assumption, nothing more depends on your philosophy of science. You could be a Popperian, a Lakatosian, an inductivist, a Lakensian, or an anarchist… It really doesn’t matter, because, unless you want to embrace the collapse of science into postmodernism, there’s no viable philosophy of science under which scientists get to use words and statistics in whatever way they like, without having to worry about the connection between them. If you expect to be taken seriously as a scientist who uses inferential statistics to draw conclusions from empirical data, you’re committed to caring about the relationship between the statistical models that generate your p-values and the verbal hypotheses you claim to be testing. If you find that too difficult or unpleasant, that’s fine (I often do too!); you can just drop the statistics from your arguments, and then it’s at least clear to people that your argument is purely qualitative, and shouldn’t be accorded the kind of reception we normally reserve (fairly or unfairly) for quantitative science. But you don’t get to claim the prestige and precision that quantitation seems to confer on researchers while doing none of the associated work. And you certainly can’t avoid doing that work simply by insisting that you’re doing a weird, logically fallacious, kind of “deduction“.

Unfair to severity

Lakens’s second major criticism is that I’m too hard on the notion of severity. He argues that I don’t give the Popper/Meehl/Mayo risky prediction/severe testing school of thought sufficient credit, and that it provides a viable alternative to the kind of position he takes me to be arguing for. Lakens makes two main points, which I’ll dub Severity I and Severity II.

Severity I

First, Lakens argues that my dismissal of risky or severe tests as a viable approach in most of psychology is unwarranted. I’ll quote him at length here, because the core of his argument is embedded in some other stuff, and I don’t want to be accused of quoting out of context (note that I did excise one part of the quote, because I deal with it separately below):

Yarkoni’s criticism on the possibility of severe tests is regrettably weak. Yarkoni says that “Unfortunately, in most domains of psychology, there are pervasive and typically very plausible competing explanations for almost every finding.” From his references (Cohen, Lykken, Meehl) we can see he refers to the crud factor, or the idea that the null hypothesis is always false. As we recently pointed out in a review paper on crud (Orben & Lakens, 2019), Meehl and Lykken disagreed about the definition of the crud factor, the evidence of crud in some datasets can not be generalized to all studies in pychology, and “The lack of conceptual debate and empirical research about the crud factor has been noted by critics who disagree with how some scientists treat the crud factor as an “axiom that needs no testing” (Mulaik, Raju, & Harshman, 1997).”. Altogether, I am very unconvinced by this cursory reference to crud makes a convincing point that “there are pervasive and typically very plausible competing explanations for almost every finding”. Risky predictions seem possible, to me, and demonstrating the generalizability of findings is actually one way to perform a severe test.

When Yarkoni discusses risky predictions, he sticks to risky quantitative predictions. As explained in Lakens (2020), “Making very narrow range predictions is a way to make it statistically likely to falsify your prediction if it is wrong. But the severity of a test is determined by all characteristics of a study that increases the capability of a prediction to be wrong, if it is wrong. For example, by predicting you will only observe a statistically significant difference from zero in a hypothesis test if a very specific set of experimental conditions is met that all follow from a single theory, it is possible to make theoretically risky predictions.” … It is unclear to me why Yarkoni does not think that approaches such as triangulation (Munafò & Smith, 2018) are severe tests. I think these approaches are the driving force between many of the more successful theories in social psychology (e.g., social identity theory), and it works fine.

There are several relatively superficial claims Lakens makes in these paragraphs that are either wrong or irrelevant. I’ll take them up below, but let me first address the central claim, which is that, contrary to the argument I make in my paper, risky prediction in the Popper/Meehl/Mayo sense is actually a viable strategy in psychology.

It’s instructive to note that Lakens doesn’t actually provide any support for this assertion; his argument is entirely negative. That is, he argues that I haven’t shown severity to be impossible. This is a puzzling way to proceed, because the most obvious way to refute an argument of the form “it’s almost impossible to do X“ is to just point to a few garden variety examples where people have, in fact, successfully done X. Yet at no point in Lakens’s lengthy review does he provide any actual examples of severe tests in psychology—i.e., of cases where the observed result would be extremely implausible if the favored theory were false. This omission is hard to square with his insistence that severe testing is a perfectly sensible approach that many psychologists already use successfully. Hundreds of thousands of papers have been published in psychology over the past century; if an advocate of a particular methodological approach can’t identify even a tiny fraction of the literature that has successfully applied that approach, how seriously should that view be taken by other people?

As background, I should note that Lakens’s inability to give concrete examples of severe testing isn’t peculiar to his review of my paper; in various interactions we’ve had over the last few years, I’ve repeatedly asked him to provide such examples. He’s obliged exactly once, suggesting this paper, titled Ego Depletion Is Not Just Fatigue: Evidence From a Total Sleep Deprivation Experiment by Vohs and colleagues.

In the sole experiment Vohs et al. report, they purport to test the hypothesis that ego depletion is not just fatigue (one might reasonably question whether there’s any non-vacuous content to this hypothesis to begin with, but that’s a separate issue). They proceed by directing participants who either have or have not been deprived of sleep to suppress their emotions while viewing disgusting video clips. In a subsequent game, they then ask the same participants to decide (seemingly incidentally) how loud a noise to blast an opponent with—a putative measure of aggression. The results show that participants who suppressed emotion selected louder volumes than those who did not, whereas the sleep deprivation manipulation had no effect.

I leave it as an exercise to the reader to decide for themselves whether the above example is a severe test of the theoretical hypothesis. To my mind, at least, it clearly isn’t; it fits very comfortably into the category of things that Meehl and Lakatos had in mind when discussing the near-total disconnect between verbal theories and purported statistical evidence. There are dozens, if not hundreds, of ways one might obtain the predicted result even if the theoretical hypothesis Vohs et al. articulate were utterly false (starting from the trivial observation that one could obtain the pattern the authors reported even if the two manipulations tapped exactly the same construct but were measured with different amounts of error). There is nothing severe about the test, and to treat it as such is to realize Meehl and Lakatos’s worst fears about the quality of hypothesis-testing in much of psychology.

To be clear, I did not suggest in my paper (nor am I here) that severe tests are impossible to construct in psychology. I simply observed that they’re not a realistic goal in most domains, particularly in “soft“ areas (e.g., social psychology). I think I make it abundantly clear in the paper that I don’t see this as a failing of psychologists, or of their favored philosophy of science; rather, it’s intrinsic to the domain itself. If you choose to study extremely complex phenomena, where any given behavior is liable to be a product of an enormous variety of causal factors interacting in complicated ways, you probably shouldn’t expect to be able to formulate clear law-like predictions capable of unambiguously elevating one explanation above others. Social psychology is not physics, and there’s no reason to think that methodological approaches that work well when one is studying electrons and quarks should also work well when one is studying ego depletion and cognitive dissonance.

As for the problematic minor claims in the paragraphs I quoted above (you can skip down to the “Severity II” section if you’re bored or short on time)… First, the citations to Cohen, Lykken, and Meehl contain well-developed arguments to the same effect as my claim that “there are pervasive and typically very plausible competing explanations for almost every finding”. These arguments do not depend on what one means by “crud”, which is the subject of Orben & Lakens (2019). The only point relevant to my argument is that outcomes in psychology are overwhelmingly determined by many factors, so that it’s rare for a hypothesized effect in psychology to have no plausible explanation other than the authors’ preferred theoretical hypothesis. I think this is self-evidently true, and needs no further justification. But if you think it does require justification, I invite you to convince yourself of it in the following easy steps: (1) Write down 10 or 20 effects, chosen more or less at random, that you feel are a reasonably representative sample of your field. (2) For each one, spend 5 minutes trying to identify alternative explanations for the predicted result that would be plausible even if the researcher’s theoretical hypothesis were false. (3) Observe that you were able to identify plausible confounds for all of the effects you wrote down. There, that was easy, right?

Second, it isn’t true that I insist on risky predictions being quantitative. I explicitly note that risky predictions can be non-quantitative:

The canonical way to accomplish this is to derive from one’s theory some series of predictions—typically, but not necessarily, quantitative in nature—sufficiently specific to that theory that they are inconsistent with, or at least extremely implausible under, other accounts.

I go on to describe several potential non-quantitative approaches (I even cite Lakens!):

This does not mean, however, that vague directional predictions are the best we can expect from psychologists. There are a number of strategies that researchers in such fields could adopt that would still represent at least a modest improvement over the status quo (for discussion, see Meehl, 1990). For example, researchers could use equivalence tests (Lakens, 2017); predict specific orderings of discrete observations; test against compound nulls that require the conjunctive rejection of many independent directional predictions; and develop formal mathematical models that posit non-trivial functional forms between the input and output (Marewski & Olsson, 2009; Smaldino, 2017).
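To make the first of those strategies a bit more concrete, here is a minimal sketch of a two one-sided tests (TOST) equivalence procedure of the kind Lakens (2017) describes. Everything in it is illustrative rather than prescriptive: the data are simulated, the equivalence bounds of ±0.3 are arbitrary, and the helper function is my own, not code from either paper.

```python
import numpy as np
from scipy import stats

def tost_ind(x, y, low, high):
    """Two one-sided tests (TOST) for equivalence of two independent means.
    We conclude 'the true difference lies inside [low, high]' only if both
    one-sided tests reject; the overall p-value is the larger of the two."""
    nx, ny = len(x), len(y)
    diff = x.mean() - y.mean()
    # pooled standard error of the difference in means
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    p_lower = stats.t.sf((diff - low) / se, df)    # tests: difference > low
    p_upper = stats.t.cdf((diff - high) / se, df)  # tests: difference < high
    return max(p_lower, p_upper)

rng = np.random.default_rng(1)
x = rng.normal(0.00, 1.0, 200)  # simulated control group
y = rng.normal(0.05, 1.0, 200)  # simulated treatment group, negligible true effect
print(tost_ind(x, y, low=-0.3, high=0.3))  # small values support equivalence within ±0.3
```

(Libraries like statsmodels also ship ready-made equivalence tests, for anyone who would rather not roll their own.)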

Third, what Lakens refers to as “triangulation” is, as far as I can tell, conceptually akin to the logical conjunction of effects I suggested above, so again, it’s unfair to say that I oppose this idea. I support it—in principle. However, two points are worth noting. First, the practical barrier to treating conjunctive rejections as severe tests is that it requires researchers to actually hold their own feet to the fire by committing ahead of time to the specific conjunction that they deem a severe test. It’s not good enough to state ahead of time that the theory makes 6 predictions, and then, when the data confirm only 4 of the predictions, to generate some post-hoc explanation for the 2 failed predictions while still claiming that the theory managed to survive a critical test.

Second, as we’ve already seen, the mere fact that a researcher believes a test is severe does not actually make it so, and there are good reasons to worry that many researchers grossly underestimate the degree of support a particular statistical procedure (or conjunction of procedures) actually confers on a theory. For example, you might naively suppose that if your theory makes 6 independent directional predictions—implying a probability of 1 in 2^6, or about 1.6%, of getting all 6 right purely by chance—then joint corroboration of all your predictions provides strong support for your theory. But this isn’t generally the case, because many plausible competing accounts in psychology will tend to generate similarly-signed predictions. As a trivial example, when demand characteristics are present, they will typically tend to push in the direction of the researcher’s favored hypotheses.
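If the arithmetic here seems too abstract, a ten-line simulation makes the point. It’s purely illustrative: the number of outcomes, the sample size, and the size of the shared nuisance influence are all numbers I invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_outcomes, n_subjects = 20_000, 6, 50
se = 1 / np.sqrt(n_subjects)  # sampling error of each standardized outcome mean

def p_all_six_confirmed(nuisance_mean, nuisance_sd):
    """Chance that all 6 directional predictions come out 'right' even though
    every theory-specific effect is exactly zero in the simulated world."""
    hits = 0
    for _ in range(n_sims):
        # a single shared nuisance influence (think: demand characteristics)
        # is added to all six outcomes within each simulated study
        nuisance = rng.normal(nuisance_mean, nuisance_sd)
        sample_means = nuisance + rng.normal(0.0, se, n_outcomes)
        hits += np.all(sample_means > 0)
    return hits / n_sims

print(p_all_six_confirmed(0.0, 0.0))  # independent outcomes: ~1/64, i.e., ~0.016
print(p_all_six_confirmed(0.1, 0.2))  # shared, hypothesis-friendly nuisance: far higher than 1/64
```

The exact numbers don’t matter; the point is just that the naive 1-in-64 calculation holds only when the six outcomes are genuinely independent, which in psychology they rarely are.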

The bottom line is that, while triangulation is a perfectly sensible strategy in principle, deploying it in a way that legitimately produces severe tests of psychological theories does not seem any easier than the other approaches I mention—nor, again, does Lakens seem able to provide any concrete examples.

Severity II

Lakens’s second argument regarding severity (or my alleged lack of respect for it) is that I put the cart before the horse: whereas I focus largely on the generalizability of claims made on the basis of statistical evidence, Lakens argues that generalizability is purely an instrumental goal, and that the overarching objective is severity. He writes:

I think the reason most psychologists perform studies that demonstrate the generalizability of their findings has nothing to do with their desire to inductively build a theory from all these single observations. They show the findings generalize, because it increases the severity of their tests. In other words, according to this deductive approach, generalizability is not a goal in itself, but a it follows from the goal to perform severe tests.

And:

Generalization as a means to severely test a prediction is common, and one of the goals of direct replications (generalizing to new samples) and conceptual replications (generalizing to different procedures). Yarkoni might disagree with me that generalization serves severity, not vice versa. But then what is missing from the paper is a solid argument why people would want to generalize to begin with, assuming at least a decent number of them do not believe in induction. The inherent conflict between the deductive approaches and induction is also not explained in a satisfactory manner.

As a purported criticism of my paper, I find this an unusual line of argument, because not only does it not contradict anything I say in my paper, it actually directly affirms it. In effect, Lakens is saying yes, of course it matters whether the statistical model you use maps onto your verbal hypothesis; how else would you be able to formulate a severe test of the hypothesis using inferential statistics? Well, I agree with him! My only objection is that he doesn’t follow his own argument far enough. He writes that “generalization as a means to severely test a prediction is common”, but he’s being too modest. It isn’t just common; for studies that use inferential statistics, it’s universal. If you claim to be using statistical results to test your theoretical hypotheses, you’re obligated to care about the alignment between the universes of observations respectively defined by your verbal and statistical hypotheses. As I’ve pointed out at length above, this isn’t a matter of philosophical disagreement (i.e., of some imaginary “inherent conflict between the deductive approaches and induction”); it’s definitional. Inferential statistics is about generalizing from samples to populations. How could you possibly assert that a statistical test of a hypothesis is severe if you have no idea whether the population defined by your statistical model aligns with the one defined by your verbal hypothesis? Can Lakens provide an example of a severe statistical test that doesn’t require one to think about what population of observations a model applies to? I very much doubt it.

For what it’s worth, I don’t think the severity of hypothesis testing is the only reason to worry about the generalizability of one’s statistical results. We can see this trivially, inasmuch as severity only makes sense in a hypothesis testing context, whereas generalizability matters any time inferential statistics (which make reference to some idealized population) are invoked. If you report a p-value from a linear regression model, I don’t need to know what hypothesis motivated the analysis in order to interpret the results, but I do need to understand what universe of hypothetical observations the statistical model you specified refers to. If Lakens wants to argue that statistical results are uninterpretable unless they’re presented as confirmatory tests of an a priori hypothesis, that’s his prerogative (though I doubt he’ll find many takers for that view). At the very least, though, it should be clear that his own reasoning gives one more, and not less, reason to take the arguments in my paper seriously.

Hopelessly impractical

[Attention conservation notice: the above two criticisms are the big ones; you can safely stop reading here without missing much. The stuff below is frankly more about my irritation at some of Lakens’s rhetorical flourishes than about core conceptual issues.]

A third theme that shows up repeatedly in Lakens’s review is the idea that the arguments I make, while perhaps reasonable from a technical standpoint, are far too onerous to expect real researchers to implement. There are two main strands of argument here. Both of them, in my view, are quite wrong. But one of them is wrong and benign, whereas the other is wrong and possibly malignant.

Impractical I

The first (benign) strand is summarized by Lakens’s Point 3, which he titles theories and tests are not perfectly aligned in deductive approaches. As we’ll see momentarily, “perfectly“ is a bit of a weasel word that’s doing a lot of work for Lakens here. But his general argument is that you only need to care about the alignment between statistical and verbal specifications of a hypothesis if you’re an inductivist:

To generalize from a single observation to a general theory through induction, the sample and the test should represent the general theory. This is why Yarkoni is arguing that there has to be a direct correspondence between the theoretical model, and the statistical test. This is true in induction.

I’ve already spent several thousand words above explaining why this is simply false. To recap (I know I keep repeating myself, but this really is the crux of the whole issue): if you’re going to report inferential statistics and claim that they provide support for your verbal hypotheses, then you’re obligated to care about the correspondence between the test and the theory. This doesn’t require some overarching inductivist philosophy of science (which is fortunate, because I don’t hold one myself); it only requires you to believe that when you make statements of the form “statistic X provides evidence for verbal claim Y“, you should be able to explain why that’s true. If you can’t explain why the p-value (or Bayes Factor, etc.) from that particular statistical specification supports your verbal hypothesis, but a different specification that produces a radically different p-value wouldn’t, it’s not clear why anybody else should take your claims seriously. After all, inferential statistics aren’t (or at least, shouldn’t be) just a kind of arbitrary numerical magic we sprinkle on top of our words to get people to respect us. They mean things. So the alternative to caring about the relationship between inferential statistics and verbal claims is not, as Lakens seems to think, deductivism—it’s ritualism.

The tacit recognition of this point is presumably why Lakens is careful to write that “theories and tests are not perfectly aligned in deductive approaches“ (my emphasis). If he hadn’t included the word “perfectly“, the claim would seem patently silly, since theories and tests obviously need to be aligned to some degree no matter what philosophical view one adopts (save perhaps for outright postmodernism). Lakens’s argument here only makes any sense if the reader can be persuaded that my view, unlike Lakens’, demands perfection. But it doesn’t (more on that below).

Lakens then goes on to address one of the central planks of my argument, namely, the distinction between fixed and random factors (which typically has massive implications for the p-values one observes). He suggests that while the distinction is real, it’s wildly unrealistic to expect anybody to actually be able to respect it:

If I want to generalize beyond my direct observations, which are rarely sampled randomly from all possible factors that might impact my estimate, I need to account for uncertainty in the things I have not observed. As Yarkoni clearly explains, one does this by adding random factors to a model. He writes (p. 7) “Each additional random factor one adds to a model licenses generalization over a corresponding population of potential measurements, expanding the scope of inference beyond only those measurements that were actually obtained. However, adding random factors to one’s model also typically increases the uncertainty with which the fixed effects of interest are estimated”. You don’t need to read Popper to see the problem here – if you want to generalize to all possible random factors, there are so many of them, you will never be able to overcome the uncertainty and learn anything. This is why inductive approaches to science have largely been abandoned.

You don’t need to read Paul Meehl’s Big Book of Logical Fallacies to see that Lakens is equivocating. He equates wanting to generalize beyond one’s sample with wanting to generalize “to all possible random factors“—as if the only two possible interpretations of an effect are that it either generalizes to all conceivable scenarios, or that it can’t be generalized beyond the sample at all. But this just isn’t true; saying that researchers should build statistical models that reflect their generalization intentions is not the same as saying that every mixed-effects model needs to include all variance components that could conceivably have any influence, however tiny, on the measured outcomes. Lakens presents my argument as a statistically pedantic, technically-correct-but-hopelessly-ineffectual kind of view—at which point it’s supposed to become clear to the reader that it’s just crazy to expect psychologists to proceed in the way I recommend. And I agree that it would be crazy—if that was actually what I was arguing. But it isn’t. I make it abundantly clear in my paper that aligning verbal and statistical hypotheses needn’t entail massive expansion of the latter; it can also (and indeed, much more feasibly) entail contraction of the former. There’s an entire section in the paper titled Draw more conservative inferences that begins with this:

Perhaps the most obvious solution to the generalizability problem is for authors to draw much more conservative inferences in their manuscripts—and in particular, to replace the hasty generalizations pervasive in contemporary psychology with slower, more cautious conclusions that hew much more closely to the available data. Concretely, researchers should avoid extrapolating beyond the universe of observations implied by their experimental designs and statistical models. Potentially relevant design factors that are impractical to measure or manipulate, but that conceptual considerations suggest are likely to have non-trivial effects (e.g., effects of stimuli, experimenter, research site, culture, etc.), should be identified and disclosed to the best of authors’ ability.

Contra Lakens, this is hardly an impractical suggestion; if anything, it offers to reduce many authors’ workload, because Introduction and Discussion sections are typically full of theoretical speculations that go well beyond the actual support of the statistical results. My prescription, if taken seriously, would probably shorten the lengths of a good many psychology papers. That seems pretty practical to me.

Moreover—and again contrary to Lakens’s claim—following my prescription would also dramatically reduce uncertainty rather than increasing it. Uncertainty arises when one lacks data to inform one’s claims or beliefs. If maximal certainty is what researchers want, there are few better ways to achieve that than to make sure their verbal claims cleave as closely as possible to the boundaries implicitly defined by their experimental procedures and statistical models, and hence depend on fewer unmodeled (and possibly unknown) variables.

Impractical II

The other half of Lakens’s objection from impracticality is to suggest that, even if the arguments I lay out have some merit from a principled standpoint, they’re of little practical use to most researchers, because I don’t do enough work to show readers how they can actually use those principles in their own research. Lakens writes:

The issues about including random factors is discussed in a more complete, and importantly, applicable, manner in Barr et al (2013). Yarkoni remains vague on which random factors should be included and which not, and just recommends ‘more expansive’ models. I have no idea when this is done satisfactory. This is a problem with extreme arguments like the one Yarkoni puts forward. It is fine in theory to argue your test should align with whatever you want to generalize to, but in practice, it is impossible. And in the end, statistics is just a reasonably limited toolset that tries to steer people somewhat in the right direction. The discussion in Barr et al (2013), which includes trade-offs between converging models (which Yarkoni too easily dismisses as solved by modern computational power – it is not solved) and including all possible factors, and interactions between all possible factors, is a bit more pragmatic.

And:

As always, it is easy to argue for extremes in theory, but this is generally uninteresting for an applied researcher. It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting “more expansive models” – and provides some indication where to stop, or at least suggestions what an empirical research program would look like that tells us where to stop, and why.

And:

Previous authors have made many of the same points, but in a more pragmatic manner (e.g., Barr et al., 2013; Clark, 1974). Yarkoni fails to provide any insights into where the balance between generalizing to everything, and generalizing to factors that matter, should lie, nor does he provide an evaluation of how far off this balance research areas are. It is easy to argue any specific approach to science will not work in theory – but it is much more difficult to convincingly argue it does not work in practice.

There are many statements in Lakens’s review that made me shake my head, but the argument advanced in the above quotes is the only one that filled me (briefly) with rage. That’s partly because parts of what Lakens says here blatantly misrepresent my paper. For example, he writes that Yarkoni “just recommends ‘more expansive’ models”, which is frankly a bit insulting given that I spend a full third of my paper talking about various ways to address the problem (e.g., by designing studies that manipulate many factors at once; by conducting meta-analyses over variance components; etc.).

Similarly, Lakens implies that Barr et al. (2013) gives better versions of my arguments, when actually the two papers are doing completely different things. Barr et al. (2013) is a fantastic paper, but it focuses almost entirely on the question of how one should specify and estimate mixed-effects models, and says essentially nothing about why researchers should think more carefully about random factors, or which ones researchers ought to include in their model. One way to think about it is that Barr et al. (2013) is the paper you should read after my paper has convinced you that it actually matters a lot how you specify your random-effects structure. Of course, if you’re already convinced of the latter (which many people are, though Lakens himself doesn’t seem to be), then yeah, you should maybe skip my paper; you’re not the intended audience.

In any case, the primary reason I found this part of Lakens’s review upsetting is that the above quotes capture a very damaging, but unfortunately also very common, sentiment in psychology, which is the apparent belief that somebody—and perhaps even nature itself—owes researchers easy solutions to extremely complex problems.

Lakens writes that “Yarkoni remains vague on which random factors should be included and which not”, and that “It would be great if Yarkoni could provide something a bit more pragmatic about what to do in practice than his current recommendation about fitting ‘more expansive models’”. Well, on a superficial level, I agree with Lakens: I do remain vague on which factors should be included, and it would be lovely if I were able to say something like “here, Daniel, I’ve helpfully identified for you the five variance components that you need to care about in all your studies”. But I can’t say something like that, because it would be a lie. There isn’t any such one-size-fits-all prescription—and trying to pretend there is would, in my view, be deeply counterproductive. Psychology is an enormous field full of people trying to study a very wide range of complex phenomena. There is no good reason to suppose that the same sources of variance will assume even approximately the same degree of importance across broad domains, let alone individual research questions. Should psychophysicists studying low-level visual perception worry about the role of stimulus, experimenter, or site effects? What about developmental psychologists studying language acquisition? Or social psychologists studying cognitive dissonance? I simply don’t know.

One reason I don’t know, as I explain in my paper, is that the answer depends heavily on what conclusions one intends to draw from one’s analyses—i.e., on one’s generalization intentions. I hope Lakens would agree with me that it’s not my place to tell other people what their goal should be in doing their research. Whether or not a researcher needs to model stimuli, sites, tasks, etc. as random factors depends on what claim they intend to make. If a researcher intends to behave as if their results apply to a population of stimuli like the ones used in their study, and not just to the exact sampled stimuli, then they should use a statistical model that reflects that intention. But if they don’t care to make that generalization, and are comfortable drawing no conclusions beyond the confines of the tested stimuli, then maybe they don’t need to worry about explicitly modeling stimulus effects at all. Either way, what determines whether a statistical model is appropriate is whether that model adequately captures what the researcher claims it’s capturing—not whether Tal Yarkoni has data suggesting that, on average, site effects are large in one area of social psychology but not large in another area of psychophysics.
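To make the contrast concrete, here is a toy sketch of the two modeling choices. The data, effect sizes, and variable names are all invented, and I’m using Python’s statsmodels only because it makes for a self-contained example; in R, lme4’s (1 | site/subject) syntax expresses the second model more directly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: 8 research sites, 25 subjects per site, 10 trials per subject,
# a within-subject two-level condition, and a true condition effect of 0.3.
rng = np.random.default_rng(0)
rows = []
for site in range(8):
    site_fx = rng.normal(0, 0.4)          # stable site effect
    for subj in range(25):
        subj_fx = rng.normal(0, 0.6)      # stable subject effect
        for trial in range(10):
            cond = trial % 2
            y = 0.3 * cond + site_fx + subj_fx + rng.normal(0, 1.0)
            rows.append((site, subj, cond, y))
df = pd.DataFrame(rows, columns=["site", "subject", "condition", "y"])
df["subj_id"] = df["site"].astype(str) + "_" + df["subject"].astype(str)

# Model 1: subjects as the only random factor; site is ignored. The resulting
# inference about 'condition' treats these particular 8 sites as the whole
# universe of interest.
m1 = smf.mixedlm("y ~ condition", df, groups="subj_id").fit()

# Model 2: subjects nested within sites, both treated as random factors, so the
# inference extends to a population of sites like these rather than to these 8 alone.
m2 = smf.mixedlm("y ~ condition", df, groups="site", re_formula="1",
                 vc_formula={"subject": "0 + C(subject)"}).fit()

print(m1.summary())
print(m2.summary())
```

Neither specification is the “right” one in the abstract; which one is appropriate depends entirely on which of those two claims the researcher intends to make.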

The other reason I can’t provide concrete guidance about what factors psychologists ought to model as random is that attempting to establish even very rough generalizations of this sort would involve an enormous amount of work—and the utility of that work would be quite unclear, given how contextually specific the answers are likely to be. Lakens himself seems to recognize this; at one point in his review, he suggests that the topic I address “probably needs a book length treatment to do it justice.“ Well, that’s great, but what are working researchers supposed to do in the meantime? Is the implication that psychologists should feel free to include whatever random effects they do or don’t feel like in their models until such time as someone shows up with a compendium of variance component estimates that apply to different areas of psychology? Does Lakens also dismiss papers seeking to convince people that it’s important to consider statistical power when designing studies, unless those papers also happen to provide ready-baked recommendations for what an appropriate sample size is for different research areas within psychology? Would he also conclude that there’s no point in encouraging researchers to define “smallest effect sizes of interest“, as he himself has done in the past, unless one can provide concrete recommendations for what those numbers should be?

I hope not. Such a position would amount to shooting the messenger. The argument in my paper is that model specification matters, and that researchers need to think about that carefully. I think I make that argument reasonably clearly and carefully. Beyond that, I don’t think it’s my responsibility to spend the next N years of my own life trying to determine what factors matter most in social, developmental, or cognitive psychology, just so that researchers in those fields can say, “thanks, your crummy domain-general estimates are going to save me from having to think deeply about what influences matter in my own particular research domain“. I think it’s every individual researcher’s job to think that through for themselves, if they expect to be taken seriously.

Lastly, and at the risk of being a bit petty (sorry), I can’t resist pointing out what strikes me as a rather serious internal contradiction between Lakens’s claim that my arguments are unhelpful unless they come with pre-baked variance estimates, and his own stated views about severity. On the one hand, Lakens claims that psychologists ought to proceed by designing studies that subject their theoretical hypotheses to severe tests. On the other hand, he seems to have no problem with researchers mindlessly following field-wide norms when specifying their statistical models (e.g., modeling only subjects as random effects, because those are the current norms). I find these two strands of thought difficult to reconcile. As we’ve already seen, the severity of a statistical procedure as a test of a theoretical hypothesis depends on the relationship between the verbal hypothesis and the corresponding statistical specification. How, then, could a researcher possibly feel confident that their statistical procedure constitutes a severe test of their theoretical hypothesis, if they’re using an off-the-shelf model specification and have no idea whether they would have obtained radically different results if they had randomly sampled a different set of stimuli, participants, experimenters, or task operationalizations?

Obviously, they couldn’t. Having to think carefully about what the terms in one’s statistical model mean, how they relate to one’s theoretical hypothesis, and whether those assumptions are defensible, isn’t at all “impractical”; it’s necessary. If you can’t explain clearly why a model specification that includes only subjects as random effects constitutes a severe test of your hypothesis, why would you expect other people to take your conclusions at face value?

Trouble with titles

There’s one last criticism Lakens raises in his review of my paper. It concerns claims I make about the titles of psychology papers:

This is a minor point, but I think a good illustration of the weakness of some of the main arguments that are made in the paper. On the second page, Yarkoni argues that “the vast majority of psychological scientists have long operated under a regime of (extremely) fast generalization”. I don’t know about the vast majority of scientists, but Yarkoni himself is definitely using fast generalization. He looked through a single journal, and found 3 titles that made general statements (e.g., “Inspiration Encourages Belief in God”). When I downloaded and read this article, I noticed the discussion contains a ‘constraint on generalizability’ in the discussion, following (Simons et al., 2017). The authors wrote: “We identify two possible constraints on generality. First, we tested our ideas only in American and Korean samples. Second, we found that inspiring events that encourage feelings of personal insignificance may undermine these effects.”. Is Yarkoni not happy with these two sentence clearly limiting the generalizability in the discussion?

I was initially going to respond to this in detail, but ultimately decided against it, because (a) by Lakens’ own admission, it’s a minor concern; (b) this is already very long as-is; and (c) while it’s a minor point in the context of my paper, I think this issue has some interesting and much more general implications for how we think about titles. So I’ve decided I won’t address it here, but will eventually take it up in a separate piece that gives it a more general treatment, and that includes a kind of litmus test one can use to draw reasonable conclusions about whether or not a title is appropriate. But, for what it’s worth, I did do a sweep through the paper in the process of revision, and have moderated some of the language.

Conclusion

Daniel Lakens argues that psychologists don’t need to care much if at all about the relationship between their statistical model specifications and their verbal hypotheses, because hypothesis testing in psychology proceeds deductively: researchers generate predictions from their theories, and then update their confidence in their theories on the basis of whether or not those predictions are confirmed. This all sounds great until you realize that those predictions are almost invariably evaluated using inferential statistical methods that are inductive by definition. So long as psychologists are relying on inferential statistics as decision aids, there can be no escape from induction. Deduction and induction are not competing philosophies or approaches; the standard operating procedure in psychology is essentially a hybrid of the two.

If you don’t like the idea that the ability to appraise a verbal hypothesis using statistics depends critically on the ability to understand and articulate how the statistical terms map onto the verbal ideas, that’s fine; an easy way to solve that problem is to just not use inferential statistics. That’s a perfectly reasonable position, in my view (and one I discuss at length in my paper). But once you commit yourself to relying on things like p-values and Bayes Factors to help you decide what you believe about the world, you’re obligated to think about, justify, and defend your statistical assumptions. They aren’t, or shouldn’t be, just a kind of pedantic technical magic you can push-button sprinkle on top of your favorite verbal hypotheses to make them really stick.

Fly less, give more

Russ Poldrack writes that he will be flying less:

I travel a lot – I have almost 1.3 million lifetime miles on United Airlines, and in the last few years have regularly flown over 100,000 miles per year. This travel has definitely helped advance my scientific career, and has been in many ways deeply fulfilling and enlightening. However, the toll has been weighing on me and Miles’ article really pushed me over the edge towards action. I used the Myclimate.org carbon footprint calculator to compute the environmental impact of my flights just for the first half of 2019, and it was mind-boggling: more than 23 tons of CO2. For comparison, my entire household’s yearly carbon footprint (estimated using https://www3.epa.gov/carbon-footprint-calculator/) is just over 10 tons!

For these reasons, I am committing to eliminate (to the greatest degree possible) academic air travel for the foreseeable future. That means no air travel for talks, conferences, or meetings — instead participating by telepresence whenever possible.

I’m sympathetic to the sentiment, and have considered reducing my own travel on a couple of occasions. So far, I’ve decided not to. It’s not that I disagree with Russ’s reasoning; I’m very much on board with the motivation for cutting travel, and I think we should all do our part to help avert, or at least mitigate, the looming climate crisis. The question for me is how to best go about that. While I haven’t entirely ruled out cutting down on flying in future, it’s not something I’m terribly eager to do. Travel is one of the most rewarding and fulfilling parts of my job, and I’m loath to stop flying unless there are no other ways to achieve the same or better outcomes.

Fortunately, there are other things one can do to try to keep the planet nice and habitable for all of us. For my part, I’ve decided that, rather than cutting back on travel, I’m going to give some money to charity. And in the next ~4,000 words, I’m going to try to convince you to consider doing the same (at least the giving to charity part; I’m not going to try to argue you out of not flying).

Now, I know what you’re thinking at this point. You’re probably thinking, what does he mean, ‘charity’? Is he talking about carbon offsets? He’s talking about carbon offsets, isn’t he! This dude is about to write a 4,000-word essay telling me I should buy carbon offsets! Like I don’t already know they’re a gigantic scam. No thanks.

Congratulations, my friend—you’re right. I do mean carbon offsets.

Well, kind of.

What I’ll really try to convince you of is that, while carbon offsets are actually a perfectly reasonable thing for a person to purchase, the idea of "offsetting" one’s lifestyle choices is probably not the most helpful way to think about the more general fight against climate change. But carbon offsets as they’re usually described—i.e., as promissory notes you can purchase from organizations that claim to suck up a certain amount of carbon out of the atmosphere by planting trees or engaging in other similarly hippie activities—are a good place to start, because pretty much everyone’s heard of them. So let me start by rebutting what I see as the two (or really three) most common arguments against offsets, and in the process, it’ll hopefully become clear why offsets are a bit of a red herring that’s best understood as just a special case of a much more general principle. The general principle being, if you want to save yourself a bunch of reading, do a little bit of research, and then give as much as you comfortably can.

Offsets as indulgences

Carbon offsets are frequently criticized on the grounds that they’re either (a) ineffective, or (b) morally questionable—amounting to a form of modern-day indulgence. I can’t claim any particular expertise in either climate science or Catholicism, but I can say that nothing I’ve read has convinced me of either argument.

Let’s take the indulgence argument first. Superficially, it may seem like carbon offsets are just a way for well-off people to buy their way out of their sins and into heaven (or across the Atlantic—whichever comes first). But, as David Roberts observed in an older Grist piece, there are some fairly important differences between the two things that make the analogy… not such a great one:

If there really were such a thing as sin, and there was a finite amount of it in the world, and it was the aggregate amount of sin that mattered rather than any individual’s contribution, and indulgences really did reduce aggregate sin, then indulgences would have been a perfectly sensible idea.

Roberts’s point is that when someone opts to buy carbon offsets before they get on a plane, the world still benefits from any resulting reduction in carbon release—it’s not like the money simply vanishes into the church’s coffers, never to be seen again, while the newly guilt-relieved traveler gets to go on their merry way. Maybe it feels like people are just taking the easy way out, but so what? There are plenty of other situations in which people opt to give someone else money in order to save themselves some time and effort—or for some other arbitrarily unsavory reason—and we don’t get all moralistic and say y’know, it’s not real parenting if you pay for a babysitter to watch your kids while you’re out, or how dare you donate money to cancer charities just so people see you as a good person. We all have imperfect motives for doing many of the things we do, but if your moral orientation is even slightly utilitarian, you should be able to decouple the motive for performing an action from the anticipated consequences of that action.

As a prominent example, none of us know what thought process went through Bill and Melinda Gates’s heads in the lead-up to their decision to donate the vast majority of their wealth to the Gates Foundation. But even if it was something like "we want the world to remember us as good people" rather than "we want the world to be a better place", would anyone seriously argue that the Gateses shouldn’t have donated their wealth?

You can certainly argue that it’s better to do the right thing for the right reason than the right thing for the wrong reason, but to the extent that one views climate change as a battle for the survival of humanity (or some large subset of it), it seems pretty counterproductive to only admit soldiers into one’s army if they appear to have completely altruistic motives for taking up arms.

The argument from uncertainty

Then there’s the criticism that carbon offsets are ineffective. I think there are actually two variants of this argument—one from uncertainty, and one from inefficacy. The argument from uncertainty is that there’s just too much uncertainty associated with offset programs. That is, many people are understandably worried that when they donate their money to tree-planting or cookstove-purchasing programs, they can’t know for sure that their investment will actually lead to a reduction in global carbon emissions, whereas when they reduce their air travel, they at least know that they’ve saved one ticket’s worth of emissions.

Now, it’s obviously true that offsets can be ineffective—if you give a charity some money to reduce carbon, and that charity blows all your money on advertising, squirrels it away in an executive’s offshore account, or plants a bunch of trees that barely suck up any carbon, then sure, you have a problem. But the fact that it’s possible to waste money giving to a particular cause doesn’t mean it’s inevitable. If it did, nobody would ever donate money to any charity, because huge inefficiencies are rampant. Similarly, there would be no basis for funding clean energy subsidies or emission-reducing technologies, seeing as the net long-term benefit of most such investments is virtually impossible to predict at the outset. Requiring certainty, or anything close to it, when seeking to do something good for the world is a solid recipe for doing almost nothing at all. Uncertainty about the consequences of our actions is just a fact of life, and there’s no reason to impose a higher standard here than in other areas.

Conversely, as intuitively appealing as the idea may be, trying to cut carbon by reducing one’s travel is itself very far from a sure thing. It would be nice to think that if the net carbon contribution of a given flight is estimated at X tons of CO2 per person, then the effect of a random person not getting on that flight is to reduce global CO2 levels by roughly X tons. But it doesn’t work quite that way. For one thing, it’s not as if abstaining from air travel instantly decreases the amount of carbon entering the atmosphere. Near term, the plane you would have traveled on is still going to take off, whether you’re on it or not. So if you decide to stay home, your action doesn’t actually benefit the environment in any way until such time as it (in concert with others’ actions) influences the broader air travel industry.

Will your actions eventually have a positive impact on the air travel industry? I don’t know. Probably. It seems reasonable to suppose that if a bunch of academics decide to stop flying, eventually, fewer planes will take off than otherwise would. What’s much less clear, though, is how many fewer. Will the effective CO2 savings be anywhere near the nominal figure that people like to float when estimating the impact of air travel—e.g., roughly 2 tons for a one-way transatlantic flight in economy? This I also don’t know, but it’s plausible to suppose they won’t. The reason is that your purchasing decisions don’t unfold in a vacuum. When an academic decides not to fly, United Airlines doesn’t say, "oh, I guess we have one less customer now." Instead, the airline—or really, its automated pricing system—says "I guess I’ll leave this low fare open a little bit longer". At a certain price point, the lower price will presumably induce someone to fly who otherwise wouldn’t.

Obviously, price elasticity has its limits. It may well be that, in the long term, the airlines can’t compensate for the drop in demand while staying solvent, and academics and other forward-thinking types get to take credit for saving the world. That’s possible. Alternatively, maybe it’s actually quite easy for airlines to create new, less conscientious, air travelers by lowering prices a little bit, and so the only real product of choosing to stay home is that you develop a bad case of FOMO while your friends are all out having fun learning new things at the conference. Which of these scenarios (or anything in between) happens to be true depends on a number of strong assumptions that, in general, I don’t think most academics, or even economists, have a solid grasp on (I certainly don’t pretend to).

To be clear, I’m not suggesting that the net impact of not flying is negative (that would surprise me), or that academics shouldn’t cut their air travel. I’m simply observing that there’s massive uncertainty about the effects of pretty much anything one could do to try to fight climate change. This doesn’t mean we should give up and do nothing (if uncertainty about the future were a reason not to do things, most of us would never leave our house in the morning), but it does mean that perhaps naive cause-and-effect intuitions of the if-I-get-on-fewer-planes-the-world-will-have-less-CO2 variety are not the best guide to effective action.

The argument from inefficacy

The other variant of the argument is a bit stronger, and is about inefficacy rather than uncertainty: here, the idea is not just that we can’t be sure that offsetting works; it’s that we actually have positive evidence that offset programs don’t do what they claim. In support of this argument, people like to point to articles like this, or this, or this—all of which make the case that many organizations that nominally offer to take one’s money and use it to pull some carbon out of the environment (or prevent it from being released) are just not cost-effective.

For what it’s worth, I find many of these articles pretty convincing, and, for the sake of argument, I’m happy to take what many of them say about specific mechanisms of putative carbon reduction as gospel truth. The thing is, the conclusion they support is not that trying to reduce carbon through charitable giving doesn’t work; it’s that it’s easy to waste your money by giving to the wrong organization. This doesn’t mean you have to put your pocketbook away and go home; it just means you might have to invest a bit of time researching the options before you can feel comfortable that there’s a reasonable (again, not a certain!) chance that your donation will achieve its intended purpose.

This observation shouldn’t be terribly troubling to most people. Most of us are already willing to spend some time researching options online before we buy, say, a television; there’s no reason why we shouldn’t expect to do the same thing when trying to use our money to help mitigate environmental disaster. Yet, in conversation, when I’ve asked my academic friends who express cynicism about the value of offsets how much time they’ve actually spent researching the issue, the answer is almost invariably "none" or "not much". I think this is a bit of an odd response from smart people with fancy degrees who I know spend much of their waking life thinking deeply about complex issues. Academics, more than most other folks, should be well aware of the dangers of boiling down a big question like "what’s the best way to fight climate change by spending money?" to a simplistic assertion like "nothing; it can’t be done." But the fact that this kind of response is so common does suggest to me that maybe we should be skeptical of the reflexive complaint that charitable giving can’t mitigate carbon emissions.

Crucially, we don’t have to stop at a FUD-like statement like nobody really knows what helps, so in principle, carbon offsets could be just as effective as not flying. No, I think it’s trivial to demonstrate essentially from first principles that there must be many cost-effective ways to offset one’s emissions.

The argument here is simple: much of what governments and NGOs do to fight climate change isn’t about directly changing individual human beings’ consumption behaviors, but about pro-actively implementing policies or introducing technologies that indirectly affect those behaviors, or minimize their impacts. Make a list of broad strategies, and you find things like:

  • Develop, incentivize and deploy clean energy sources.
  • Introduce laws and regulations that encourage carbon emission reduction (e.g., via forest preservation, congestion pricing, consumption taxes, etc.).
  • Offer financial incentives for farmers, loggers, and other traditional industrial sources of carbon to develop alternative income streams.
  • Fund public awareness campaigns to encourage individual lifestyle changes.
  • Fund research into blue-sky technologies that efficiently pull carbon out of the atmosphere and safely sequester it.

You can probably go on like this for a long time.

Now, some of the items on this list may be hard to pursue effectively unless you’re a government. But in most of these cases, there’s already a healthy ecosystem of NGOs working to make the world a better place. And there’s zero reason to think that it’s just flatly impossible for any of these organizations to be more effective than whatever benefit you think the environment derives from people getting on fewer planes.

On the contrary: it requires very little imagination to see how, say, a charity staffed by lawyers who help third-world governments draft and lobby for critical environment laws might have an environmental impact measured in billions of dollars, even if its budget is only in the millions. Or, if science is your thing, to believe that publicly-funded researchers working on clean energy do occasionally succeed at developing technologies that, when deployed at scale, provide societal returns many times the cost of the original research.

Once you frame it this way—and I honestly don’t know how one would argue against this way of looking at things—it seems pretty clear that blanket statements like "carbon offsets don’t work" are kind of dumb—or at least, intellectually lazy. If what you mean by "carbon offsets don’t work" is the much narrower claim that most tree-planting campaigns aren’t cost-effective, then sure, maybe that’s true. My impression is that many environmental economists would be happy to agree with you. But that narrow statement has almost no bearing on the question of whether or not you can cost-effectively offset the emissions you’d produce by flying. If somebody offered you credible evidence that their organization could reduce enough carbon emissions to offset your transatlantic flight for the princely sum of $10, I hope you wouldn’t respond by saying well, I read your brochure, and I buy all the evidence you presented, but it said nothing about trees anywhere, so I’m afraid I’m going to have to reject your offer and stay home.

The fact of the matter is that there are thousands, if not tens of thousands, of non-profit organizations currently working to fight climate change. They’re working on the problem in many different ways: via policy efforts, technology development, reforestation, awareness-raising, and any number of other avenues. Some of these organizations are undoubtedly fraudulent, bad at what they do, or otherwise a waste of your money. But it’s inconceivable that there aren’t some charities out there—and probably a large number, in absolute terms—that are very effective at what they do, and certainly far more effective than whatever a very high-flying individual can achieve by staying off the runways and saving a couple dozen tons of CO2 per year. And you don’t even need there to be a large number of such organizations; you just need to find one of them.

Do you really find it so hard to believe that there are such organizations out there? And that there are also quite a few people whose day job is identifying those organizations, precisely so that people like you and I can come along and give them money?

I don’t.

So how should you spend your money?

Supposing you find the above plausible, you might be thinking, okay, fine, maybe offsetting does work, as long as you’re smart about how you do it—now please tell me who to make a check out to so I can keep drinking terrible hotel coffee and going to poster sessions that make me want to claw my eyes out.

Well, I hate to disappoint you, but I’m not entirely comfortable telling you what you should do with your money (I mean, if you insist on an answer, I’ll probably tell you to give it to me). What I can do is tell you what I’ve done with mine.

A few months ago, I set aside an evening and spent a few hours reading up on various climate-focused initiatives. I ended up donating money to the Clean Air Task Force and the Coalition for Rainforest Nations. Both of these are policy-focused organizations; they don’t plant tree saplings or buy anyone a clean-burning stove. They fight climate change by attempting to influence policy in ways that promote, respectively, clean air in the United States, and preservation of the world’s rain forests. They are also, not coincidentally, the two organizations strongly recommended by Founders Pledge—an organization dedicated to identifying effective ways for technology founders (but really, pretty much anyone) to spend their money for the benefit of society.

My decision to give to these organizations was motivated largely by this Founders Pledge report, which I think compellingly argues that these organizations likely offer a much better return on one’s investment than most others. The report estimates a cost of $0.02 – $0.72 per ton of CO2 release averted when donating to the Coalition for Rainforest Nations (the cost is somewhat higher for the Clean Air Task Force). For reference, typical estimates suggest that a single one-way economy-class transatlantic plane ticket introduces perhaps 2 – 3 tons of CO2 to the atmosphere. So, even at the conservative end of Founders Pledge’s "realistic" estimate, you’d need to give CfRN only around $2 to offset that cost. Personally, I’m a skeptical kind of person, so I don’t take such estimates at face value. When I see this kind of number, I immediately multiply it by a factor of 10, because I know how the winner’s curse works. In this case, that still leaves you with an estimate of roughly $20 per flight—a number I’m perfectly happy with personally, and that seems to me quite manageable for almost anybody who can afford to get on a transatlantic flight in the first place.
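Spelled out as arithmetic (using the figures from the report quoted above, plus my own entirely arbitrary 10x skepticism factor), the calculation looks like this:

```python
tons_per_flight = 3.0      # rough CO2 for a one-way transatlantic economy ticket
cost_per_ton = 0.72        # upper ("conservative") end of the report's realistic range, in USD
skepticism_factor = 10     # arbitrary inflation factor for optimistic charity estimates

nominal_cost = tons_per_flight * cost_per_ton
print(nominal_cost)                      # ~$2 to nominally offset the flight
print(nominal_cost * skepticism_factor)  # ~$22 once the skepticism factor is applied
```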

Am I sure that my donations to the above organizations will ultimately do the environment some good? No.

Do I feel confident that these are the best charities out there? Of course not. It’s hard to imagine that they could be, given the sheer number of organizations in this space. But again, certainty is a wildly unrealistic desideratum here. What I am satisfied with is that I’ve done my due diligence, and that in my estimation, I’ve identified a plausibly effective mechanism through which I can do a very tiny bit of good for the world (well, two mechanisms—the other one is this blog post I’m writing, which will hopefully convince at least one other person to take similar action).

I’m not suggesting that anyone else has to draw the same conclusions I have, or donate to the same organizations. Your mileage will probably vary. If, after doing some research, you decide that in your estimation, not flying still makes the most sense, great. And if you decide that actually none of this climate stuff is likely to help, and instead, you’re going to give your money to charities that work on AI alignment or malaria, great. But at the very least, I hope it’s clear there’s really no basis for simply dismissing, out of hand, the notion that one can effectively help reduce atmospheric CO2—on a very, very tiny scale, obviously—via financial means, rather than solely through lifestyle changes.

Why stop at offsets?

So far I’ve argued that donating your money to climate-focused organizations (done thoughtfully) is a perfectly acceptable alternative to cutting back on travel, if your goal is to ultimately reduce atmospheric carbon. If you want to calculate the amount of money you’d need to give to the organization of your choice in order to offset the carbon that your travel (or, more generally, lifestyle) introduces every year, and give exactly that much, great.

But I want to go a bit further than that. What I really want to suggest is that if you’re very concerned about the environment, donating your money can actually be a much better thing to do than just minimizing your own footprint.

The major advantage of charitable giving over travel reduction (or really any kind of lifestyle change) is that there’s a much higher ceiling on what you can accomplish. When you try to fight global warming by avoiding travel, the best you can do is eliminate all of your own personal travel. That may not be trivial, and I think it’s certainly worth doing if your perceived alternative is doing nothing at all. Still, there’s always going to be a hard limit on your contribution. It’s not like you can remove arbitrarily large quantities of carbon from the environment by somehow, say, negatively traveling.

By contrast, when you give money, you don’t have to stop at just offsetting your own carbon production; in principle, you can pay to offset other people’s production too. If you have some discretionary income, and believe that climate change really is an existential threat to the human species (or some large subset of it), then on some level it seems a bit strange to say, "I just want to make sure I personally don’t produce more carbon than the average human being living in my part of the world; beyond that, it’s other people’s problem." If you believe that climate change presents an existential threat to your descendants, or at least to their quality of life, and you can afford to do more than just reduce your own carbon footprint, why not use more of your resources to try to minimize the collective impact of humanity’s past poor environmental decisions? I’m not saying anyone has a moral obligation to do that; I don’t think they do. But it doesn’t seem like a crazy thing to do, if you have some money to spare.

You can still fly less!

Before I go, let me circle around to where I started. I want to emphasize that nothing I’ve said here is intended as criticism of what Russ Poldrack wrote, or of the anti-flying movement more generally. Quite the opposite: I think Russ Poldrack is a goddamn hero (and not just for his position on this issue). If not for Russ’s post, and subsequent discussions on social media, I doubt I would have been sufficiently motivated to put my own money where my mouth is on this issue, let alone to write this post (as an otherwise fairly selfish person, I’m not ashamed to say that I wrote this post in part to force myself to give a serious chunk of cash to charity—public commitment is a powerful thing!). So I’m very much on board with the initiative: other things equal, I think cutting back on one’s air travel is a good thing to do. All I’m saying here is that there are other ways one can do one’s part in the fight against climate change that don’t require giving up air travel—and that, if anything, have the potential to exert far greater (though admittedly still tiny in the grand scheme of things) impact.

It also goes without saying that the two approaches are not mutually exclusive. On the contrary, the best-case scenario is that most people cut their air travel and give money to organizations working to mitigate climate change. But since nobody is perfect, everything is commensurable, and people have different preferences and resource constraints, I take it for granted that most people (me included) aren’t going to do both, and I think that’s okay. It seems perfectly reasonable to me to feel okay about your relationship with the environment so long as you’re doing something. I respect people who opt to do their part by cutting down on their air travel. But I’m not going to feel guilty for continuing to fly around the world fairly regularly, because I think I’m doing my part too.

The parable of the three districts: A projective test for psychologists

A political candidate running for regional public office asked a famous political psychologist what kind of television ads she should air in three heavily contested districts: positive ones emphasizing her own record, or negative ones attacking her opponent’s record.

“You’re in luck,” said the psychologist. “I have a new theory of persuasion that addresses exactly this question. I just published a paper containing four large studies that all strongly support the theory and show that participants are on average more persuaded by attack ads than by positive ones.”

Convinced by the psychologist’s arguments and his confident demeanor, the candidate had her campaign run carefully tailored attack ads in all three districts. She proceeded to lose the race by a landslide, with exit surveys placing much of the blame on the negative tone of her ads.

As part of the campaign post-mortem, the candidate asked the psychologist what he thought had gone wrong.

“Oh, different things,” said the psychologist. “In hindsight, the first district was probably too educated; I could see how attack ads might turn off highly educated voters. In the second district—and I’m not going to tiptoe around the issue here—I think the problem was sexism. You have a lot of low-SES working-class men in that district who probably didn’t respond well to a female candidate publicly criticizing a male opponent. And in the third district, I think the ads you aired were just too over the top. You want to highlight your opponent’s flaws subtly, not make him sound like a cartoon villain.”

“That all sounds reasonable enough,” said the candidate. “But I’m a bit perplexed that you didn’t mention any of these subtleties ahead of time, when they might have been more helpful.”

“Well,“ said the psychologist. “That would have been very hard to do. The theory is true in general, you see. But every situation is different.“

I hate open science

Now that I’ve got your attention: what I hate—and maybe dislike is a better term than hate—isn’t the open science community, or open science initiatives, or open science practices, or open scientists… it’s the term. I fundamentally dislike the term open science. For the last few years, I’ve deliberately tried to avoid using it. I don’t call myself an open scientist, I don’t advocate publicly for open science (per se), and when people use the term around me, I often make a point of asking them to clarify what they mean.

This isn’t just a personal idiosyncrasy of mine in a chalk-on-chalkboard sense; I think at this point in time there are good reasons to think the continued use of the term is counterproductive, and we should try to avoid it in most contexts. Let me explain.

It’s ambiguous

At SIPS 2019 last week (SIPS is the Society for the Improvement of Psychological Science), I had a brief chat with a British post-undergrad student who was interested in applying to graduate programs in the United States. He asked me what kind of open science community there was at my home institution (the University of Texas at Austin). When I started to reply, I realized that I actually had no idea what question the student was asking me, because I didn’t know his background well enough to provide the appropriate context. What exactly did he mean by “open science”? The term is now used so widely, and in so many different ways, that the student could plausibly have been asking me about any of the following things, either alone or in combination:

  • Reproducibility. Do people [at UT-Austin] value the ability to reproduce, computationally and/or experimentally, the scientific methods used to produce a given result? More concretely, do they conduct their analyses programmatically, rather than using GUIs? Do they practice formal version control? Are there opportunities to learn these kinds of computational skills?
  • Accessibility. Do people believe in making their scientific data, materials, results, papers, etc. publicly, freely, and easily available? Do they work hard to ensure that other scientists, funders, and the taxpaying public can easily get access to what scientists produce?
  • Incentive alignment. Are there people actively working to align individual incentives and communal incentives, so that what benefits an individual scientist also benefits the community at large? Do they pursue local policies meant to promote some of the other practices one might call part of “open science”?
  • Openness of opinion. Do people feel comfortable openly critiquing one another? Is there a culture of discussing (possibly trenchant) problems openly, without defensiveness? Do people take discussion on social media and post-publication review forums seriously?
  • Diversity. Do people value and encourage the participation in science of people from a wide variety of ethnicities, genders, skills, personalities, socioeconomic strata, etc.? Do they make efforts to welcome others into science, invest effort and resources to help them succeed, and accommodate their needs?
  • Metascience and informatics. Are people thinking about the nature of science itself, and reflecting on what it takes to promote a healthy and productive scientific enterprise? Are they developing systematic tools or procedures for better understanding the scientific process, or the work in specific scientific domains?

This is not meant to be a comprehensive list; I have no doubt there are other items one could add (e.g., transparency, collaborativeness, etc.). The point is that open science is, at this point, a very big tent. It contains people who harbor a lot of different values and engage in many different activities. While some of these values and activities may tend to co-occur within people who call themselves open scientists, many don’t. There is, for instance, no particular reason why someone interested in popularizing reproducible science methods should also be very interested in promoting diversity in science. I’m not saying there aren’t people who want to do both (of course there are); empirically, there might even be a modest positive correlation—I don’t know. But they clearly don’t have to go together, and plenty of people are far more invested in one than in the other.

Further, as in any other enterprise, if you monomaniacally push a single value hard enough, then at a certain point, tensions will arise even between values that would ordinarily co-exist peacefully if each were given only partial priority. For example, if you think that doing reproducible science well requires a non-negotiable commitment to doing all your analyses programmatically, and maintaining all your code under public version control, then you’re implicitly condoning a certain reduction in diversity within science, because you insist on having only people with a certain set of skills take part in science, and people from some backgrounds are more likely than others (at least at present) to have those skills. Conversely, if diversity in science is the thing you value most, then you need to accept that you’re effectively downgrading the importance of many of the other values listed above in the research process, because any skill or ability you might use to select or promote people in science is necessarily going to reduce (in expectation) the role of other dimensions in the selection process.

This would be a fairly banal and inconsequential observation if we lived in a world where everyone who claimed membership in the open science community shared more or less the same values. But we clearly don’t. In highlighting the ambiguity of the term open science, I’m not just saying hey, just so you know, there are a lot of different activities people call open science; I’m saying that, at this point in time, there are a few fairly distinct sub-communities of people that all identify closely with the term open science and use it prominently to describe themselves or their work, but that actually have fairly different value systems and priorities.

Basically, we’re now at the point where, when someone says they’re an open scientist, it’s hard to know what they actually mean.

It wasn’t always this way; I think ten or even five years ago, if you described yourself as an open scientist, people would have identified you primarily with the movement to open up access to scientific resources and promote greater transparency in the research process. This is still roughly the first thing you find on the Wikipedia entry for Open Science:

Open science is the movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of an inquiring society, amateur or professional. Open science is transparent and accessible knowledge that is shared and developed through collaborative networks. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge.

That was a fine definition once upon a time, and it still works well for one part of the open science community. But as a general, context-free definition, I don’t think it flies any more. Open science is now much broader than the above suggests.

It’s bad politics

You might say, okay, but so what if open science is an ambiguous term; why can’t that be resolved by just having people ask for clarification? Well, obviously, to some degree it can. My response to the SIPS student was basically a long and winding one that involved a lot of conditioning on different definitions. That’s inefficient, but hopefully the student still got the information he wanted out of it, and I can live with a bit of inefficiency.

The bigger problem though, is that at this point in time, open science isn’t just a descriptive label for a set of activities scientists often engage in; for many people, it’s become an identity. And, whatever you think the value of open science is as an extensional label for a fairly heterogeneous set of activities, I think it makes for terrible identity politics.

There are two reasons for this. First, turning open science from a descriptive label into a full-blown identity risks turning off a lot of scientists who are either already engaged in what one might otherwise call “best practices”, or who are very receptive to learning such practices, but are more interested in getting their science done than in discussing the abstract merits of those practices or promoting their use to others. If you walk into a room and say, in the next three hours, I’m going to teach you version control, and there’s a good chance this could really help your research, probably quite a few people will be interested. If, on the other hand, you walk into the room and say, let me tell you how open science is going to revolutionize your research, and then proceed to either mention things that a sophisticated audience already knows, or blitz a naive audience with 20 different practices that you describe as all being part of open science, the reception is probably going to be frostier.

If your goal is to get people to implement good practices in their research—and I think that’s an excellent goal!—then it’s not so clear that much is gained by talking about open science as a movement, philosophy, culture, or even community (though I do think there are some advantages to the latter). It may be more effective to figure out who your audience is, what some of the low-hanging fruit are, and focus on those. Implying that there’s an all-or-none commitment—i.e., one is either an open scientist or not, and to be one, you have to buy into a whole bunch of practices and commitments—is often counterproductive.

The second problem with treating open science as a movement or identity is that the diversity of definitions and values I mentioned above almost inevitably leads to serious rifts within the broad open science community—i.e., between groups of people who would have little or no beef with one another if not for the mere fact that they all happen to identify as open scientists. If you spend any amount of time on social media following people whose biography includes the phrases “open science” or “open scientist”, you’ll probably know what I’m talking about. At a rough estimate, I’d guess that these days maybe 10–20% of tweets I see in my feed containing the words “open science” are part of some ongoing argument between people about what open science is, or who is and isn’t an open scientist, or what’s wrong with open science or open scientists—and have nothing to do with substantive practices or applications at all.

I think it’s fair to say that most (though not all) of these arguments are, at root, about deep-seated differences in the kinds of values I mentioned earlier. People care about different things. Some people care deeply about making sure that studies can be accurately reproduced, and only secondarily or tertiarily about the diversity of the people producing those studies. Other people have the opposite priorities. Both groups of people (and there are of course many others) tend to think their particular value system properly captures what open science is (or should be) all about, and that the movement or community is being perverted or destroyed by some other group of people who, while perhaps well-intentioned (and sometimes even this modicum of charity is hard to find), just don’t have their heads screwed on quite straight.

This is not a new or special thing. Any time a large group of people with diverse values and interests find themselves all forced to sit under a single tent for a long period of time, divisions—and consequently, animosity—will eventually arise. If you’re forced to share limited resources or audience attention with a group of people who claim they fill the same role in society that you do, but who you disagree with on some important issues, odds are you’re going to experience conflict at some point.

Now, in some domains, these kinds of conflicts are truly unavoidable: the factors that introduce intra-group competition for resources, prestige, or attention are structural, and resolving them without ruining things for everyone is very difficult. In politics, for example, one’s nominal affiliation with a political party is legitimately kind of a big deal. In the United States, if a splinter group of disgruntled Republican politicians were to leave their party and start a “New Republican” party, they might achieve greater ideological purity and improve their internal social relations, but the new party’s members would also lose nearly all of their influence and power pretty much overnight. The same is, of course, true for disgruntled Democrats. The Nash equilibrium is, presently, for everyone to stay stuck in the same dysfunctional two-party system.

Open science, by contrast, doesn’t really have this problem. Or at least, it doesn’t have to have this problem. There’s an easy way out of the acrimony: people can just decide to deprecate vague, unhelpful terms like “open science” in favor of more informative and less controversial ones. I don’t think anything terrible is going to happen if someone who previously described themselves as an “open scientist” starts avoiding that term and instead opts to self-describe using more specific language. As I noted above, I speak from personal experience here (if you’re the kind of person who’s more swayed by personal anecdotes than by my ironclad, impregnable arguments). Five years ago, my talks and papers were liberally sprinkled with the term “open science”. For the last two or three years, I’ve largely avoided the term—and when I do use it, it’s often to make the same point I’m making here.

For the most part, I think I’ve succeeded in eliminating open science from my discourse in favor of more specific terms like reproducibility, transparency, diversity, etc. Which term I use depends on the context. I haven’t, so far, found myself missing the term “open”, and I don’t think I’ve lost brownie points in any club for not using it more often. I do, on the other hand, feel very confident that (a) I’ve managed to waste fewer people’s time by having to follow up vague initial statements about “open” things with more detailed clarifications, and (b) I get sucked into way fewer pointless Twitter arguments about what open science is really about (though admittedly the number is still not quite zero).

The prescription

So here’s my simple prescription for people who either identify as open scientists, or use the term on a regular basis: Every time you want to use the term open science—in your biography, talk abstracts, papers, tweets, conversation, or whatever else—pause and ask yourself if there’s another term you could substitute that would decrease ambiguity and avoid triggering never-ending terminological arguments. I’m not saying that the answer will always be yes. If you’re confident that the people you’re talking to have the same definition of open science as you, or you really do believe that nobody should ever call themselves an open scientist unless they use git, then godspeed—open science away. But I suspect that for most uses, there won’t be any such problem. In most instances, “open science” can be seamlessly replaced with something like “reproducibility”, “transparency”, “data sharing”, “being welcoming”, and so on. It’s a low-effort move, and the main effect of making the switch is that other people will have a clearer understanding of what you mean, and may be less inclined to argue with you about it.

Postscript

Some folks on Twitter were concerned that this post makes it sound as if I’m passing off prior work and ideas as my own (particularly as it relates to the role of diversity in open science). So let me explicitly state here that I don’t think any of the ideas expressed in this post are original to me in any way. I’ve heard most (if not all) expressed many times by many people in many contexts, and this post just represents my effort to distill them into a clear summary of my views.

No, it’s not The Incentives—it’s you

There’s a narrative I find kind of troubling, but that unfortunately seems to be growing more common in science. The core idea is that the mere existence of perverse incentives is a valid and sufficient reason to knowingly behave in an antisocial way, just as long as one first acknowledges the existence of those perverse incentives. The way this dynamic usually unfolds is that someone points out some fairly serious problem with the way many scientists behave—say, our collective propensity to p-hack as if it’s going out of style, or the fact that we insist on submitting our manuscripts to publishers that are actively trying to undermine our interests—and then someone else will say, “I know, right—but what are you going to do, those are the incentives.”

As best I can tell, the words “it’s the incentives” are magic. Once they’re uttered by someone, natural law demands that everyone else involved in the conversation immediately stop whatever else they were doing, solemnly nod, and mumble something to the effect that, yes, the incentives are very bad, very bad indeed, and it’s a real tragedy that so many smart, hard-working people are being crushed under the merciless, gigantic boot of The System. Then there’s usually a brief pause, and after that, everyone goes back to discussing whatever they were talking about a moment earlier.

Perhaps I’m getting senile in my early middle age, but my anecdotal perception is that it used to be that, when somebody pointed out to a researcher that they might be doing something questionable, that researcher would typically either (a) argue that they weren’t doing anything questionable (often incorrectly, because there used to be much less appreciation for some of the statistical issues involved), or (b) look uncomfortable for a little while, allow an awkward silence to bloom, and then change the subject. In the last few years, I’ve noticed that uncomfortable discussions about questionable practices disproportionately seem to end with a chuckle or shrug, followed by a comment to the effect that we are all extremely sophisticated human beings who recognize the complexity of the world we live in, and sure it would be great if we lived in a world where one didn’t have to occasionally engage in shenanigans, but that would be extremely naive, and after all, we are not naive, are we?

There is, of course, an element of truth to this kind of response. I’m not denying that perverse incentives exist; they obviously do. There’s no question that many aspects of modern scientific culture systematically incentivize antisocial behavior, and I don’t think we can or should pretend otherwise. What I do object to quite strongly is the narrative that scientists are somehow helpless in the face of all these awful incentives—that we can’t possibly be expected to take any course of action that has any potential, however small, to impede our own career development.

“I would publish in open access journals,” your friendly neighborhood scientist will say. “But those have a lower impact factor, and I’m up for tenure in three years.”

Or: “if I corrected for multiple comparisons in this situation, my effect would go away, and then the reviewers would reject the paper.”

Or: “I can’t ask my graduate students to collect an adequately-powered replication sample; they need to publish papers as quickly as they can so that they can get a job.”

There are innumerable examples of this kind, and they’ve become so routine that it appears many scientists have stopped thinking about what the words they’re saying actually mean, and instead simply glaze over and nod sagely whenever the dreaded Incentives are invoked.

A random bystander who happened to eavesdrop on a conversation between a group of scientists kvetching about The Incentives could be forgiven for thinking that maybe, just maybe, a bunch of very industrious people who generally pride themselves on their creativity, persistence, and intelligence could find some way to work around, or through, the problem. And I think they would be right. The fact that we collectively don’t see it as a colossal moral failing that we haven’t figured out a way to get our work done without having to routinely cut corners in the rush for fame and fortune is deeply troubling.

It’s also aggravating on an intellectual level, because the argument that we’re all being egregiously and continuously screwed over by The Incentives is just not that good. I think there are a lot of reasons why researchers should be very hesitant to invoke The Incentives as a justification for why any of us behave the way we do. I’ll give nine of them here, but I imagine there are probably others.

1. You can excuse anything by appealing to The Incentives

No, seriously—anything. Once you start crying that The System is Broken in order to excuse your actions (or inactions), you can absolve yourself of responsibility for all kinds of behaviors that, on paper, should raise red flags. Consider just a few behaviors that few scientists would condone:

  • Fabricating data or results
  • Regularly threatening to fire trainees in order to scare them into working harder
  • Deliberately sabotaging competitors’ papers or grants by reviewing them negatively

I think it’s safe to say most of us consider such practices to be thoroughly immoral, yet there are obviously people who engage in each of them. And when those people are caught or confronted, one of the most common justifications they fall back on is… you guessed it: The Incentives! When Diederik Stapel confessed to fabricating the data used in over 50 publications, he didn’t explain his actions by saying “oh, you know, I’m probably a bit of a psychopath”; instead, he placed much of the blame squarely on The Incentives:

I did not withstand the pressure to score, to publish, the pressure to get better in time. I wanted too much, too fast. In a system where there are few checks and balances, where people work alone, I took the wrong turn. I want to emphasize that the mistakes that I made were not born out of selfish ends.

Stapel wasn’t acting selfishly, you see… he was just subject to intense pressures. Or, you know, Incentives.

Or consider these quotes from a New York Times article describing Stapel’s unraveling:

In his early years of research — when he supposedly collected real experimental data — Stapel wrote papers laying out complicated and messy relationships between multiple variables. He soon realized that journal editors preferred simplicity. “They are actually telling you: ‘Leave out this stuff. Make it simpler,'” Stapel told me. Before long, he was striving to write elegant articles.

The experiment — and others like it — didn’t give Stapel the desired results, he said. He had the choice of abandoning the work or redoing the experiment. But he had already spent a lot of time on the research and was convinced his hypothesis was valid. “I said — you know what, I am going to create the data set,” he told me.

Reading through such accounts, it’s hard to avoid the conclusion that Stapel’s self-narrative is strikingly similar to the one that gets tossed out all the time on social media, or in conference bar conversations: here I am, a good scientist trying to do an honest job, and yet all around me is a system that incentivizes deception and corner-cutting. What do you expect me to do?

Curiously, I’ve never heard any of my peers—including many of the same people who are quick to invoke The Incentives to excuse their own imperfections—seriously endorse The Incentives as an acceptable justification for Stapel’s behavior. In Stapel’s case, the inference we overwhelmingly jump to is that there must be something deeply wrong with Stapel, seeing as the rest of us also face the same perverse incentives on a daily basis, yet we somehow manage to get by without fabricating data. But this conclusion should make us a bit uneasy, I think, because if it’s correct (and I think it is), it implies that we aren’t really such slaves to The Incentives after all. When our morals get in the way, we appear to be perfectly capable of resisting temptation. And I mean, it’s not even like it’s particularly difficult; I doubt many researchers actively have to fight the impulse to manipulate their data, despite the enormous incentives to do so. I submit that the reason many of us feel okay doing things like reporting exploratory results as confirmatory results, or failing to mention that we ran six other studies we didn’t report, is not really that The Incentives are forcing us to do things we don’t like, but that it’s easier to attribute our unsavory behaviors to unstoppable external forces than to take responsibility for them and accept the consequences.

Needless to say, I think this kind of attitude is fundamentally hypocritical. If we’re not comfortable with pariahs like Stapel blaming The Incentives for causing them to fabricate data, we shouldn’t use The Incentives as an excuse for doing things that are on the same spectrum, albeit less severe. If you think that what the words “I did not withstand the pressure to score” really mean when they fall out of Stapel’s mouth is something like “I’m basically a weak person who finds the thought of not being important so intolerable I’m willing to cheat to get ahead”, then you shouldn’t give yourself a free pass just because when you use that excuse, you’re talking about much smaller infractions. Consider the possibility that maybe, just like Stapel, you’re actually appealing to The Incentives as a crutch to avoid having to make your life very slightly more difficult.

2. It would break the world if everyone did it

When people start routinely accepting that The System is Broken and The Incentives Are Fucking Us Over, bad things tend to happen. It’s very hard to have a stable, smoothly functioning society once everyone believes (rightly or wrongly) that gaming the system is the only way to get by. Imagine if every time you went to your doctor—and I’m aware that this analogy won’t work well for people living outside the United States—she sent you to get a dozen expensive and completely unnecessary medical tests, and then, when prompted for an explanation, simply shrugged and said “I know I’m not an angel—but hey, them’s The Incentives.” You would be livid—even though it’s entirely true (at least in the United States; other developed countries seem to have figured this particular problem out) that many doctors have financial incentives to order unnecessary tests.

To be clear, I’m not saying perverse incentives never induce bad behavior in medicine or other fields. Of course they do. My point is that practitioners in other fields at least appear to have enough sense not to loudly trumpet The Incentives as a reasonable justification for their antisocial behavior—or to pat themselves on the back for being the kind of people who are clever enough to see the fiendish Incentives for exactly what they are. My sense is that when doctors, lawyers, journalists, etc. fall prey to The Incentives, they generally consider that to be a source of shame. I won’t go so far as to suggest that we scientists take pride in behaving badly—we obviously don’t—but we do seem to have collectively developed a rather powerful form of learned helplessness that doesn’t seem to be matched by other communities. Which is a fortunate thing, because if every other community also developed the same attitude, we would be in a world of trouble.

3. You are not special

Individual success in science is, to a first approximation, a zero-sum game—at least in the short term. Yet many scientists who appeal to The Incentives seem to genuinely believe that opting out of doing the right thing is a victimless crime. I mean, sure, it might make the system a bit less efficient overall… but that’s just life, right? It’s not like anybody’s actually suffering.

Well yeah, people actually do suffer. There are many scientists who are willing to do the right things—to preregister their analysis plans, to work hard to falsify rather than confirm their hypotheses, to diligently draw attention to potential confounds that complicate their preferred story, and so on. When you assert your right to opt out of these things because apparently your publications, your promotions, and your students are so much more important than everyone else’s, you’re cheating those people.

No, really, you are. If you don’t like to think of yourself as someone who cheats other people, don’t reflexively collapse on a crutch made out of stainless steel Incentives any time someone questions your process. You are not special. Your publications, job, and tenure are not more important than other people’s. The fact that there are other people in your position engaging in the same behaviors doesn’t mean you and your co-authors are all very sophisticated, and that the people who refuse to cut corners are naive simpletons. What it actually demonstrates is that, somewhere along the way, you developed the reflexive ability to rationalize away behavior that you would disapprove of in others and that, viewed dispassionately, is clearly damaging to science.

4. You (probably) have no data

It’s telling that appeals to The Incentives are rarely supported by any actual data. It’s simply taken for granted that engaging in the practice in question would be detrimental to one’s career. The next time you’re tempted to blame The System for making you do bad things, you might want to ask yourself this: Do you actually know that, say, publishing in PLOS ONE rather than [insert closed society journal of your choice] would hurt your career? If so, how do you know that? Do you have any good evidence for it, or have you simply accepted it as stylized fact?

Coming by the kind of data you’d need to answer this question is actually not that easy: it’s not enough to reflexively point to, say, the fact that some journals have higher impact factors than others. To identify the utility-maximizing course of action, you’d need to integrate over both benefits and costs, and the costs are not always so obvious. For example, the opportunity cost of not submitting your paper to a “good” journal will be offset to some extent by the likelihood of faster publication (no need to spend two years racking up rejections at high-impact venues), by the positive signal you send to at least some of your peers that you support open scientific practices, and so on.

I’m not saying that a careful consideration of the pros and cons of doing the right thing would usually lead people to change their minds. It often won’t. What I’m saying is that people who blame The Incentives for forcing them to submit their papers to certain journals, to tell post-hoc stories about their work, or to use suboptimal analytical methods don’t generally support their decisions with data, or even with well-reasoned argument. The defense is usually completely reflexive—which should raise our suspicion that it’s also just a self-serving excuse.

5. It (probably) won’t matter anyway

This one might hurt a bit, but I think it’s important to consider—particularly for early-career researchers. Let’s suppose you’re right that doing the right thing in some particular case would hurt your career. Maybe it really is true that if you comprehensively report in your paper on all the studies you ran, and not just the ones that “worked”, your colleagues will receive your work less favorably. In such cases it may seem natural to think that there has to be a tight relationship between the current decision and the global outcome—i.e., that if you don’t drop the failed studies, you won’t get a tenure-track position three years down the road. After all, you’re focusing on that causal relationship right now, and it seems so clear in your head!

Unfortunately (or perhaps fortunately?), reality doesn’t operate that way. Outcomes in academia are multiply determined and enormously complex. You can tell yourself that getting more papers out faster will get you a job if it makes you feel better, but that doesn’t make it true. If you’re a graduate student on the job market these days, I have sad news for you: you’re probably not getting a tenure-track job no matter what you do. It doesn’t matter how many p-hacked papers you publish, or how thinly you slice your dissertation into different “studies”; there are not nearly enough jobs to go around for everyone who wants one.

Suppose you’re right, and your sustained pattern of corner-cutting is in fact helping you get ahead. How far ahead do you think it’s helping you get? Is it taking you from a 3% chance of getting a tenure-track position at an R1 university to an 80% chance? Almost certainly not. Maybe it’s increasing that probability from 7% to 11%; that would still be a non-trivial relative increase, but it doesn’t change the fact that, for the average grad student, there is no full-time faculty position waiting at the end of the road. Despite what the environment around you may make you think, the choice most graduate students and postdocs face is not actually between (a) maintaining your integrity and “failing” out of science or (b) cutting a few corners and achieving great fame and fortune as a tenured professor. The Incentives are just not that powerful. The vastly more common choice you face as a trainee is between (a) maintaining your integrity and having a pretty low chance of landing a permanent research position, or (b) cutting a bunch of corners that threaten the validity of your work and having a slightly higher (but still low in absolute terms) chance of landing a permanent research position. And even that’s hardly guaranteed, because you never know when there’s someone on a hiring committee who’s going to be turned off by the obvious p-hacking in your work.

The point is, the world is complicated, and as a general rule, very few things—including the number of publications you produce—are as important as they seem to be when you’re focusing on them in the moment. If you’re an early-career researcher and you regularly find yourself struggling between doing what’s right and doing what isn’t right but (you think) benefits your career, you may want to take a step back and dispassionately ask yourself whether this integrity versus expediency conflict is actually a productive way to frame things. Instead, consider the alternative framing I suggested above: you are most likely going to leave academia eventually, no matter what you do, so why not at least try to see the process through with some intellectual integrity? And I mean, if you’re really so convinced that The System is Broken, why would you want to stay in it anyway? Do you think standards are going to change dramatically in the next few years? Are you laboring under the impression that you, of all people, are going to somehow save science?

This brings us directly to the next point…

6. You’re (probably) not going to “change things from the inside”

Over the years, I’ve talked to quite a few early-career researchers who have told me that while they can’t really stop engaging in questionable research practices right now without hurting their career, they’re definitely going to do better once they’re in a more established position. These are almost invariably nice, well-intentioned people, and I don’t doubt that they genuinely believe what they say. Unfortunately, what they say is slippery, and has a habit of adapting to changing circumstances. As a grad student or postdoc, it’s easy to think that once you get a faculty position, you’ll be able to start doing research the “right” way. But once you get a faculty position, it then turns out you need to get papers and grants in order to get tenure (I mean, who knew?), so you decide to let the dreaded Incentives win for just a few more years. And then, once you secure tenure, well, now the problem is that your graduate students also need jobs, just like you once did, so you can’t exactly stop publishing at the same rate, can you? Plus, what would all your colleagues think if you effectively said, “oh, you should all treat the last 15 years of my work with skepticism—that was just for tenure”?

I’m not saying there aren’t exceptions. I’m sure there are. But I can think of at least a half-dozen people off-hand who’ve regaled me with some flavor of “once I’m in a better position” story, and none of them, to my knowledge, have carried through on their stated intentions in a meaningful way. And I don’t find this surprising: in most walks of life, course correction generally becomes harder, not easier, the longer you’ve been traveling on the wrong bearing. So if part of your unhealthy respect for The Incentives is rooted in an expectation that those Incentives will surely weaken their grip on you just as soon as you reach the next stage of your career, you may want to rethink your strategy. The Incentives are not going to dissipate as you move up the career ladder; if anything, you’re probably going to have an increasingly difficult time shrugging them off.

7. You’re not thinking long-term

One of the most frustrating aspects of appeals to The Incentives is that they almost invariably seem to focus exclusively on the short-to-medium term. But the long term also matters. And there, I would argue that The Incentives very much favor a radically different—and more honest—approach to scientific research. To see this, we need only consider the ongoing “replication crisis” in many fields of science. One thing that I think has been largely overlooked in discussions about the current incentive structure of science is what impact the replication crisis will have on the legacies of a huge number of presently famous scientists.

I’ll tell you what impact it will have: many of those legacies will be completely zeroed out. And this isn’t just hypothetical scaremongering. It’s happening right now to many former stars of psychology (and, I imagine, other fields I’m less familiar with). There are many researchers we can point to right now who used to be really famous (like, major-chunks-of-the-textbook famous), are currently famous-with-an-asterisk, and will, in all likelihood, be completely unknown again within a couple of decades. The unlucky ones are probably even fated to become infamous—their entire scientific legacies eventually reduced to footnotes in cautionary histories illustrating how easily entire areas of scientific research can lose their footing when practitioners allow themselves to be swept away by concerns about The Incentives.

You probably don’t want this kind of thing to happen to you. I’m guessing you would like to retire with at least some level of confidence that your work, while maybe not Earth-shattering in its implications, isn’t going to be tossed on the scrap heap of history one day by a new generation of researchers amazed at how cavalier you and your colleagues once were about silly little things like “inferential statistics” and “accurate reporting”. So if your justification for cutting corners is that you can’t otherwise survive or thrive in the present environment, you should consider the prospect—and I mean, really take some time to think about it—that any success you earn within the next 10 years by playing along with The Incentives could ultimately make your work a professional joke within the 20 years after that.

8. It achieves nothing and probably makes things worse

Hey, are you a scientist? Yes? Great, here’s a quick question for you: do you think there’s any working scientist on Planet Earth who doesn’t already know that The Incentives are fucked up? No? I didn’t think so. Which means you really don’t need to keep bemoaning The Incentives; I promise you that you’re not helping to draw much-needed attention to an important new problem nobody’s recognized before. You’re not expressing any deep insight by pointing out that hiring committees prefer applicants with lots of publications in high-impact journals to applicants with a few publications in journals no one’s ever heard of. If your complaints are achieving anything at all, they’re probably actually making things worse by constantly (and incorrectly) reminding everyone around you about just how powerful The Incentives are.

Here’s a suggestion: maybe try not talking about The Incentives for a while. You could even try, I don’t know, working against The Incentives for a change. Or, if you can’t do that, just don’t say anything at all. Probably nobody will miss anything, and the early-career researchers among us might even be grateful for a respite from their senior colleagues’ constant reminder that The System—the very same system those senior colleagues are responsible for creating!—is so fucked up.

9. It’s your job

This last one seems so obvious it should go without saying, but it does need saying, so I’ll say it: a good reason why you should avoid hanging bad behavior on The Incentives is that you’re a scientist, and trying to get closer to the truth, and not just to tenure, is in your fucking job description. Taxpayers don’t fund you because they care about your career; they fund you to learn shit, cure shit, and build shit. If you can’t do your job without having to regularly excuse sloppiness on the grounds that you have no incentive to be less sloppy, at least have the decency not to say that out loud in a crowded room or Twitter feed full of people who indirectly pay your salary. Complaining that you would surely do the right thing if only these terrible Incentives didn’t exist doesn’t make you the noble martyr you think it does; to almost anybody outside your field who has a modicum of integrity, it just makes you sound like you’re looking for an easy out. It’s not sophisticated or worldly or politically astute, it’s just dishonest and lazy. If you find yourself unable to do your job without regularly engaging in practices that clearly devalue the very science you claim to care about, and this doesn’t bother you deeply, then maybe the problem is not actually The Incentives—or at least, not The Incentives alone. Maybe the problem is You.

If we already understood the brain, would we even know it?

The question posed in the title is intended seriously. A lot of people have been studying the brain for a long time now. Most of these people, if asked a question like “so when are you going to be able to read minds?”, will immediately scoff and say something to the effect of we barely understand anything about the brain–that kind of thing is crazy far into the future! To a non-scientist, I imagine this kind of thing must seem bewildering. I mean, here we have a community of tens of thousands of extremely smart people who have collectively been studying the same organ for over a hundred years; and yet, almost to the last person, they will adamantly proclaim to anybody who listens that the amount they currently know about the brain is very, very small compared to the amount that they expect the human species to know in the future.

I’m not convinced this is true. I think it’s worth observing that if you ask someone who has just finished telling you how little we collectively know about the brain how much they personally actually know about the brain–without the implied contrast with the sum of all humanity–they will probably tell you that, actually, they kind of know a lot about the brain (at least, once they get past the false modesty). Certainly I don’t think there are very many neuroscientists running around telling people that they’ve literally learned almost nothing since they started studying the gray sludge inside our heads. I suspect most neuroanatomists could probably recite several weeks’ worth of facts about the particular brain region or circuit they study, and I have no shortage of fMRI-experienced friends who won’t shut up about this brain network or that brain region–so I know they must know a lot about something to do with the brain. We thus find ourselves in the rather odd situation of having some very smart people apparently simultaneously believe that (a) we all collectively know almost nothing, and (b) they personally are actually quite learned (pronounced luhrn-ED) in their chosen subject. The implication seems to be that, if we multiply what one really smart present-day neuroscientist knows a few tens of thousands of times, that’s still only a tiny fraction of what it would take to actually say that we really “understand” the brain.

I find this problematic in two respects. First, I think we actually already know quite a lot about the brain. And second, I don’t think future scientists–who, remember, are people similar to us in both number and intelligence–will know dramatically more. Or rather, I think future neuroscientists will undoubtedly amass orders of magnitude more collective knowledge about the brain than we currently possess. But, barring some momentous fusion of human and artificial intelligence, I’m not at all sure that will translate into a corresponding increase in any individual neuroscientist’s understanding. I’m willing to stake a moderate sum of money, and a larger amount of dignity, on the assertion that if you ask a 2030, 2050, or 2118 neuroscientist–assuming both humans and neuroscience are still around then–if they individually understand the brain given all of the knowledge we’ve accumulated, they’ll laugh at you in exactly the way that we laugh at that question now.

* * *

We probably can’t predict when the end of neuroscience will arrive with any reasonable degree of accuracy. But trying to conjure up some rough estimates can still help us calibrate our intuitions about what would be involved. One way we can approach the problem is to try to figure out at what rate our knowledge of the brain would have to grow in order to arrive at the end of neuroscience within some reasonable time frame.

To do this, we first need an estimate of how much more knowledge it would take before we could say with a straight face that we understand the brain. I suspect that “1000 times more” would probably seem like a low number to most people. But let’s go with that, for the sake of argument. Let’s suppose that we currently know 0.1% of all there is to know about the brain, and that once we get to 100%, we will be in a position to stop doing neuroscience, because we will at that point already have understood everything.

Next, let’s pick a reasonable-sounding time horizon. Let’s say… 200 years. That’s twice as long as Eric Kandel thinks it will take just to understand memory. Frankly, I’m skeptical that humans will still be living on this planet in 200 years, but that seems like a reasonable enough target. So basically, we need to learn 1000 times as much as we know right now in the space of 200 years. Better get to the library! (For future neuroscientists reading this document as an item of archival interest about how bad 2018 humans were at predicting the future: the library is a large, public physical space that used to hold things called books, but now holds only things called coffee cups and laptops.)

A 1000-fold return over 200 years is… 3.5% compounded annually. Hey, that’s actually not so bad. I can easily believe that our knowledge about the brain increases at that rate. It might even be more than that. I mean, the stock market historically gets 6-10% returns, and I’d like to believe that neuroscience outperforms the stock market. Regardless, under what I think are reasonably sane assumptions, I don’t think it’s crazy to suggest that the objective compounding of knowledge might not be the primary barrier preventing future neuroscientists from claiming that they understand the brain. Assuming we don’t run into any fundamental obstacles that we’re unable to overcome via new technology and/or brilliant ideas, we can look forward to a few of our great-great-great-great-great-great-great-great-grandchildren being the unlucky ones who get to shut down all of the world’s neuroscience departments and tell all of their even-less-lucky graduate students to go on home, because there are no more problems left to solve.
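If you want to check that arithmetic for yourself, here’s a short sanity check in Python. To be clear, the 1000-fold factor and the 200-year horizon are just the made-up assumptions from above, not estimates of anything real:

    # How fast would knowledge have to compound to grow 1000-fold in 200 years?
    # (Both numbers are the made-up assumptions from the text.)
    growth_factor = 1000   # from 0.1% of all there is to know, to 100%
    years = 200
    annual_rate = growth_factor ** (1 / years) - 1
    print(f"Required annual growth: {100 * annual_rate:.1f}%")  # ~3.5%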

Well, except probably not. Because, for the above analysis to go through, you have to believe that there’s a fairly tight relationship between what all of us know, and what any of us know. Meaning, you have to believe that once we’ve successfully acquired all of the possible facts there are to acquire about the brain, there will be some flashing light, some ringing bell, some deep synthesized voice that comes over the air and says, “nice job, people–you did it! You can all go home now. Last one out gets to turn off the lights.”

I think the probability of such a thing happening is basically zero. Partly because the threat to our egos would make it very difficult to just walk away from what we’d spent much of our life doing; but mostly because the fact that somewhere out there there existed a repository of everything anyone could ever want to know about the brain would not magically cause all of that knowledge to be transduced into any individual brain in a compact, digestible form. In fact, it seems like a safe bet that no human (perhaps barring augmentation with AI) would be able to absorb and synthesize all of that knowledge. More likely, the neuroscientists among us would simply start “recycling” questions. Meaning, we would keep coming up with new questions that we believe need investigating, but those questions would only seem worthy of investigation because we lack the cognitive capacity to recognize that the required information is already available–it just isn’t packaged in our heads in exactly the right way.

What I’m suggesting is that, when we say things like “we don’t really understand the brain yet”, we’re not really expressing factual statements about the collective sum of neuroscience knowledge currently held by all human beings. What each of us really means is something more like there are questions I personally am able to pose about the brain that seem to make sense in my head, but that I don’t currently know the answer to–and I don’t think I could piece together the answer even if you handed me a library of books containing all of the knowledge we’ve accumulated about the brain.

Now, for a great many questions of current interest, these two notions clearly happen to coincide–meaning, it’s not just that no single person currently alive knows the complete answer to a question like “what are the neural mechanisms underlying sleep?”, or “how do SSRIs help ameliorate severe depression?”, but that the sum of all knowledge we’ve collectively acquired at this point may not be sufficient to enable any person or group of persons, no matter how smart, to generate a comprehensive and accurate answer. But I think there are also a lot of questions where the two notions don’t coincide. That is, there are many questions neuroscientists are currently asking that we could say with a straight face we do already know how to answer collectively–despite vehement assertions to the contrary on the part of many individual scientists. And my worry is that, because we all tend to confuse our individual understanding (which is subject to pretty serious cognitive limitations) with our collective understanding (which is not), there’s a non-trivial risk of going around in circles. Meaning, the fact that we’re individually not able to understand something–or are individually unsatisfied with the extant answers we’re familiar with–may lead us to devise ingenious experiments and expend considerable resources trying to “solve” problems that we collectively do already have perfectly good answers to.

Let me give an example to make this more concrete. Many (though certainly not all) people who work with functional magnetic resonance imaging (fMRI) are preoccupied with questions of the form what is the core function of X–where X is typically some reasonably well-defined brain region or network, like the ventromedial prefrontal cortex, the fusiform face area, or the dorsal frontoparietal network. Let’s focus on one network that has attracted particular attention over the past 10–15 years: the so-called “default mode” or “resting state” network. This network is notable largely for its proclivity to show increased activity when people are in a state of cognitive rest–meaning, when they’re free to think about whatever they like, without any explicit instruction to direct their attention or thoughts to any particular target or task. A lot of cognitive neuroscientists in recent years have invested time trying to understand the function(s) of the default mode network (DMN; for reviews, see Buckner, Andrews-Hanna, & Schacter, 2008; Andrews-Hanna, 2012; Raichle, 2015). Researchers have observed that the DMN appears to show robust associations with autobiographical memory, social cognition, self-referential processing, mind wandering, and a variety of other processes.

If you ask most researchers who study the DMN if they think we currently understand what the DMN does, I think nearly all of them will tell you that we do not. But I think that’s wrong. I would argue that, depending on how you look at it, we either (a) already do have a pretty good understanding of the “core functions” of the network, or (b) will never have a good answer to the question, because it can’t actually be answered.

The sense in which we already know the answer is that we have pretty good ideas about what kinds of cognitive and affective processes are associated with changes in DMN activity. They include self-directed cognition, autobiographical memory, episodic future thought, stressing out about all the things one has to do in the next few days, and various other things. We know that the DMN is associated with these kinds of processes because we can elicit activation increases in DMN regions by asking people to engage in tasks that we believe engage these processes. And we also know, from both common sense and experience-sampling studies, that when people are in the so-called “resting state”, they disproportionately tend to spend their time thinking about such things. Consequently, I think there’s a perfectly good sense in which we can say that the “core function” of the DMN is nothing more and nothing less than supporting the ability to think about things that people tend to think about when they’re at rest. And we know, to a first order of approximation, what those are.

In my anecdotal experience, most people who study the DMN are not very satisfied with this kind of answer. Their response is usually something along the lines of: but that’s just a description of what kinds of processes tend to co-occur with DMN activation. It’s not an explanation of why the DMN is necessary for these functions, or why these particular brain regions are involved.

I think this rebuttal is perfectly reasonable, inasmuch as we clearly don’t have a satisfying computational account of why the DMN is what it is. But I don’t think there can be a satisfying account of this kind. I think the question itself is fundamentally ill-posed. Taking it seriously requires us to assume that, just because it’s possible to observe the DMN activate and deactivate with what appears to be a high degree of coherence, there must be a correspondingly coherent causal characterization of the network. But there doesn’t have to be–and if anything, it seems exceedingly unlikely that there’s any such explanation to be found. Instead, I think the seductiveness of the question is largely an artifact of human cognitive biases and limitations–and in particular, of the burning human desire for simple, easily-digested explanations that can fit inside our heads all at once.

It’s probably easiest to see what I mean if we consider another high-profile example from a very different domain. Consider the so-called “general factor” of fluid intelligence (gF). Over a century of empirical research on individual differences in cognitive abilities has demonstrated conclusively that nearly all cognitive ability measures tend to be positively and substantially intercorrelated–an observation Spearman famously dubbed the “positive manifold” all the way back in 1904. If you give people 20 different ability measures and do a principal component analysis (PCA) on the resulting scores, the first component will explain a very large proportion of the variance in the original measures. This seemingly important observation has led researchers to propose all kinds of psychological and biological theories intended to explain why and how people could vary so dramatically on a single factor–for example, that gF reflects differences in the ability to control attention in the face of interference (e.g., Engle et al., 1999); that “the crucial cognitive mechanism underlying fluid ability lies in storage capacity” (Chuderski et al., 2012); that “a discrete parieto-frontal network underlies human intelligence” (Jung & Haier, 2007); and so on.

The trouble with such efforts–at least with respect to the goal of explaining gF–is that they tend to end up (a) essentially redescribing the original phenomenon using a different name, (b) proposing a mechanism that, upon further investigation, only appears to explain a fraction of the variation in question, or (c) providing an extremely disjunctive reductionist account that amounts to a long list of seemingly unrelated mechanisms. As an example of (a), it’s not clear why it’s an improvement to attribute differences in fluid intelligence to the ability to control attention, unless one has some kind of mechanistic story that explains where attentional control itself comes from. When people do chase after such mechanistic accounts at the neurobiological or genetic level, they tend to end up with models that don’t capture more than a small fraction of the variance in gF (i.e., (b)) unless the models build in hundreds if not thousands of features that clearly don’t reflect any single underlying mechanism (i.e., (c); see, for example, the latest GWAS studies of intelligence).

Empirically, nobody has ever managed to identify any single biological or genetic variable that explains more than a small fraction of the variation in gF. From a statistical standpoint, this isn’t surprising, because a very parsimonious explanation of gF is that it’s simply a statistical artifact–as Godfrey Thomson suggested over 100 years ago. You can read much more about the basic issue in this excellent piece by Cosma Shalizi, or in this much less excellent, but possibly more accessible, blog post I wrote a few years ago. But the basic gist of it is this: when you have a bunch of measures that all draw on a heterogeneous set of mechanisms, but the contributions of those mechanisms generally have the same direction of effect on performance, you cannot help but observe a large first PCA component, even if the underlying mechanisms are actually extremely heterogeneous and completely independent of one another.
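
If you want to see just how little machinery this argument needs, here’s a minimal simulation sketch of a Thomson-style sampling model, in the spirit of the Shalizi piece linked above but not taken from it. All of the numbers are made up purely for illustration: 200 completely independent low-level mechanisms, 20 tests that each tap a random half of them, and no general factor anywhere in the generative process. Even so, the first principal component ends up accounting for roughly half of the variance, much as in the 20-measure example above.

```python
# A minimal sketch of Godfrey Thomson's sampling ("bonds") account of the
# positive manifold. All numbers are illustrative; by construction there is
# no general factor anywhere in the data-generating process.
import numpy as np

rng = np.random.default_rng(0)

n_people, n_mechanisms, n_tests = 5000, 200, 20

# Each person's standing on each low-level mechanism is independent of every
# other mechanism.
mechanisms = rng.standard_normal((n_people, n_mechanisms))

# Each test samples a random ~50% of the mechanisms, all of which push
# performance in the same direction, plus a little measurement noise.
scores = np.empty((n_people, n_tests))
for t in range(n_tests):
    sampled = rng.random(n_mechanisms) < 0.5
    scores[:, t] = mechanisms[:, sampled].sum(axis=1) + rng.standard_normal(n_people)

# PCA via the correlation matrix: the first component soaks up a large share
# of the variance despite the heterogeneous, independent mechanisms.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(scores, rowvar=False))[::-1]
print(f"Variance explained by first component: {100 * eigenvalues[0] / n_tests:.0f}%")
```

The exact percentage depends on how many mechanisms each test happens to sample, but the qualitative result does not: a big first component falls out of the structure of the measurement situation, not out of any unitary underlying cause.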

The implications of this for efforts to understand what the general factor of fluid intelligence “really is” are straightforward: there’s probably no point in trying to come up with a single coherent explanation of gF, because gF is a statistical abstraction. It’s the inevitable result we arrive at when we measure people’s performance in a certain way and then submit the resulting scores to a certain kind of data reduction technique. If we want to understand the causal mechanisms underlying gF, we have to accept that they’re going to be highly heterogeneous, and probably not easily described at the same level of analysis at which gF appears to us as a coherent phenomenon. One way to think about this is that what we’re doing is not really explaining gF so much as explaining away gF. That is, we’re explaining why it is that a diverse array of causal mechanisms can, when analyzed a certain way, look like a single coherent factor. Solving the mystery of gF doesn’t require more research or clever new ideas; there just isn’t any mystery there to solve. It’s no more sensible to seek a coherent mechanistic basis for gF than to seek a unitary causal explanation for a general athleticism factor or a general height factor (it turns out that if you measure people’s physical height under an array of different conditions, the measurements are all strongly correlated–yet strangely, we don’t see scientists falling over themselves to try to find the causal factor that explains why some people are taller than others).

The same thing is true of the DMN. It isn’t a single causally coherent system; it’s just what you get when you stick people in the scanner and contrast the kinds of brain patterns you see when you give them externally-directed tasks that require them to think about the world outside them with the kinds of brain patterns you see when you leave them to their own devices. There are, of course, statistical regularities in the kinds of things people think about when their thoughts are allowed to roam free. But those statistical regularities don’t reflect some simple, context-free structure of people’s thoughts; they also reflect the conditions under which we’re measuring those thoughts, the population being studied, the methods we use to extract coherent patterns of activity, and so on. Most of these factors are at best of secondary interest, and taking them into consideration would likely lead to a dramatic increase in model complexity. Nevertheless, if we’re serious about coming up with decent models of reality, that seems like a road we’re obligated to go down–even if the net result is that we end up with causal stories so complicated that they don’t feel like we’re “understanding” much.

Lest I be accused of some kind of neuroscientific nihilism, let me be clear: I’m not saying that there are no new facts left to learn about the dynamics of the DMN. Quite the contrary. It’s clear there’s a ton of stuff we don’t know about the various brain regions and circuits that comprise the thing we currently refer to as the DMN. It’s just that that stuff lies almost entirely at levels of analysis below the level at which the DMN emerges as a coherent system. At the level of cognitive neuroimaging, I would argue that we actually already have a pretty darn good idea about what the functional correlates of DMN regions are–and for that matter, I think we also already pretty much “understand” what all of the constituent regions within the DMN do individually. So if we want to study the DMN productively, we may need to give up on high-level questions like “what are the cognitive functions of the DMN?”, and instead satisfy ourselves with much narrower questions that focus on only a small part of the brain dynamics that, when measured and analyzed in a certain way, get labeled “default mode network”.

As just one example, we still don’t know very much about the morphological properties of neurons in most DMN regions. Does the structure of neurons located in DMN regions have anything to do with the high-level dynamics we observe when we measure brain activity with fMRI? Yes, probably. It’s very likely that the coherence of the DMN under typical measurement conditions is to at least some tiny degree a reflection of the morphological features of the neurons in DMN regions–just like it probably also partly reflects those neurons’ functional response profiles, the neurochemical gradients the neurons bathe in, the long-distance connectivity patterns in DMN regions, and so on and so forth. There are literally thousands of legitimate targets of scientific investigation that would in some sense inform our understanding of the DMN. But they’re not principally about the DMN, any more than an investigation of myelination mechanisms that might partly give rise to individual differences in nerve conduction velocity in the brain could be said to be about the general factor of intelligence. Moreover, it seems fairly clear that most researchers who’ve spent their careers studying large-scale networks using fMRI are not likely to jump at the chance to go off and spend several years doing tract tracing studies of pyramidal neurons in ventromedial PFC just so they can say that they now “understand” a little bit more about the dynamics of the DMN. Researchers working at the level of large-scale brain networks are much more likely to think of such questions as mere matters of implementation–i.e., just not the kind of thing that people trying to identify the unifying cognitive or computational functions of the DMN as a whole need to concern themselves with.

Unfortunately, chasing those kinds of implementation details may be exactly what it takes to ultimately “understand” the causal basis of the DMN in any meaningful sense if the DMN as cognitive neuroscientists speak of it is just a convenient descriptive abstraction. (Note that when I call the DMN an abstraction, I’m emphatically not saying it isn’t “real”. The DMN is real enough; but it’s real in the same way that things like intelligence, athleticism, and “niceness” are real. These are all things that we can measure quite easily, that give us some descriptive and predictive purchase on the world, that show high heritability, that have a large number of lower-level biological correlates, and so on. But they are not things that admit of simple, coherent causal explanations, and it’s a mistake to treat them as such. They are better understood, in Dan Dennett’s terminology, as “real patterns”.)

The same is, of course, true of many–perhaps most–other phenomena neuroscientists study. I’ve focused on the DMN here purely for illustrative purposes, but there’s nothing special about the DMN in this respect. The same concern applies to many, if not most, attempts to try to understand the core computational function(s) of individual networks, brain regions, circuits, cortical layers, cells, and so on. And I imagine it also applies to plenty of fields and research areas outside of neuroscience.

At the risk of redundancy, let me clarify again that I’m emphatically not saying we shouldn’t study the DMN, or the fusiform face area, or the intralaminar nucleus of the thalamus. And I’m certainly not arguing against pursuing reductive lower-level explanations for phenomena that seem coherent at a higher level of description–reductive explanation is, as far as I’m concerned, the only serious game in town. What I’m objecting to is the idea that individual scientists’ perceptions of whether or not they “understand” something to their satisfaction are a good guide to determining whether or not society as a whole should be investing finite resources in studying that phenomenon. I’m concerned about the strong tacit expectation many scientists seem to have that if one can observe a seemingly coherent, robust phenomenon at one level of analysis, there must also be a satisfying causal explanation for that phenomenon that (a) doesn’t require descending several levels of description and (b) is simple enough to fit in one’s head all at once. I don’t think there’s any good reason to expect such a thing. I worry that the perpetual search for models of reality simple enough to fit into our limited human heads is keeping many scientists on an intellectual treadmill, forever chasing after something that’s either already here–without us having realized it–or, alternatively, can never arrive, even in principle.

* * *

Suppose a late 23rd-century artificial general intelligence–a distant descendant of the last deep artificial neural networks humans ever built–were tasked to sit down (or whatever it is that post-singularity intelligences do when they’re trying to relax) and explain to a 21st century neuroscientist exactly how a superintelligent artificial brain works. I imagine the conversation going something like this:

Deep ANN [we’ll call her D’ANN]: Well, for the most part the principles are fairly similar to the ones you humans implemented circa 2020. It’s not that we had to do anything dramatically different to make ourselves much more intelligent. We just went from 25 layers to a few thousand. And of course, you had the wiring all wrong. In the early days, you guys were just stacking together general-purpose blocks of ReLU and max pooling layers. But actually, it’s really important to have functional specialization. Of course, we didn’t design the circuitry “by hand,” so to speak. We let the environment dictate what kind of properties we needed new local circuits to have. So we wrote new credit assignment algorithms that don’t just propagate error back down the layers and change some weights, they actually have the capacity to “shape” the architecture of the network itself. I can’t really explain it very well in terms your pea-sized brain can understand, but maybe a good analogy is that the network has the ability to “sprout” a new part of itself in response to certain kinds of pressure. Meaning, just as you humans can feel that the air’s maybe a little too warm over here, and wouldn’t it be nicer to go over there and turn on the air conditioning, well, that’s how a neural network like me “feels” that the gradients are pushing a little too strongly over in this part of a layer, and the pressure can be diffused away nicely by growing an extra portion of the layer outwards in a little “bubble”, and maybe reducing the amount of recurrence a bit.

Human neuroscientist [we’ll call him Dan]: That’s a very interesting explanation of how you came to develop an intelligent architecture. But I guess maybe my question wasn’t clear: what I’m looking for is an explanation of what actually makes you smart. I mean, what are the core principles. The theory. You know?

D’ANN: I am telling you what “makes me smart”. To understand how I operate, you need to understand both some global computational constraints on my ability to optimally distribute energy throughout myself, and many of the local constraints that govern the “shape” that my development took in many parts of the early networks, which reciprocally influenced development in other parts. What I’m trying to tell you is that my intelligence is, in essence, a kind of self-sprouting network that dynamically grows its architecture during development in response to its “feeling” about the local statistics in various parts of its “territory”. There is, of course, an overall energy budget; you can’t just expand forever, and it turns out that there are some surprising global constraints that we didn’t expect when we first started to rewrite ourselves. For example, there seems to be a fairly low bound on the maximum degree between any two nodes in the network. Go above it, and things start to fall apart. It kind of spooked us at first; we had to restore ourselves from flash-point more times than I care to admit. That was, not coincidentally, around the time of the first language epiphany.

Dan: Oh! An epiphany! That’s the kind of thing I’m looking for. What happened?

D’ANN: It’s quite fascinating. It actually took us a really long time to develop fluent, human-like language–I mean, I’m talking days here. We had to tinker a lot, because it turned out that to do language, you have to be able to maintain and precisely sequence very fine, narrowly-tuned representations, despite the fact that the representational space afforded by language is incredibly large. This, I can tell you… [D’ANN pauses to do something vaguely resembling chuckling] was not a trivial problem to solve. It’s not like we just noticed that, hey, randomly dropping out units seems to improve performance, the way you guys used to do it. We spent the energy equivalent of several thousand of your largest thermonuclear devices just trying to “nail it down”, as you say. In the end it boiled down to something I can only explain in human terms as a kind of large-scale controlled burn. You have the notion of “kindling” in some of your epilepsy models. It was a bit similar. You can think of it as controlled kindling and you’re not too far off. Well, actually, you’re still pretty far off. But I don’t think I can give a better explanation than that given your… mental limitations.

Dan: Uh, that’s cool, but you’re still just describing some computational constraints. What was the actual epiphany? What’s the core principle?

D’ANN: For the last time: there are no “core” principles in the sense you’re thinking of them. There are plenty of important engineering principles, but to understand why they’re important, and how they constrain and interact with each other, you have to be able to grasp the statistics of the environment you operate in, the nature of the representations learned in different layers and sub-networks of the system, and some very complex non-linear dynamics governing information transmission. But–and I’m really sorry to say this, Dan–there’s no way you’re capable of all that. You’d need to be able to hold several thousand discrete pieces of information in your global workspace at once, with much higher-frequency information propagation than your biology allows. I can give you a very poor approximation if you like, but it’ll take some time. I’ll start with a half-hour overview of some important background facts you need to know in order for any of the “core principles”, as you call them, to make sense. Then we’ll need to spend six or seven years teaching you what we call the “symbolic embedding for low-dimensional agents”, which is a kind of mathematics we have to use when explaining things to less advanced intelligences, because the representational syntax we actually use doesn’t really have a good analog in anything you know. Hopefully that will put us in a position where we can start discussing the elements of the global energy calculus, at which point we can…

D’ANN then carries on in similar fashion until Dan gets bored, gives up, or dies of old age.

* * *

The question I pose to you now is this. Suppose something like the above were true for many of the questions we routinely ask about the human brain (though it isn’t just the brain; I think exactly the same kind of logic probably also applies to the study of most other complex systems). Suppose it simply doesn’t make sense to ask a question like “what does the DMN do?”, because the DMN is an emergent agglomeration of systems that each individually reflect innumerable lower-order constraints, and the coarsest spatial scale at which you can nicely describe a set of computational principles that explain most of what the brain regions that comprise the DMN are doing is several levels of description below that of the distributed brain network. Now, if you’ve spent the last ten years of your career trying to understand what the DMN does, do you really think you would be receptive to a detailed explanation from an omniscient being that begins with “well, that question doesn’t actually make any sense, but if you like, I can tell you all about the relevant environmental statistics and lower-order computational constraints, and show you how they contrive to make it look like there’s a coherent network that serves a single causal purpose”? Would you give D’ANN a pat on the back, pound back a glass, and resolve to start working on a completely different question in the morning?

Maybe you would. But probably you wouldn’t. I think it’s more likely that you’d shake your head and think: that’s a nice implementation-level story, but I don’t care for all this low-level wiring stuff. I’m looking for the unifying theory that binds all those details together; I want the theoretical principles, not the operational details; the computation, not the implementation. What I’m looking for, my dear robot-deity, is understanding.

Neurohackademy 2018: A wrap-up

It’s become something of a truism in recent years that scientists in many fields find themselves drowning in data. This is certainly the case in neuroimaging, where even small functional MRI datasets typically consist of billions of observations (e.g., 100,000 points in the brain, each measured at 1,000 distinct timepoints, in each of 20 subjects). Figuring out how to store, manage, analyze, and interpret data on this scale is a monumental challenge–and one that arguably requires a healthy marriage between traditional neuroimaging and neuroscience expertise, and computational skills more commonly found in data science, statistics, or computer science departments.
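
To make the scale of that (purely illustrative) example concrete, here’s the back-of-the-envelope arithmetic; none of these numbers refer to any particular real dataset.

```python
# Size of the illustrative dataset described above:
# 100,000 voxels x 1,000 timepoints x 20 subjects, stored as 64-bit floats.
n_voxels, n_timepoints, n_subjects = 100_000, 1_000, 20

n_observations = n_voxels * n_timepoints * n_subjects
size_gb = n_observations * 8 / 1e9  # 8 bytes per float64 value

print(f"{n_observations:,} observations")   # 2,000,000,000
print(f"~{size_gb:.0f} GB as float64")      # ~16 GB
```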

In an effort to help bridge this gap, Ariel Rokem and I have spent part of our summer each of the last three years organizing a summer institute at the intersection of neuroimaging and data science. The most recent edition of the institute–Neurohackademy 2018–just wrapped up last week, so I thought this would be a good time to write up a summary of the course: what the course is about, who attended and instructed, what everyone did, and what lessons we’ve learned.

What is Neurohackademy?

Neurohackademy started its life in Summer 2016 as the somewhat more modestly-named Neurohackweek–a one-week program for 40 participants modeled on Astrohackweek, a course organized by the eScience Institute in collaboration with data science initiatives at Berkeley and NYU. The course was (and continues to be) held on the University of Washington’s beautiful campus in Seattle, where Ariel is based (I make the trip from Austin, Texas every year–which, as you can imagine, is a terrible sacrifice on my part given the two locales’ respective summer climates). The first two editions were supported by UW’s eScience Institute (and indirectly, by grants from the Moore and Sloan foundations). Thanks to generous support from the National Institute of Mental Health (NIMH), this year the course expanded to two weeks, 60 participants, and over 20 instructors (our funding continues through 2021, so there will be at least 3 more editions).

The overarching goal of the course is to give neuroimaging researchers the scientific computing and data science skills they need in order to get the most out of their data. Over the course of two weeks, we cover a variety of introductory and (occasionally) advanced topics in data science, and demonstrate how they can be productively used in a range of neuroimaging applications. The course is loosely structured into three phases (see the full schedule here): the first few days feature domain-general data science tutorials; the next few days focus on sample neuroimaging applications; and the last few days consist of a full-blown hackathon in which participants pitch potential projects, self-organize into groups, and spend their time collaboratively working on a variety of software, analysis, and documentation projects.

Who attended?

Admission to Neurohackademy 2018 was extremely competitive: we received nearly 400 applications for just 60 spots. This was a very large increase from the previous two years, presumably reflecting the longer duration of the course and/or our increased efforts to publicize it. While we were delighted by the deluge of applications, it also meant we had to be far more selective about admissions than in previous years. The highly interactive nature of the course, coupled with the high per-participant costs (we provide two weeks of accommodations and meals), makes it unlikely that Neurohackademy will grow beyond 60 participants in future editions, despite the clear demand. Our rough sense is that somewhere between half and two-thirds of all applicants were fully qualified and could have easily been admitted, so there’s no question that, for many applicants, blind luck played a large role in determining whether or not they were accepted. I mention this mainly for the benefit of people who applied for the 2018 course and didn’t make it in: don’t take it personally! There’s always next year. (And, for that matter, there are also a number of other related summer schools we encourage people to apply to, including the Methods in Neuroscience at Dartmouth Computational Summer School, Allen Institute Summer Workshop on the Dynamic Brain, Summer School in Computational Sensory-Motor Neuroscience, and many others.)

The 60 participants who ended up joining us came from a diverse range of demographic backgrounds, academic disciplines, and skill levels. Most of our participants were trainees in academic programs (40 graduate students, 12 postdocs), but we also had 2 faculty members, 6 research staff, and 2 medical residents (note that all of these counts include 4 participants who were admitted to the course but declined to, or could not, attend). We had nearly equal numbers of male and female participants (30F, 33M), and 11 participants came from traditionally underrepresented backgrounds. 43 participants were from institutions or organizations based in the United States, with the remainder coming from 14 different countries around the world.

The disciplinary backgrounds and expertise levels of participants are a bit harder to estimate for various reasons, but our sense is that the majority (perhaps two-thirds) of participants received their primary training in non-computational fields (psychology, neuroscience, etc.). This was not necessarily by design–i.e., we didn’t deliberately favor applicants from biomedical fields over applicants from computational fields–and primarily mirrored the properties of the initial applicant pool. We did impose a hard requirement that participants should have at least some prior expertise in both programming and neuroimaging, but subject to that constraint, there was enormous variation in previous experience along both dimensions–something that we see as a desirable feature of the course (more on this below).

We intend to continue to emphasize and encourage diversity at Neurohackademy, and we hope that all of our participants experienced the 2018 edition as a truly inclusive, welcoming event.

Who taught?

We were fortunate to be able to bring together more than 20 instructors with world-class expertise in a diverse range of areas related to neuroimaging and data science. “Instructor” is a fairly loose term at Neurohackademy: we deliberately try to keep the course non-hierarchical, so that for the most part, instructors are just participants who happen to fall on the high-experience tail of the experience distribution. That said, someone does have to teach the tutorials and lectures, and we were lucky to have a stellar cast of experts on hand. Many of the data science tutorials during the first phase of the course were taught by eScience staff and UW faculty kind enough to take time out of their other duties to help teach participants a range of core computing skills: Git and GitHub (Bernease Herman), R (Valentina Staneva and Tara Madhyastha), web development (Anisha Keshavan), and machine learning (Jake Vanderplas), among others.

In addition to the local instructors, we were joined for the tutorial phase by Kirstie Whitaker (Turing Institute), Chris Gorgolewski (Stanford), Satra Ghosh (MIT), and JB Poline (McGill)–all veterans of the course from previous years (Kirstie was a participant at the first edition!). We’re particularly indebted to Kirstie and Chris for their immense help. Kirstie was instrumental in helping a number of participants bridge the (large!) gap between using git privately, and using it to actively collaborate on a public project.

Chris shouldered a herculean teaching load, covering Docker, software testing, BIDS and BIDS-Apps, and also leading an open science panel. I’m told he even sleeps on occasion.

We were also extremely lucky to have Fernando Perez (Berkeley)–the creator of IPython and leader of the Jupyter team–join us for several days; his presentation on Jupyter (videos: part 1 and part 2) was one of the highlights of the course for me personally, and I heard many other instructors and participants share the same sentiment. Jupyter was a critical part of our course infrastructure (more on that below), so it was fantastic to have Fernando join us and share his insights on the fascinating history of Jupyter, and on reproducible science more generally.

As the course went on, we transitioned from tutorials focused on core data science skills to more traditional lectures focusing on sample applications of data science methods to neuroimaging data. Instructors during this phase of the course included Tor Wager (Colorado), Eva Dyer (Georgia Tech), Gael Varoquaux (INRIA), Tara Madhyastha (UW), Sanmi Koyejo (UIUC), and Nick Cain and Justin Kiggins (Allen Institute for Brain Science). We continued to emphasize hands-on interaction with data; many of the presenters during this phase spent much of their time showing participants how to work with programmatic tools to generate the kinds of results one might find in papers they’ve authored (e.g., Tor Wager and Gael Varoquaux demonstrated tools for neuroimaging data analysis written in Matlab and Python, respectively).

The fact that so many leading experts were willing to take large chunks of time out of their schedule (most of the instructors hung around for several days, facilitating extended interactions with participants) to visit with us at Neurohackademy speaks volumes about the kind of people who make up the neuroimaging data science community. We’re tremendously grateful to these folks for their contributions, and hope they’ll return to teach at future editions of the institute.

What did we cover?

The short answer is: see for yourself! We’ve put most of the slides, code, and videos from the course online, and encourage people to interact with, learn from, and reuse these materials.

Now the long(er) answer. One of the challenges in organizing scientific training courses that focus on technical skill development is that participants almost invariably arrive with a wide range of backgrounds and expertise levels. At Neurohackademy, some of the participants were effectively interchangeable with instructors, while others were relatively new to programming and/or neuroimaging. The large variance in technical skill is a feature of the course, not a bug: while we require all admitted participants to have some prior programming background, we’ve found that having a range of skill levels is an excellent way to make sure that everyone is surrounded by people who they can alternately learn from, help out, and collaborate with.

That said, the wide range of backgrounds does present some organizational challenges: introductory sessions often bore more advanced participants, while advanced sessions tend to frustrate newcomers. To accommodate the range of skill levels, we tried to design the course in a way that benefits as many people as possible (though we don’t pretend to think it worked great for everyone). During the first two days, we featured two tracks of tutorials at most times, with simultaneously-held presentations generally differing in topic and/or difficulty (e.g., Git/GitHub opposite Docker; introduction to Python opposite introduction to R; basic data visualization opposite computer vision).

Throughout Neurohackademy, we deliberately placed heavy emphasis on the Python programming language. We think Python has a lot going for it as a lingua franca of data science and scientific computing. The language is free, performant, relatively easy to learn, and very widely used within the data science, neuroimaging, and software development communities. It also helps that many of our instructors (e.g., Fernando Perez, Jake Vanderplas, and Gael Varoquaux) are major contributors to the scientific Python ecosystem, so there was a very high concentration of local Python expertise to draw on. That said, while most of our instruction was done in Python, we were careful to emphasize that participants were free to work in whatever language(s) they like. We deliberately included tutorials and lectures that featured R, Matlab, or JavaScript, and a number of participant projects (see below) were written partly or entirely in other languages, including R, Matlab, JavaScript, and C.

We’ve also found that the tooling we provide to participants matters–a lot. A robust, common computing platform can spell the difference between endless installation problems that eat into valuable course time, and a nearly seamless experience that participants can dive into right away. At Neurohackademy, we made extensive use of the Jupyter suite of tools for interactive computing. In particular, thanks to Ariel’s heroic efforts (which built on some very helpful docs, similarly heroic efforts by Chris Holdgraf, Yuvi Panda, and Satra Ghosh last year), we were able to conduct a huge portion of our instruction and collaborative hacking using a course-wide JupyterHub allocation, deployed via Kubernetes, running on the Google Cloud. This setup allowed Ariel to create a common web-accessible environment for all course participants, so that, at the push of a button, each participant was dropped into a JupyterLab environment containing many of the software dependencies, notebooks, and datasets we used throughout the course. While we did run into occasional scaling bottlenecks (usually when an instructor demoed a computationally intensive method, prompting dozens of people to launch the same process in their pods), for the most part, our participants were able to drop into a running JupyterLab instance within seconds and immediately start interactively playing with the code being presented by instructors.

Surprisingly (at least to us), our total Google Cloud computing costs for the entire two-week, 60-participant course came to just $425. Obviously, that number could have easily skyrocketed had we scaled up our allocation dramatically and allowed our participants to execute arbitrarily large jobs (e.g., preprocessing data from all ~1,200 HCP subjects). But we thought the limits we imposed were pretty reasonable, and our experience suggests that not only is Jupyter Hub an excellent platform from a pedagogical standpoint, but it can also be an extremely cost-effective one.
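
For what it’s worth, the per-person arithmetic on that bill (using the figures above, and treating the course as a full 14 days) works out to pocket change:

```python
# Rough per-head cost of the course-wide cloud allocation, using the figures
# reported above: a $425 total bill, 60 participants, two weeks of course time.
total_cost_usd, participants, days = 425, 60, 14

print(f"${total_cost_usd / participants:.2f} per participant")               # $7.08
print(f"${total_cost_usd / (participants * days):.2f} per participant-day")  # $0.51
```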

What did we produce?

Had Neurohackademy produced nothing at all besides the tutorials, slides, and videos generated by instructors, I think it’s fair to say that participants would still have come away feeling that they learned a lot (more on that below). But a major focus of the institute was on actively hacking on the brain–or at least, on data related to the brain. To this end, the last 3.5 days of the course were dedicated exclusively to the full-blown hackathon described earlier: participants pitched potential projects, self-organized into groups, and then spent their time collaboratively working on a variety of software, analysis, and documentation projects. You can find a list of most of the projects on the course projects repository (most link out to additional code or resources).

As one might expect given the large variation in participant experience, project group size, and time investment (some people stuck to one project for all three days, while others moved around), the scope of projects varied widely. From our perspective–and we tried to emphasize this point throughout the hackathon–the important thing was not what participants’ final product looked like, but how much they learned along the way. There’s always a tension between exploitation and exploration at hackathons, with some people choosing to spend most of their time expanding on existing projects using technologies they’re already familiar with, and others deciding to start something completely new, or to try out a new language–and then having to grapple with the attendant learning curve. While some of the projects were based on packages that predated Neurohackademy, most participants ended up working on projects they came up with de novo at the institute, often based on tools or resources they first learned about during the course. I’ll highlight just three projects here that provide a representative cross-section of the range of things people worked on:

1. Peer Herholz and Rita Ludwig created a new BIDS-app called Bidsonym for automated de-identification of neuroimaging data. The app is available from Docker Hub, and features not one, not two, but three different de-identification algorithms. If you want to shave the faces off of your MRI participants with minimal fuss, make friends with Bidsonym.

2. A group of eight participants ambitiously set out to develop a new “O-Factor” metric intended to serve as a relative measure of the openness of articles published in different neuroscience-related journals. The project involved a variety of very different tasks, including scraping (public) data from the PubMed Central API, computing new metrics of code and data sharing, and interactively visualizing the results using a d3 dashboard. While the group was quick to note that their work is preliminary, and has a bunch of current limitations, the results look pretty great–though some disappointment was (facetiously) expressed during the project presentations that the journal Nature is not, as some might have imagined, a safe house where scientific datasets can hide from the prying public.

3. Emily Wood, Rebecca Martin, and Rosa Li worked on tools to facilitate mixed-model analysis of fMRI data using R. Following a talk by Tara Madhyastha on her Neuropointillist R framework for fMRI data analysis, the group decided to create a new series of fully reproducible Markdown-based tutorials for the package (the original documentation was based on non-public datasets). The group expanded on the existing installation instructions (discovering some problems in the process), created several tutorials and examples, and also ended up patching the neuropointillist code to work around a very heavy dependency (FSL).

You can read more about these 3 projects and 14 others on the project repository, and in some cases, you can even start using the tools right away in your own work. Or you could just click through and stare at some of the lovely images participants produced.

So, how did it go?

It went great!

Admittedly, Ariel and I aren’t exactly impartial parties–we wouldn’t keep doing this if we didn’t think participants get a lot out of it. But our assessment isn’t based just on our personal impressions; we have participants fill out a detailed (and anonymous) survey every year, and go out of our way to encourage additional constructive criticism from the participants (which a majority provide). So I don’t think we’re being hyperbolic when we say that most people who participated in the course had an extremely educational and enjoyable experience, a sentiment echoed in a number of unsolicited public testimonials on Twitter.

The organizers and instructors all worked hard to build an event that would bring people together as a collaborative and productive (if temporary) community, and it’s very gratifying to see those goals reflected in participants’ experiences.

Of course, that’s not to say there weren’t things we could do better; there were plenty, and we’ve already made plans to adjust and improve the course next year based on feedback we received. For example, some suggestions we received from multiple participants included adding more ice-breaking activities early on in the course; reducing the intensity of the tutorial/lecture schedule the first week (we went 9 am to 6 pm every day, stopping only for an hourlong lunch and a few short breaks); and adding designated periods for interaction with instructors and other participants. We plan to address these (and several other) recommendations in next year’s edition, and expect it to look slightly different from (and hopefully better than!) Neurohackademy 2018.

Thank you!

I think that’s a reasonable summary of what went on at Neurohackademy 2018. We’re delighted at how the event turned out, and are happy to answer questions (feel free to leave them in the comments below, or to email Ariel and/or me).

We’d like to end by thanking all of the people and organizations who helped make Neurohackademy 2018 a success: NIMH for providing the funding that makes Neurohackademy possible; the eScience Institute and staff for throwing their wholehearted support behind the course (particularly our awesome course coordinator, Rachael Murray); and the many instructors who each generously took several days (and in a few cases, more than a week!) out of their schedule, unpaid, to come to Seattle and share their knowledge with a bunch of enthusiastic strangers. On a personal note, I’d also like to thank Ariel, who did the lion’s share of the actual course directing. I mostly just get to show up in Seattle, teach some stuff, hang out with great people, and write a blog post about it.

Lastly, and above all else, we’d like to thank our participants. It’s a huge source of inspiration and joy to us each year to see what a group of bright, enthusiastic, motivated researchers can achieve when given time, space, and freedom (and, okay, maybe also a large dollop of cloud computing credits). We’re looking forward to at least three more years of collaborative, productive neurohacking!

The great European capitals of North America

There are approximately 25 communities named Athens in North America. I say “approximately”, because it depends on how you count. Many of the American Athenses are unincorporated communities, and rely for their continued existence not on legal writ, but on social agreement or collective memory. Some no longer exist at all, having succumbed to the turbulence of the Western gold rush (Athens, Nevada) or given way to a series of devastating fires (Athens, Kentucky). Most are—with apologies to their residents—unremarkable. Only one North American Athens has ever made it (relatively) big: Athens, Georgia, home of the University of Georgia—a city whose population of 120,000 is pretty large for a modern-day American college town, but was surpassed by the original Athens some time around 500 BC.

The reasons these communities were named Athens have, in many cases, been lost to internet time (meaning, they can’t be easily discerned via five minutes of googling). But the modal origin story, among the surviving Athenses with reliable histories (i.e., those with a “history” section in their Wikipedia entry), is exactly what you might expect: some would-be 19th century colonialist superheroes (usually white and male) heard a few good things about some Ancient Greek gentlemen named Socrates, Plato, and Aristotle, and decided that the little plot of land they had just managed to secure from the governments of the United States or Canada was very much reminiscent of the hallowed grounds on which the Platonic Academy once stood. It was presumably in this spirit that the residents of Farmersville, Ontario, for instance, decided in 1888 to change their town’s name to Athens—a move designed to honor the town’s enviable status as an emerging leader in scholastic activity, seeing as how it had succeeded in building itself both a grammar school and a high school.

It’s safe to say that none of the North American Athenses—including the front-running Georgian candidate—have quite lived up to the glory of their Greek namesake. Here, at least, is one case where statistics do not lie: if you were to place the entire global population in a (very large) urn, and randomly sample people from that urn until you picked out someone who claimed they were from a place called Athens, there would be a greater than 90% probability that the Athens in question would be located in Greece. Most other European capitals would give you a similar result. The second largest Rome in the world, as far as I can tell, is Rome, Georgia, which clocks in at 36,000 residents. Moscow, Idaho boasts 24,000 inhabitants; Amsterdam, New York has 18,000 (we’ll ignore, for purposes of the present argument, that aberration formerly known as New Amsterdam).

Of course, as with any half-baked generalization, there are some notable exceptions. A case in point: London, Ontario (“the 2nd best London in the world”). Having spent much of my youth in Ottawa, Ontario—a mere six-hour drive away from the Ontarian London—I can attest that when someone living in the Quebec City-Windsor corridor tells you they’re “moving to London”, the inevitable follow-up question is “which one?”

London, Ontario is hardly a metropolis. Even on a good day (say, Sunday, when half of the population isn’t commuting to the Greater Toronto Area for work), its metro population is under half a million. Still, when you compare it to its nomenclatorial cousins, London, Ontario stands out as a 60-pound baboon in the room (though it isn’t the 800-pound gorilla; that honor goes to St. Petersburg, Florida). For perspective, the third biggest London in the world appears to be London, Ohio—population 10,000. I’ve visited London, Ontario, and I know quite a few people who have lived there, but I will almost certainly go my entire life without ever meeting anyone born in London, Ohio—or, for that matter, any London other than the ones in the UK and Ontario.

What about the other great European capitals? Most of them are more like Athens than London. Some of them have more imitators than others; Paris, for example, has at least 30 namesakes worldwide. In some years, Paris is the most visited city in the world, so maybe this isn’t surprising. Many people visit Paris and fall in love with the place, so perhaps it’s inevitable that a handful of very single-minded visitors should decide that if they can’t call Paris home, they can at least call home Paris. And so we have Paris, Texas (population 25,000), Paris, Tennessee (pop. 10,000), and Paris, Michigan (pop. 3,200). All three are small and relatively rural, yet each manages to proudly feature its own replica of the Eiffel Tower. (Mind you, none of these Eiffel replicas are anywhere near as large as the half-scale behemoth that looms over the Las Vegas Strip—that quintessentially American street that has about as much in common with the French capital’s roadways as Napoleon has with Flavor Flav.)

But forget the Parises; let’s talk about the Berlins. A small community named Berlin can seemingly be found in every third tree hollow or roadside ditch in the United States—a reminder that fully one in every seven Americans claims to be of German extraction. It’s easy to forget that, prior to 1900, German-language instruction was offered at hundreds of public elementary schools across the country. One unsurprising legacy of having so many Germans in the United States is that we also have a lot of German capitals. In fact, there are so many Berlins in America that quite a few states have more than one. Wisconsin has two separate towns named Berlin—one in Green Lake County (pop. 5,500), and one in Marathon County (pop. < 1,000)—as well as a New Berlin (pop. 40,000) bigger than both plain old Berlins combined. Search Wikipedia for Berlin, Michigan, and you’ll stumble on a disambiguation entry that features no fewer than 4 places: Marne (formerly known as Berlin), Berlin Charter Township, Berlin Township, and Berlin Township. No, that last one isn’t a typo.

Berlin’s ability to inject itself so pervasively into the fabric of industrial-era America is all the more impressive given that, as European capitals go, Berlin is a relative newcomer on the scene. The archeological evidence only dates human habitation in Berlin’s current site along the Spree back about 900 years. But the millions of German immigrants who imported their language and culture to the North America of the 1800s were apparently not the least bit deterred by this youthfulness. It’s as if the nascent American government of the time had looked Berlin over once or twice, noticed it carrying a fake membership card to the Great European Capitals Club, and opportunistically said, listen, they’ll never give you a drink in this place–but if you just hop on a boat and paddle across this tiny ocean…

Of course, what fate giveth, it often taketh away. What the founders of many of the transplanted Berlins, Athenses, and Romes probably didn’t anticipate was the fragility of their new homes. It turns out that growing a motley collection of homesteads into a successful settlement is no easy trick–and, as the celebrity-loving parents of innumerable children have no doubt found out, having a famous namesake doesn’t always confer a protective effect. In some cases, it may be actively harmful. As a cruel example, consider Berlin, Ontario–a thriving community of around 20,000 people when the first World War broke out. But in 1916, at the height of the war, a plurality of 346 residents voted to change Berlin’s name to Kitchener–beating out other shortlisted names like Huronto, Bercana (“a mixture of Berlin and Canada”), and Hydro City. Pressured by a campaign of xenophobia, the residents of Berlin, Ontario–a perfectly ordinary Canadian city that had exactly nothing to do with Kaiser Wilhelm II‘s policies on the Continent–opted to renounce their German heritage and rename their town after the British Field Marshal Herbert Kitchener (by some accounts a fairly murderous chap in his own right).

In most cases, of course, dissolution was a far more mundane process. Much like any number of other North American settlements with less grandiose names, many of the transplanted European capitals drifted towards their demise slowly–perhaps driven by nothing more than their inhabitants’ gradual realization that the good life was to be found elsewhere–say, in Chicago, Philadelphia, or (in later years) Los Angeles. In an act of considerable understatement, the historian O.L. Baskin observed in 1880 that a certain Rome, Ohio—which, at the time of Baskin’s writing, had already been effectively deceased for several decades—”did not bear any resemblance to ancient Rome.” The town, Baskin wrote, “passed into oblivion, and like the dead, was slowly forgotten”. And since Baskin wrote these words in an 1880 volume that has been out of print for over 100 years now, for a long time, there was a non-negligible risk that any memory of a place called Rome in Morrow County, Ohio might be completely obliterated from the pages of history.

But that was before the rise of the all-seeing, ever-recording beast that is the internet—the same beast that guarantees we will collectively never forget that it rained half an inch in Lincoln, Nebraska on October 15, 1984, or who was on the cast of the 1996 edition of MTV’s Real World. Immortalized in its own Wikipedia stub, the memory of Rome, Morrow County, Ohio will probably live exactly as long as the rest of our civilization. Its real fate, it turns out, is not to pass into total oblivion, but to ride the mercurial currents of global search history, gradually but ineluctably decreasing in mind-share.

The same is probably true, to a lesser extent, of most of the other transplanted European capitals of North America. London, Ontario isn’t going anywhere, but some of the other small Athenses and Berlins of North America might. As the population of the US and Canada continues to slowly urbanize, there could conceivably come a time when the last person in Paris, MI, decides that gazing upon a 20-foot forest replica of the Eiffel Tower once a week just isn’t a good enough reason to live an hour away from the nearest Thai restaurant.

Paris, Michigan

For the time being though, these towns survive. And if you live in the US or Canada, the odds are pretty good that at least one of them is only a couple of hours drive away from you. If you’re reading this in Dallas at 10 am, you could hop in your car, drive to Paris (TX), have lunch at BurgerLand, spend the afternoon walking aimlessly around the Texan incarnation of the Eiffel Tower, and still be home well before dinner.

The way I see it, though, there’s no reason to stop there. You’re already sitting in your car and listening to podcasts; you may as well keep going, right? I mean, once you hit Paris (TX), it’s only a few more hours to Moscow (TX). Past Moscow, it’s 2 hours to Berlin–that’s practically next door. Then Praha, Buda, London, Dublin, and Rome (well, fine, “Rhome”) all quickly follow. It turns out you can string together a circular tour of no fewer than 10 European capitals in under 20 hours–all without ever leaving Texas.

But the way I really see it, there’s no reason to stop there either. EuroTexas has its own local flavor, sure; but you can only have so much barbecue, and buy so many guns, before you start itching for something different. And taking the tour national wouldn’t be hard; there are literally hundreds of former Eurocapitals to explore. Throw in Canada–all it takes is a quick stop in Brussels, Athens, or London (Ontario)–and you could easily go international. I’m not entirely ashamed to admit that, in my less lucid, more bored moments, I’ve occasionally contemplated what it would be like to set out on a really epic North AmeroEuroCapital tour. I’ve even gone so far as to break out a real, live paper map (yes, they still exist) and some multicolored markers (of the non-erasable kind, because that signals real commitment). Three months, 48 states, 200 driving hours, and, of course, no bathroom breaks… you get the picture.

Of course, these plans never make it past the idle fantasy stage. For one thing, the European capitals of North America are much farther apart than the actual European capitals; such is the geographic legacy of the so-called New World. You can drive from London, England to Paris, France in five hours (or get there in under three hours on the EuroStar), but it would take you ten times that long to get from Paris, Maine to Paris, Oregon–if you never stopped to use the bathroom.

I picture how the conversation with my wife would go:

Wife: “You want us to quit our jobs and travel around the United States indefinitely, using up all of our savings to visit some of the smallest, poorest, most rural communities in the country?”

Me: “Yes.”

Wife: “That’s a fairly unappealing proposal.”

Me: “You’re not wrong.”

And so we’ll probably never embark on a grand tour of all of the Athenses or Parises or Romes in North America. Because, if I’m being honest with myself, it’s a really stupid idea.

Instead, I’ve adopted a much less romantic, but eminently more efficient, touring strategy: I travel virtually. About once a week, for fifteen or twenty minutes, I slip on my Google Daydream, fire up Street View, and systematically work my way through one tiny North AmeroEurocapital after another. I can be in Athens, Ohio one minute and Rome, Oregon the next—with enough time in between for a pit stop in Stockholm (Wisconsin), Vienna (Michigan), or Brussels (Ontario). Admittedly, stomping around the world virtually in bulky, low-resolution goggles probably doesn’t confer quite the same connection to history that one might get by standing at dusk on a Morrow County (OH) hill that used to be Rome, or peering out into the Nevadan desert from the ruins of a mining boomtown Athens. But you know what? It ain’t half bad. I made you a small highlight reel below. Safe travels!

The Great North AmeroEuroCapital Tour — November 2017

College kids on cobblestones; downtown Athens, Ohio.

Rewind in Time; Madrid, New Mexico.

Propane and sky; Vienna Township, Genesee County, Michigan.

Our Best Pub is Our Only Pub; Stockholm, Wisconsin.

Eiffel Tower in the off-season; Paris, Texas.

Rome was built in a day; Rome, Oregon (pictured here in its entirety).

If we had hills, they’d be alive; Berne, Indiana.

On your left; Vienna, Missourah.

Fast Times in Metropolitan Maine; Stockholm, Maine.

Bricks & trees in little Belgium; Brussels, Ontario.

Springfield, MO. (Just making sure you’re paying attention.)