Coyne on adaptive rumination theory (again)

A while ago I wrote about Andrews and Thomson’s adaptive rumination hypothesis (ARH) of depression, which holds that depression is an evolutionary adaptation designed to help us solve difficult problems. I linked to two critiques (1, 2) of ARH by Jerry Coyne, who is clearly no fan of ARH. Coyne’s now taken his argument to the pages of Psychiatric Times, where he tears ARH to shreds for a third time. The main thrust of Coyne’s argument is that Andrews and Thomson employ a colloquial definition of adaptation (i.e., something that’s useful) rather than the more appropriate evolutionary definition:

Andrews and Thomson consider depression an “adaptation” because it supposedly helps the sufferer solve problems. But an evolutionary adaptation is more than something that is merely useful. Biologists consider a trait adaptive only if that behavior, and the genes producing it, enhance an individual’s fitness—the average lifetime output of offspring. It is this genetic advantage, and the evolutionary changes in behavior it promotes, that is the essence of adaptation by natural selection. To demonstrate that depression is an evolved adaptation, then, we must show that it enhances reproduction.

Andrews and Thomson don’t do this, or even try. And if they did try, they probably wouldn’t succeed, for everything we know about depression suggests that rather than enhancing fitness, it reduces it. The most obvious issue is suicide, a word that, curiously, does not appear in Andrews and Thomson’s text. Statistics show that those with major depression are 20 times more likely to kill themselves than are individuals in the general population. Evolutionarily speaking, this is a strong selective penalty. Depression also appears to reduce libido and may make one unattractive as a sexual partner. Andrews and Thomson point out depression’s “adverse effect on women’s fertility and the outcome of pregnancy.” Other health problems are comorbid with depression, although it’s not clear whether depression is the cause or consequence of these problems. Finally, studies show that depressed mothers provide poorer care of their children.

As Coyne notes, this is a problem not only for ARH, but also for a number of other evolutionary psychological accounts of depression–essentially, all those theories that posit that the depressive state itself is adaptive (as opposed to balancing selection/heterozygote advantage models which allow for the possibility that some genes that contribute to depression may be selected for under the right circumstances, without implying that depression itself is advantageous).

academic bloggers on blogging

Is it wise for academics to blog? Depends on who you ask. Scott Sumner summarizes his first year of blogging this way:

Be careful what you wish for.  Last February 2nd I started this blog with very low expectations.  During the first three weeks most of the comments were from Aaron Jackson and Bill Woolsey.  I knew I wasn’t a good writer, years ago I got a referee report back from an anonymous referee (named McCloskey) who said “if the author had used no commas at all, his use of commas would have been more nearly correct.”  Ouch!  But it was true, others said similar things.  And I was also pretty sure that the content was not of much interest to anyone.

Now my biggest problem is time—I spend 6 to 10 hours a day on the blog, seven days a week.  Several hours are spent responding to reader comments and the rest is spent writing long-winded posts and checking other economics blogs.  And I still miss many blogs that I feel I should be reading. …

Regrets?  I’m pretty fatalistic about things.  I suppose it wasn’t a smart career move to spend so much time on the blog.  If I had ignored my commenters I could have had my manuscript revised by now. …  And I really don’t get any support from Bentley, as far as I know the higher ups don’t even know I have a blog. So I just did 2500 hours of uncompensated labor.

I don’t think Sumner actually regrets blogging (as the rest of his excellent post makes clear), but he does seem to think it’s hurt him professionally in some ways–most notably, because of all the time he spends blogging that he could be doing something else (like revising that manuscript).

Andrew Gelman has a very different take:

I agree with Sethi that Sumner’s post is interesting and captures much of the blogging experience. But I don’t agree with that last bit about it being a bad career move. Or perhaps Sumner was kidding? (It’s notoriously difficult to convey intonation in typed speech.) What exactly is the marginal value of his having a manuscript revised? It’s not like Bentley would be compensating him for that either, right? For someone like Sumner (or, for that matter, Alex Tabarrok or Tyler Cowen or my Columbia colleague Peter Woit), blogging would seem to be an excellent career move, both by giving them and their ideas much wider exposure than they otherwise would’ve had, and also (as Sumner himself notes) by being a convenient way to generate many thousands of words that can be later reworked into a book. This is particularly true of Sumner (more than Tabarrok or Cowen or, for that matter, me) because he tends to write long posts on common themes. (Rajiv Sethi, too, might be able to put together a book or some coherent articles by tying together his recent blog entries.)

Blogging and careers, blogging and careers . . . is blogging ever really bad for an academic career? I don’t know. I imagine that some academics spend lots of time on blogs that nobody reads, and that could definitely be bad for their careers in an opportunity-cost sort of way. Others such as Steven Levitt or Dan Ariely blog in an often-interesting but sometimes careless sort of way. This might be bad for their careers, but quite possibly they’ve reached a level of fame in which this sort of thing can’t really hurt them anymore. And this is fine; such researchers can make useful contributions with their speculations and let the Gelmans and Fungs of the world clean up after them. We each have our role in this food web. … And then of course there are the many many bloggers, academic and otherwise, whose work I assume I would’ve encountered much more rarely were they not blogging.

My own experience falls much more in line with Gelman’s here; my blogging experience has been almost wholly positive. Some of the benefits I’ve found to blogging regularly:

  • I’ve had many interesting email exchanges with people that started via a comment on something I wrote, and some of these will likely turn into collaborations at some point in the future.
  • I’ve been exposed to lots of interesting things (journal articles, blog posts, datasets, you name it) I wouldn’t have come across otherwise–either via links left in comments or sent by email, or while rooting around the web for things to write about.
  • I’ve gotten to publicize and promote my own research, which is always nice. As Gelman points out, it’s easier to learn about other people’s work if those people are actively blogging about it. I think that’s particularly true for people who are just starting out their careers.
  • I think blogging has improved both my ability and my willingness to write. By nature, I don’t actually like writing very much, and (like most academics I know) I find writing journal articles particularly unpleasant. Forcing myself to blog (semi-)regularly has instilled a certain discipline about writing that I haven’t always had, and if nothing else, it’s good practice.
  • I get to share ideas and findings I find interesting and/or important with other people. This is already what most academics do over drinks at conferences (and I think it’s a really important part of science), and blogging seems like a pretty natural extension.

All this isn’t to say that there aren’t any potential drawbacks to blogging. I think there are at least two important ones. One is the obvious point that, unless you’re blogging anonymously, it’s probably unwise to say things online that you wouldn’t feel comfortable saying in person. So, despite being a class-A jackass (er, pretty critical by nature), I try to discuss things I like as often as things I don’t like–and to keep the tone constructive whenever I do the latter.

The other potential drawback, which both Sumner and Gelman allude to, is the opportunity cost. If you’re spending half of your daylight hours blogging, there’s no question it’s going to have an impact on your academic productivity. But in practice, I don’t think blogging too much is a problem many academic bloggers have. I usually find myself wishing most of the bloggers I read posted more often. In my own case, I almost exclusively blog after around 9 or 10 pm, when I’m no longer capable of doing sustained work on manuscripts anyway (I’m generally at my peak in the late morning and early afternoon). So, for me, blogging has replaced about ten hours a week of book reading/TV watching/web surfing, while leaving the amount of “real” work I do largely unchanged. That’s not really much of a cost, and I might even classify it as another benefit. With the admittedly important caveat that watching less television has made me undeniably useless at trivia night.

a possible link between pesticides and ADHD

A forthcoming article in the journal Pediatrics that’s been getting a lot of press attention suggests that exposure to common pesticides may be associated with a substantially elevated risk of ADHD. More precisely, what the study found was that elevated urinary concentrations of organophosphate metabolites were associated with an increased likelihood of meeting criteria for an ADHD diagnosis. One of the nice things about this study is that the authors used archival data from the (very large) National Health and Nutrition Examination Survey (NHANES), so they were able to control for a relatively broad range of potential confounds (e.g., gender, age, and SES). The primary finding is, of course, still based on observational data, so you wouldn’t necessarily want to conclude that exposure to pesticides causes ADHD. But it’s a finding that converges with previous work in animal models demonstrating that high exposure to organophosphate pesticides causes neurodevelopmental changes, so it’s by no means a crazy hypothesis.

I think it’s really pleasantly surprising to see how responsibly the popular press has covered this story (e.g., this, this, and this). Despite the obvious potential for alarmism, very few articles have led with a headline implying a causal link between pesticides and ADHD. They all say things like “associated with”, “tied to”, or “linked to”, which is exactly right. And many even explicitly mention the size of the effect in question–namely, approximately a 50% increase in risk of ADHD per 10-fold increase in concentration of pesticide metabolites. Given that most of the articles contain cautionary quotes from the study’s authors, I’m guessing the authors really emphasized the study’s limitations when dealing with the press, which is great. In any case, because the basic details of the study have already been amply described elsewhere (I thought this short CBS article was particularly good), I’ll just mention a few random thoughts here:

  • Often, epidemiological studies suffer from a gaping flaw in the sense that the more interesting causal story (and the one that prompts media attention) is far less plausible than other potential explanations (a nice example of this is the recent work on the social contagion of everything from obesity to loneliness). That doesn’t seem to be the case here. Obviously, there are plenty of other reasons you might get a correlation between pesticide metabolites and ADHD risk–for instance, ADHD is substantially heritable, so it could be that parents with a disposition to ADHD also have systematically different dietary habits (i.e., parental dispositions are a common cause of both urinary metabolites and ADHD status in children). But given the aforementioned experimental evidence, it’s not obvious that alternative explanations for the correlation are much more plausible than the causal story linking pesticide exposure to ADHD, so in that sense this is potentially a very important finding.
  • The use of a dichotomous dependent variable (i.e., children either meet criteria for ADHD or don’t; there are no shades of ADHD gray here) is a real problem in this kind of study, because it can make the resulting effects seem deceptively large. The intuitive way we think about the members of a category is to think in terms of prototypes, so that when you think about “ADHD” and “Not-ADHD” categories, you’re probably mentally representing an extremely hyperactive, inattentive child for the former, and a quiet, conscientious kid for the latter. If that’s your mental model, and someone comes along and tells you that pesticide exposure increases the risk of ADHD by 50%, you’re understandably going to freak out, because it’ll seem quite natural to interpret that as a statement that pesticides have a 50% chance of turning average kids into hyperactive ones. But that’s not the right way to think about it. In all likelihood, pesticides aren’t causing a small proportion of kids to go from perfectly average to completely hyperactive; instead, what’s probably happening is that the entire distribution is shifting over slightly. In other words, most kids who are exposed to pesticides (if we assume for the sake of argument that there really is a causal link) are becoming slightly more hyperactive and/or inattentive.
  • Put differently, what happens when you have a strict cut-off for diagnosis is that even small increases in underlying symptoms can result in a qualitative shift in category membership. If ADHD symptoms were measured on a continuous scale (which they actually probably were, before being dichotomized to make things simple and more consistent with previous work), these findings might have been reported as something like “a 10-fold increase in pesticide exposures is associated with a 2-point increase on a 30-point symptom scale,” which would have made it much clearer that, at worst, pesticides are only one of many other contributing factors to ADHD, and almost certainly not nearly as big a factor as some others. That’s not to say we shouldn’t be concerned if subsequent work supports a causal link, but just that we should retain perspective on what’s involved. No one’s suggesting that you’re going to feed your child an unwashed pear or two and end up with a prescription for Ritalin; the more accurate view would be that you might have a minority of kids who are already at risk for ADHD, and this would be just one more precipitating factor.
  • It’s also worth keeping in mind that the relatively large increase in ADHD risk is associated with a ten-fold increase in pesticide metabolites. As the authors note, that corresponds to the difference between the 25th and 75th percentiles in the sample. Although we don’t know exactly what that means in terms of real-world exposure to pesticides (because the authors didn’t have any data on grocery shopping or eating habits), it’s almost certainly a very sizable difference (I won’t get into the reasons why, except to note that the rank-order of pesticide metabolites must be relatively stable among children, or else there wouldn’t be any association with a temporally-extended phenotype like ADHD). So the point is, it’s probably not so easy to go from the 25th to the 75th percentile just by eating a few more fruits and vegetables here and there. So while it’s certainly advisable to try and eat better, and potentially to buy organic produce (if you can afford it), you shouldn’t assume that you can halve your child’s risk of ADHD simply by changing his or her diet slightly. These are, at the end of the day, small effects.
  • The authors report that fully 12% of children in this nationally representative sample met criteria for ADHD (mostly of the inattentive subtype). This, frankly, says a lot more about how silly the diagnostic criteria for ADHD are than about the state of the nation’s children. It’s simply not plausible to suppose that 1 in 8 children really suffer from what is, in theory at least, a severe, potentially disabling disorder. I’m not trying to trivialize ADHD or argue that there’s no such thing, but simply to point out the dangers of medicalization. Once you’ve reached the point where 1 in every 8 people meet criteria for a serious disorder, the label is in danger of losing all meaning.
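To make the distribution-shift point from the bullets above concrete, here is a minimal sketch in Python. It is not the study's analysis: it assumes, purely for illustration, that symptom severity is normally distributed with unit variance and that the diagnostic cutoff sits where 12% of unexposed children exceed it (the prevalence figure quoted above). The function name and the shift values are hypothetical.

```python
from statistics import NormalDist


def share_above_cutoff(mean_shift_sd, cutoff_sd):
    """Proportion of a unit-variance normal population above a fixed cutoff
    after the whole distribution is shifted by mean_shift_sd (in SD units)."""
    return 1.0 - NormalDist(mu=mean_shift_sd, sigma=1.0).cdf(cutoff_sd)


# Assumption: the cutoff is placed so that 12% of the unshifted population meets criteria.
cutoff = NormalDist().inv_cdf(1.0 - 0.12)
baseline = share_above_cutoff(0.0, cutoff)

# Assumption: these shift sizes are arbitrary illustrative values, not estimates.
for shift in (0.1, 0.2, 0.3):
    shifted = share_above_cutoff(shift, cutoff)
    increase = shifted / baseline - 1.0
    print(f"shift = {shift:.1f} SD  prevalence = {shifted:.3f}  relative increase = {increase:.0%}")
```

Under these assumptions, nudging everyone up by somewhere between 0.2 and 0.3 standard deviations is already enough to raise the share of children over the cutoff by about half, even though no individual child has changed very much.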

Bouchard, M., Bellinger, D., Wright, R., & Weisskopf, M. (2010). Attention-Deficit/Hyperactivity Disorder and Urinary Metabolites of Organophosphate Pesticides. Pediatrics. DOI: 10.1542/peds.2009-3058

fMRI: coming soon to a courtroom near you?

Science magazine has run a series of three (1, 2, 3) articles by Greg Miller over the past few days covering an interesting trial in Tennessee. The case itself seems like garden variety fraud, but the novel twist is that the defense is trying to introduce fMRI scans into the courtroom in order to establish the defendant’s innocence. As far as I can tell from Miller’s articles, the only scientists defending the use of fMRI as a lie detector are those employed by Cephos (the company that provides the scanning service); the other expert witnesses (including Marc Raichle!) seem pretty adamant that admitting fMRI scans as evidence would be a colossal mistake. Personally, I think there are several good reasons why it’d be a terrible, terrible idea to let fMRI scans into the courtroom. In one way or another, they all boil down to the fact that there just isn’t any shred of evidence to support the use of fMRI as a lie detector in real-world (i.e., non-contrived) situations. Greg Miller has a quote from Martha Farah (who’s a spectator at the trial) that sums it up eloquently:

Farah sounds like she would have liked to chime in at this point about some things that weren’t getting enough attention. “No one asked me, but the thing we have not a drop of data on is [the situation] where people have their liberty at stake and have been living with a lie for a long time,” she says. She notes that the only published studies on fMRI lie detection involve people telling trivial lies with no threat of consequences. No peer-reviewed studies exist on real world situations like the case before the Tennessee court. Moreover, subjects in the published studies typically had their brains scanned within a few days of lying about a fake crime, whereas Semrau’s alleged crimes began nearly 10 years before he was scanned.

I’d go even further than this, and point out that even if there were studies that looked at ecologically valid lying, it’s unlikely that we’d be able to make any reasonable determination as to whether or not a particular individual was lying about a particular event. For one thing, most studies deal with group averages and not single-subject prediction; you might think that a highly statistically significant difference between two conditions (e.g., lying and not lying) necessarily implies a reasonable ability to make predictions at the single-subject level, but you’d be surprised. Prediction intervals for individual observations are typically extremely wide even when there’s a clear pattern at the group level. It’s just easier to make general statements about differences between conditions or groups than it is about what state a particular person is likely to be in given a certain set of conditions.
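A rough sketch of the gap between group-level significance and single-subject prediction, in Python. Everything here is an assumption made for illustration: a single hypothetical "activation" measure, normally distributed with equal variance in both conditions, a true standardized difference of d = 0.5, and 400 simulated observations per condition. None of these numbers come from the fMRI lie-detection literature.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def best_case_accuracy(d):
    """Accuracy of the optimal midpoint cutoff when two equal-variance normal
    distributions are d standard deviations apart and equally likely."""
    return NormalDist().cdf(d / 2.0)


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se


rng = random.Random(0)  # fixed seed so the sketch is repeatable
d = 0.5                 # assumed true effect size (hypothetical)
n = 400                 # assumed observations per condition (hypothetical)
lying = [rng.gauss(d, 1.0) for _ in range(n)]
truthful = [rng.gauss(0.0, 1.0) for _ in range(n)]

t_stat = welch_t(lying, truthful)

# Classify each observation as "lying" if it falls above the midpoint cutoff.
cutoff = d / 2.0
hits = sum(x > cutoff for x in lying) + sum(x <= cutoff for x in truthful)
accuracy = hits / (2 * n)

print(f"group-level t statistic: {t_stat:.2f}")
print(f"single-observation accuracy: {accuracy:.3f}")
print(f"theoretical best-case accuracy: {best_case_accuracy(d):.3f}")
```

With an effect of this size the group difference is very unlikely to be missed, yet even the best possible cutoff classifies an individual observation correctly only about 60% of the time, which is not much better than a coin flip.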

There is, admittedly, an emerging body of literature that uses pattern classification to make predictions about mental states at the level of individual subjects, and accuracy in these types of applications can sometimes be quite high. But these studies invariably operate on relatively restrictive sets of stimuli within well-characterized domains (e.g., predicting which word, out of a set of 60, subjects are looking at). This really isn’t “mind reading” in the sense that most people (including most judges and jurors) tend to think of it. And of course, even if you could make individual-level predictions reasonably accurately, it’s not clear that that’s good enough for the courtroom. As a scientist, I might be thrilled if I could predict which of 10 words you’re looking at with 80% accuracy (which, to be clear, is currently a pipe dream in the context of studies of ecologically valid lying). But as a lawyer, I’d probably be very skeptical of another lawyer who claimed my predictions vindicated their client. The fact that increased anterior cingulate activation tends to accompany lying on average isn’t a good reason to convict someone unless you can be reasonably certain that increased ACC activation accompanies lying for that person in that context when presented with that bit of information. At the moment, that’s a pretty hard sell.

As an aside, the thing I find perhaps most curious about the whole movement to use fMRI scanners as lie detectors is that there are very few studies that directly pit fMRI against more conventional lie detection techniques–namely, the polygraph. You can say what you like about the polygraph–and many people don’t think polygraph evidence should be admissible in court either–but at least it’s been around for a long time, and people know more or less what to expect from it. It’s easy to forget that it only makes sense to introduce fMRI scans (which are decidedly costly) as evidence if they do substantially better than polygraphs. Otherwise you’re just wasting a lot of money for a fancy brain image, and you could have gotten just as much information by simply measuring someone’s arousal level as you yell at them about that bloodstained Cadillac that was found parked in their driveway on the night of January 7th. But then, maybe that’s the whole point of trying to introduce fMRI to the courtroom; maybe lawyers know that the polygraph has a tainted reputation, and are hoping that fancy new brain scanning techniques that come with pretty pictures don’t carry the same baggage. I hope that’s not true, but I’ve learned to be cynical about these things.

At any rate, the Science articles are well worth a read, and since the judge hasn’t yet decided whether or not to allow the fMRI evidence, the next couple of weeks should be interesting…

[hat-tip: Thomas Nadelhoffer]

in defense of three of my favorite sayings

Seth Roberts takes issue with three popular maxims that (he argues) people use “to push away data that contradicts this or that approved view of the world”. He terms this preventive stupidity. I’m a frequent user of all three sayings, so I suppose that might make me preventively stupid; but I do feel like I have good reasons for using these sayings, and I confess to not really seeing Roberts’ point.

Here’s what Roberts has to say about the three sayings in question:

1. Absence of evidence is not evidence of absence. Øyhus explains why this is wrong. That such an Orwellian saying is popular in discussions of data suggests there are many ways we push away inconvenient data.

In my own experience, by far the biggest reason this saying is popular in discussions of data (and the primary reason I use it when reviewing papers) is that many people have a very strong tendency to interpret null results as an absence of any meaningful effect. That’s a very big problem, because the majority of studies in psychology tend to have relatively little power to detect small to moderate-sized effects. For instance, as I’ve discussed here, most whole-brain analyses in typical fMRI samples (of say, 15 – 20 subjects) have very little power to detect anything but massive effects. And yet people routinely interpret a failure to detect hypothesized effects as an indication that they must not exist at all. The simplest and most direct counter to this type of mistake is to note that one shouldn’t accept the null hypothesis unless one has very good reasons to think that power is very high and effect size estimates are consequently quite accurate. Which is just another way of saying that absence of evidence is not evidence of absence.
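For a sense of how little power small samples have, here is a back-of-the-envelope sketch in Python. It uses a normal (z-test) approximation for a one-sample or paired design, which slightly overstates the power of the corresponding t-test at small n. The effect sizes, sample sizes, and alpha levels are illustrative assumptions, and `approx_power` is a hypothetical helper, not a function from any statistics package.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(d, n, alpha=0.05):
    """Approximate two-sided power to detect a standardized effect d with n
    subjects, treating the test statistic as normal with known variance."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = d * math.sqrt(n)  # expected value of the test statistic under the effect
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Assumption: conventional small / medium / large effect sizes, typical small samples.
for d in (0.2, 0.5, 0.8):
    for n in (15, 20, 100):
        print(f"d = {d:.1f}  n = {n:3d}  power ~ {approx_power(d, n):.2f}")

# A stricter threshold, as in a corrected whole-brain analysis, lowers power further.
print(f"d = 0.5  n =  20  alpha = 0.001  power ~ {approx_power(0.5, 20, alpha=0.001):.2f}")
```

On these assumptions, a 20-subject study has well under a coin flip's chance of detecting a small effect even at an uncorrected .05 threshold, so a null result from such a study says very little about whether the effect exists.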

2. Correlation does not equal causation. In practice, this is used to mean that correlation is not evidence for causation. At UC Berkeley, a job candidate for a faculty position in psychology said this to me. I said, “Isn’t zero correlation evidence against causation?” She looked puzzled.

Again, Roberts’ experience clearly differs from mine; I’ve far more often seen this saying used as a way of suggesting that a researcher may be drawing overly strong causal conclusions from the data, not as a way of simply dismissing a correlation outright. A good example of this is found in the developmental literature, where many researchers have observed strong correlations between parents’ behavior and their children’s subsequent behavior. It is, of course, quite plausible to suppose that parenting behavior exerts a direct causal influence on children’s behavior, so that the children of negligent or abusive parents are more likely to exhibit delinquent behavior and grow up to perpetuate the “cycle of violence”. But this line of reasoning is substantially weakened by behavioral genetic studies indicating that very little of the correlation between parents’ and children’s personalities is explained by shared environmental factors, and that the vast majority reflects heritable influences and/or unique environmental influences. Given such findings, it’s a perfectly appropriate rebuttal to much of the developmental literature to note that correlation doesn’t imply causation.

It’s also worth pointing out that the anecdote Roberts provides isn’t exactly a refutation of the maxim; it’s actually an affirmation of the consequent. The fact that an absence of any correlation could potentially be strong evidence against causation (under the right circumstances) doesn’t mean that the presence of a correlation is strong evidence for causation. It may or may not be, but that’s something to be weighed on a case-by-case basis. There certainly are plenty of cases where it’s perfectly appropriate (and even called for) to remind someone that correlation doesn’t imply causation.

3. The plural of anecdote is not data. How dare you try to learn from stories you are told or what you yourself observe!

I suspect this is something of a sore spot for Roberts, who’s been an avid proponent of self-experimentation and case studies. I imagine people often dismiss his work as mere anecdote rather than valuable data. Personally, I happen to think there’s tremendous value to self-experimentation (at least when done in as controlled a manner as possible), so I don’t doubt there are many cases where this saying is unfairly applied. That said, I think Roberts fails to appreciate that people who do his kind of research constitute a tiny fraction of the population. Most of the time, when someone says that “the plural of anecdote is not data,” they’re not talking to someone who does rigorous self-experimentation, but to people who, say, don’t believe they should give up smoking seeing as how their grandmother smoked till she was 88 and died in a bungee-jumping accident, or who are convinced that texting while driving is perfectly acceptable because they don’t personally know anyone who’s gotten in an accident. In such cases, it’s not only legitimate but arguably desirable to point out that personal anecdote is no substitute for hard data.

Orwell was right. People use these sayings — especially #1 and #3 — to push away data that contradicts this or that approved view of the world. Without any data at all, the world would be simpler: We would simply believe what authorities tell us. Data complicates things. These sayings help those who say them ignore data, thus restoring comforting certainty.

Maybe there should be a term (antiscientific method?) to describe the many ways people push away data. Or maybe preventive stupidity will do.

I’d like to be charitable here, since there very clearly are cases where Roberts’ point holds true: sometimes people do toss out these sayings as a way of not really contending with data they don’t like. But frankly, the general claim that these sayings are antiscientific and constitute an act of stupidity just seems silly. All three sayings are clearly applicable in a large number of situations; to deny that, you’d have to believe that (a) it’s always fine to accept the null hypothesis, (b) correlation is always a good indicator of a causal relationship, and (c) personal anecdotes are just as good as large, well-controlled studies. I take it that no one, including Roberts, really believes that. So then it becomes a matter of when to apply these sayings, and not whether or not to use them. After all, it’d be silly to think that the people who use these sayings are always on the side of darkness, and the people who wield null results, correlations, and anecdotes with reckless abandon are always on the side of light.

My own experience, for what it’s worth, is that the use of these sayings is justified far more often than not, and I don’t have any reservation applying them myself when I think they’re warranted (which is relatively often–particularly the first one). But I grant that that’s just my own personal experience talking, and no matter how many experiences I’ve had of people using these sayings appropriately, I’m well aware that the plural of anecdote…

the capricious nature of p < .05, or why data peeking is evil

There’s a time-honored tradition in the social sciences–or at least psychology–that goes something like this. You decide on some provisional number of subjects you’d like to run in your study; usually it’s a nice round number like twenty or sixty, or some number that just happens to coincide with the sample size of the last successful study you ran. Or maybe it just happens to be your favorite number (which of course is forty-four). You get your graduate student to start running the study, and promptly forget about it for a couple of weeks while you go about writing up journal reviews that are three weeks overdue and chapters that are six months overdue.

A few weeks later, you decide you’d like to know how that Amazing New Experiment you’re running is going. You summon your RA and ask him, in magisterial tones, “how’s that Amazing New Experiment we’re running going?” To which he falteringly replies that he’s been very busy with all the other data entry and analysis chores you assigned him, so he’s only managed to collect data from eighteen subjects so far. But he promises to have the other eighty-two subjects done any day now.

“Not to worry,” you say. “We’ll just take a peek at the data now and see what it looks like; with any luck, you won’t even need to run any more subjects! By the way, here are my car keys; see if you can’t have it washed by 5 pm. Your job depends on it. Ha ha.”

Once your RA’s gone to soil himself somewhere, you gleefully plunge into the task of peeking at your data. You pivot your tables, plyr your data frame, and bravely sort your columns. Then you extract two of the more juicy variables for analysis, and after some careful surgery and a t-test or six, you arrive at the conclusion that your hypothesis is… “marginally” supported. Which is to say, the magical p value is somewhere north of .05 and somewhere south of .10, and now it’s just parked by the curb waiting for you to give it better directions.

You briefly contemplate reporting your result as a one-tailed test–since it’s in the direction you predicted, right?–but ultimately decide against that. You recall the way your old Research Methods professor used to rail at length against the evils of one-tailed tests, and even if you don’t remember exactly why they’re so evil, you’re not willing to take any chances. So you decide it can’t be helped; you need to collect some more data.

You summon your RA again. “Is my car washed yet?” you ask.

“No,” says your RA in a squeaky voice. “You just asked me to do that fifteen minutes ago.”

“Right, right,” you say. “I knew that.”

You then explain to your RA that he should suspend all other assigned duties for the next few days and prioritize running subjects in the Amazing New Experiment. “Abandon all other tasks!” you decree. “If it doesn’t involve collecting new data, it’s unimportant! Your job is to eat, sleep, and breathe new subjects! But not literally!”

Being quite clever, your RA sees an opening. “I guess you’ll want your car keys back, then,” he suggests.

“Nice try, Poindexter,” you say. “Abandon all other tasks… starting tomorrow.”

You also give your RA very careful instructions to email you the new data after every single subject, so that you can toss it into your spreadsheet and inspect the p value at every step. After all, there’s no sense in wasting perfectly good data; once your p value is below .05, you can just funnel the rest of the participants over to the Equally Amazing And Even Newer Experiment you’ve been planning to run as a follow-up. It’s a win-win proposition for everyone involved. Except maybe your RA, who’s still expected to return triumphant with a squeaky clean vehicle by 5 pm.

Twenty-six months and four rounds of review later, you publish the results of the Amazing New Experiment as Study 2 in a six-study paper in the Journal of Ambiguous Results. The reviewers raked you over the coals for everything from the suggested running head of the paper to the ratio between the abscissa and the ordinate in Figure 3. But what they couldn’t argue with was the p value in Study 2, which clocked in at just under p < .05, with only 21 subjects’ worth of data (compare that to the 80 you had to run in Study 4 to get a statistically significant result!). Suck on that, Reviewers!, you think to yourself pleasantly while driving yourself home from work in your shiny, shiny Honda Civic.

So ends our short parable, which has at least two subtle points to teach us. One is that it takes a really long time to publish anything; who has time to wait twenty-six months and go through four rounds of review?

The other, more important point, is that the desire to peek at one’s data, which often seems innocuous enough–and possibly even advisable (quality control is important, right?)–can actually be quite harmful. At least if you believe that the goal of doing research is to arrive at the truth, and not necessarily to publish statistically significant results.

The basic problem is that peeking at your data is rarely a passive process; most often, it’s done in the context of a decision-making process, where the goal is to determine whether or not you need to keep collecting data. There are two possible peeking outcomes that might lead you to decide to halt data collection: a very low p value (i.e., p < .05), in which case your hypothesis is supported and you may as well stop gathering evidence; or a very high p value, in which case you might decide that it’s unlikely you’re ever going to successfully reject the null, so you may as well throw in the towel. Either way, you’re making the decision to terminate the study based on the results you find in a provisional sample.

A complementary situation, which also happens not infrequently, occurs when you collect data from exactly as many participants as you decided ahead of time, only to find that your results aren’t quite what you’d like them to be (e.g., a marginally significant hypothesis test). In that case, it may be quite tempting to keep collecting data even though you’ve already hit your predetermined target. I can count on more than one hand the number of times I’ve overheard people say (often without any hint of guilt) something to the effect of “my p value’s at .06 right now, so I just need to collect data from a few more subjects.”

Here’s the problem with either (a) collecting more data in an effort to turn p < .06 into p < .05, or (b) ceasing data collection because you’ve already hit p < .05: any time you add another subject to your sample, there’s a fairly large probability the p value will go down purely by chance, even if there’s no effect. So there you are sitting at p < .06 with twenty-four subjects, and you decide to run a twenty-fifth subject. Well, let’s suppose that there actually isn’t a meaningful effect in the population, and that p < .06 value you’ve got is a (near) false positive. Adding that twenty-fifth subject can only do one of two things: it can raise your p value, or it can lower it. The exact probabilities of these two outcomes depend on the current effect size in your sample before adding the new subject; but generally speaking, they’ll rarely be very far from 50-50. So now you can see the problem: if you stop collecting data as soon as you get a significant result, you may well be capitalizing on chance. It could be that if you’d collected data from a twenty-sixth and twenty-seventh subject, the p value would reverse its trajectory and start rising. It could even be that if you’d collected data from two hundred subjects, the effect size would stabilize near zero. But you’d never know that if you stopped the study as soon as you got the results you were looking for.

Lest you think I’m exaggerating, and think that this problem falls into the famous class of things-statisticians-and-methodologists-get-all-anal-about-but-that-don’t-really-matter-in-the-real-world, here’s a sobering figure (taken from this chapter):


The figure shows the results of a simulation quantifying the increase in false positives associated with data peeking. The assumptions here are that (a) data peeking begins after about 10 subjects (starting earlier would further increase false positives, and starting later would decrease false positives somewhat), (b) the researcher stops as soon as a peek at the data reveals a result significant at p < .05, and (c) data peeking occurs at incremental steps of either 1 or 5 subjects. Given these assumptions, you can see that there’s a fairly monstrous rise in the actual Type I error rate (relative to the nominal rate of 5%). For instance, if the researcher initially plans to collect 60 subjects, but peeks at the data after every 5 subjects, there’s approximately a 17% chance that the threshold of p < .05 will be reached before the full sample of 60 subjects is collected. When data peeking occurs even more frequently (as might happen if a researcher is actively trying to turn p < .07 into p < .05, and is monitoring the results after each incremental participant), Type I error inflation is even worse. So unless you think there’s no practical difference between a 5% false positive rate and a 15 – 20% false positive rate, you should be concerned about data peeking; it’s not the kind of thing you just brush off as needless pedantry.
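
If you'd rather not take the figure on faith, the inflation is easy to reproduce for yourself. Here's a minimal sketch in Python. To be clear, this is not the simulation behind the figure: it assumes scores drawn from a normal distribution with a known standard deviation of 1 (so a simple z-test stands in for the t-test), no true effect at all, and peeking that starts at subject 10. The function names and parameters are made up for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(max_n=60, first_peek=10, step=5,
                        alpha=0.05, n_sims=4000, seed=1):
    """Share of null studies that reach p < alpha at ANY peek.

    Assumptions (illustrative only): scores are N(0, 1), the true effect
    is zero, and the researcher stops at the first significant peek.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        # Draw the full planned sample up front, then peek at prefixes of it.
        scores = [rng.gauss(0.0, 1.0) for _ in range(max_n)]
        for n in range(first_peek, max_n + 1, step):
            # z-test of the mean against 0 with known sd = 1.
            z = sum(scores[:n]) / math.sqrt(n)
            if two_sided_p(z) < alpha:
                false_positives += 1
                break  # stop collecting data as soon as p < alpha
    return false_positives / n_sims


if __name__ == "__main__":
    print("one look at n = 60:  ", false_positive_rate(first_peek=60, step=1))
    print("peek every 5 subjects:", false_positive_rate(step=5))
    print("peek every subject:   ", false_positive_rate(step=1))
```

With a single look at the full sample, the rate should sit close to the nominal 5%. Peeking every 5 subjects, and especially after every subject, should push it well above that.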

How do we stop ourselves from capitalizing on chance by looking at the data? Broadly speaking, there are two reasonable solutions. One is to just pick a number up front and stick with it. If you commit yourself to collecting data from exactly as many subjects as you said you would (you can proclaim the exact number loudly to anyone who’ll listen, if you find it helps), you’re then free to peek at the data all you want. After all, it’s not the act of observing the data that creates the problem; it’s the decision to terminate data collection based on your observation that matters.

The other alternative is to explicitly correct for data peeking. This is a common approach in large clinical trials, where data peeking is often ethically mandated, because you don’t want to either (a) harm people in the treatment group if the treatment turns out to have clear and dangerous side effects, or (b) prevent the control group from capitalizing on the treatment too if it seems very efficacious. In either event, you’d want to terminate the trial early. What researchers often do, then, is pick predetermined intervals at which to peek at the data, and then apply a correction to the p values that takes into account the number of, and interval between, peeking occasions. Provided you do things systematically in that way, peeking then becomes perfectly legitimate. Of course, the downside is that having to account for those extra inspections of the data makes your statistical tests more conservative. So if there aren’t any ethical issues that necessitate peeking, and you’re not worried about quality control issues that might be revealed by eyeballing the data, your best bet is usually to just pick a reasonable sample size (ideally, one based on power calculations) and stick with it.
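
To make the "apply a correction" idea a bit more concrete, here's one hedged sketch. It's in the spirit of a Pocock-style group-sequential boundary, which is just my choice of illustration and only one of several correction schemes: every planned look uses the same, stricter threshold, chosen so that the false positive rate across all looks combined stays near 5%. Rather than relying on published tables, the sketch calibrates that threshold by simulation, under the same simplifying assumptions as before (a z-test on N(0, 1) scores, no true effect). The function names and the look schedule are hypothetical.

```python
import math
import random


def simulate_min_p(looks, n_sims=4000, seed=7):
    """For each simulated null study, return the smallest two-sided
    p value observed across the planned looks."""
    rng = random.Random(seed)
    look_set = set(looks)
    min_ps = []
    for _ in range(n_sims):
        total, smallest = 0.0, 1.0
        for n in range(1, max(looks) + 1):
            total += rng.gauss(0.0, 1.0)  # one more subject, no true effect
            if n in look_set:
                z = total / math.sqrt(n)  # z-test with known sd = 1
                p = math.erfc(abs(z) / math.sqrt(2))
                smallest = min(smallest, p)
        min_ps.append(smallest)
    return min_ps


def per_look_alpha(looks, overall_alpha=0.05, **sim_kwargs):
    """Constant per-look threshold that roughly overall_alpha of the
    simulated null studies would ever fall below."""
    min_ps = sorted(simulate_min_p(looks, **sim_kwargs))
    k = int(overall_alpha * len(min_ps))
    return min_ps[k]


if __name__ == "__main__":
    # Three pre-registered looks at n = 20, 40 and 60.
    print(per_look_alpha([20, 40, 60]))
```

For three equally spaced looks the calibrated threshold should come out at roughly .02 rather than .05, which is exactly the sense in which the corrected tests are more conservative.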

Oh, and also, don’t make your RAs wash your car for you; that’s not their job.

de Waal and Ferrari on cognition in humans and animals

Humans do many things that most animals can’t. That much no one would dispute. The more interesting and controversial question is just how many things we can do that most animals can’t, and just how many animal species can or can’t do the things we do. That question is at the center of a nice opinion piece in Trends in Cognitive Sciences by Frans de Waal and Pier Francesco Ferrari.

De Waal and Ferrari argue for what they term a bottom-up approach to human and animal cognition. The fundamental idea–which isn’t new, and in fact owes much to decades of de Waal’s own work with primates–is that most of our cognitive abilities, including many that are often characterized as uniquely human, are in fact largely continuous with abilities found in other species. De Waal and Ferrari highlight a number of putatively “special” functions like imitation and empathy that turn out to have relatively frequent primate (and in some cases non-primate) analogs. They push for a bottom-up scientific approach that seeks to characterize the basic mechanisms that complex functionality might have arisen out of, rather than (what they see as) “the overwhelming tendency outside of biology to give human cognition special treatment.”

Although I agree pretty strongly with the thesis of the paper, its scope is also, in some ways, quite limited: De Waal and Ferrari clearly believe that many complex functions depend on homologous mechanisms in both humans and non-human primates, but they don’t actually say very much about what these mechanisms might be, save for some brief allusions to relatively broad neural circuits (e.g., the oft-criticized mirror neuron system, which Ferrari played a central role in identifying and characterizing). To some extent that’s understandable given the brevity of TICS articles, but given how much de Waal has written about primate cognition, it would have been nice to see a more detailed example of the types of cognitive representations de Waal thinks underlie, say, the homologous abilities of humans and capuchin monkeys to empathize with conspecifics.

Also, despite its categorization as an “Opinion” piece (these are supposed to stir up debate), I don’t think many people (at least, the kind of people who read TICS articles) are going to take issue with the basic continuity hypothesis advanced by de Waal and Ferrari. I suspect many more people would agree than disagree with the notion that most complex cognitive abilities displayed by humans share a closely intertwined evolutionary history with seemingly less sophisticated capacities displayed by primates and other mammalian species. So in that sense, de Waal and Ferrari might be accused of constructing something of a straw man. But it’s important to recognize that de Waal’s own work is a very large part of the reason why the continuity hypothesis is so widely accepted these days. So in that sense, even if you already agree with its premise, the TICS paper is worth reading simply as an elegant summary of a long-standing and important line of research.

in brief…

Some neat stuff from the past week or so:

  • If you’ve ever wondered how to go about getting a commentary on an article published in a peer-reviewed journal, wonder no longer… you can’t. Or rather, you can, but it may not be worth your trouble. Rick Trebino explains. [new to me via A.C. Thomas, though apparently this one’s been around for a while.]
  • The data-driven life: A great article in the NYT magazine discusses the growing number of people who’re quantitatively recording the details of every aspect of their lives, from mood to glucose levels to movement patterns. I dabbled with this a few years ago, recording my mood, diet, and exercise levels for about 6 months. I’m not sure how much I learned that was actually useful, but if nothing else, it’s a fun exercise to play around with a giant matrix of correlations that are all about YOU.
  • Cameron Neylon has an excellent post up defending the viability (and superiority) of the author-pays model of publication.
  • In typical fashion, Carl Zimmer has a wonderful blog post up explaining why tapeworms in Madagascar tell us something important about human evolution.
  • The World Bank, as you might expect, has accumulated a lot of economic data. For years, they’ve been selling it at a premium, but as of 2010 the World Development Indicators are completely free to access. [via Flowing Data]
  • Ever tried Jew’s Ear Juice? No? In China, you can–but not for long, if the government has its way. The NYT reports on efforts to eradicate Chinglish in public. Money quote:

“The purpose of signage is to be useful, not to be amusing,” said Zhao Huimin, the former Chinese ambassador to the United States who, as director general of the capital’s Foreign Affairs Office, has been leading the fight for linguistic standardization and sobriety.

more on the absence of brain training effects

A little while ago I blogged about the recent Owen et al Nature study on the (null) effects of cognitive training. My take on the study, which found essentially no effect of cognitive training on generalized cognitive performance, was largely positive. In response, Martin Walker, founder of Mind Sparke, maker of Brain Fitness Pro software, left this comment:

I’ve done regular aerobic training for pretty much my entire life, but I’ve never had the kind of mental boost from exercise that I have had from dual n-back training. I’ve also found that n-back training helps my mood.

There was a foundational problem with the BBC study in that it didn’t provide anywhere near the intensity of training that would be required to show transfer. The null hypothesis was a forgone conclusion. It seems hard to believe that the scientists didn’t know this before they began and were setting out to debunk the populist brain game hype.

I think there are a couple of points worth making. One is the standard rejoinder that one anecdotal report doesn’t count for very much. That’s not meant as a jibe at Walker in particular, but simply as a general observation about the fallibility of human judgment. Many people are perfectly convinced that homeopathic solutions have dramatically improved their quality of life, but that doesn’t mean we should take homeopathy seriously. Of course, I’m not suggesting that cognitive training programs are as ineffectual as homeopathy–in my post, I suggested they may well have some effect–but simply that personal testimonials are no substitute for controlled studies.

With respect to the (also anecdotal) claim that aerobic exercise hasn’t worked for Walker, it’s worth noting that the effects of aerobic exercise on cognitive performance take time to develop. No one expects a single brisk 20-minute jog to dramatically improve cognitive performance. If you’ve been exercising regularly your whole life, the question isn’t whether exercise will improve your cognitive function–it’s whether not doing any exercise for a month or two would lead to poorer performance. That is, if Walker stopped exercising, would his cognitive performance suffer? It would be a decidedly unhealthy hypothesis to test, of course, but that would really be the more reasonable prediction. I don’t think anyone thinks that a person in excellent physical condition would benefit further from physical exercise; the point is precisely that most people aren’t in excellent physical shape. In any event, as I noted in my post, the benefits of aerobic exercise are clearly largest for older adults who were previously relatively sedentary. There’s much less evidence for large effects of aerobic exercise on cognitive performance in young or middle-aged adults.

The more substantive question Walker raises has to do with whether the tasks Owen et al used were too easy to support meaningful improvement. I think this is a reasonable question, but I don’t think the answer is as straightforward as Walker suggests. For one thing, participants in the Owen et al study did show substantial gains in performance on the training tasks (just not the untrained tasks), so it’s not like they were at ceiling. That is, the training tasks clearly weren’t easy. Second, participants varied widely in the number of training sessions they performed, and yet, as the authors note, the correlation between amount of training and cognitive improvement was negligible. So if you extrapolate from the observed pattern, it doesn’t look particularly favorable. Third, Owen et al used 12 different training tasks that spanned a broad range of cognitive abilities. While one can quibble with any individual task, it’s hard to reconcile the overall pattern of null results with the notion that cognitive training produces robust effects. Surely at least some of these measures should have led to a noticeable overall effect if they successfully produced transfer. But they didn’t.

To reiterate what I said in my earlier post, I’m not saying that cognitive training has absolutely no effect. No study is perfect, and it’s conceivable that more robust effects might be observed given a different design. But the Owen et al study is, to put it bluntly, the largest study of cognitive training conducted to date by about two orders of magnitude, and that counts for a lot in an area of research dominated by relatively small studies that have generally produced mixed findings. So, in the absence of contradictory evidence from another large training study, I don’t see any reason to second-guess Owen et al’s conclusions.

Lastly, I don’t think Walker is in any position to cast aspersions on people’s motivations (“It seems hard to believe that the scientists didn’t know this before they began and were setting out to debunk the populist brain game hype”). While I don’t think that his financial stake in brain training programs necessarily impugns his evaluation of the Owen et al study, it can’t exactly promote impartiality either. And for what it’s worth, I dug around the Mind Sparke website and couldn’t find any “scientific proof” that the software works (which is what the website claims)–just some vague allusions to customer testimonials and some citations of other researchers’ published work (none of which, as far as I can tell, used Brain Fitness Pro for training).

everything we know about the neural bases of cognitive control, in 20 review articles or less

Okay, not everything. But a lot of what we know. The current issue of Current Opinion in Neurobiology, which features a special focus on cognitive neuroscience, contains almost 20 short review papers, most of which focus on the neural mechanisms of cognitive control in one guise or another. As the Editors of the special issue (Earl Miller and Liz Phelps) explain in their introduction:

Our goal with this special issue was to highlight integrative approaches to brain function. To this end, we focused on the most integrative of brain functions, cognitive control. Cognitive, or executive, control is the ability to coordinate thought and action by directing them toward goals, often far removed goals.

I’ve only skimmed a couple of articles so far, but it’s a pretty impressive table of contents, and I’m looking forward to reading a lot of the reviews. The nice thing about the Current Opinion series, like the Trends series, is that the reviews are short and focused, so they’re well-suited to people who are very busy and don’t have enough hours in their day (like you), or people who just have a short attention span (like me).

Admittedly, I also have an ulterior motive for mentioning this issue: Todd Braver, Mike Cole and I contributed one of the articles, in which we review the neural bases of individual differences in executive control. I think it’s a really nice paper, the credit for which really goes to Todd and Mike–I mostly just contributed the section on methodological considerations (which is basically a precis of a much longer chapter I wrote with Todd a couple of years ago). Todd and Mike somehow managed to review work on everything from reward and motivation to emotion regulation to working memory capacity to dopamine genes, all in the space of eight pages. It’s a nice review highlighting the importance of modeling not only the central tendency of people’s behavior and brain activation in cognitive neuroscience studies, but also the variation between individuals. Aside from the fact that many people (including me!) find individual differences in cognitive abilities intrinsically interesting, an individual differences approach can provide insights that naturally complement those identified by more common within-subject analyses.

For instance, there’s a giant literature on the critical role the neurotransmitter dopamine plays in maintaining and updating goal representations. Most process models of dopamine function make either explicit or tacit predictions about how individual differences in dopamine function should manifest behaviorally, and recent studies have sought to test some of these predictions using both neuroimaging and molecular genetic techniques. A lot of work has focused on a common polymorphism in the COMT gene, variants of which dramatically alter the efficiency of dopamine degradation in the prefrontal cortex. An (admittedly simplistic) prediction that follows from one standard view of prefrontal dopamine function (that tonic dopamine serves to stabilize active representations) is that people who possess the low-activity met allele (and consequently have higher dopamine levels in PFC) should have a greater capacity to maintain goal representations and sustain attention, which may manifest as improved performance on many working memory tasks. Conversely, people with the val allele, which is associated with lower tonic dopamine levels in PFC, should do worse at tasks requiring sustained attention, but may have greater cognitive flexibility (due to the capacity to switch between goal representations more easily).

This prediction, which is borne out by a number of studies we review, is fundamentally about individual differences, since we typically can’t manipulate people’s COMT genes in the lab (though I know some people who probably really wish we could!). But the point is, even if you’re not intrinsically interested in what makes people different from one another, studying individual variation at a genetic, neural, or behavioral level can often tell you something useful about the models you’re developing. Particularly when it comes to the domain of executive control, where differences between individuals can be quite striking. Almost any mechanistic model of executive control is going to have ‘joints’ that could theoretically vary systematically across individuals, so it makes sense to capitalize on natural variability between people to test some of the predictions that fall out of the model, instead of just treating between-subject variability as the error term in your one-sample t-test.

Anyway, our article is here, and the full issue is here (though it’s behind a paywall, unfortunately).