what the Dunning-Kruger effect is and isn’t

If you regularly read cognitive science or psychology blogs (or even just the lowly New York Times!), you’ve probably heard of something called the Dunning-Kruger effect. The Dunning-Kruger effect refers to the seemingly pervasive tendency of poor performers to overestimate their abilities relative to other people–and, to a lesser extent, for high performers to underestimate their abilities. The explanation for this, according to Kruger and Dunning, who first reported the effect in an extremely influential 1999 article in the Journal of Personality and Social Psychology, is that incompetent people lack the skills they’d need in order to be able to distinguish good performers from bad performers:

…people who lack the knowledge or wisdom to perform well are often unaware of this fact. We attribute this lack of awareness to a deficit in metacognitive skill. That is, the same incompetence that leads them to make wrong choices also deprives them of the savvy necessary to recognize competence, be it their own or anyone else’s.

For reasons I’m not really clear on, the Dunning-Kruger effect seems to be experiencing something of a renaissance over the past few months; it’s everywhere in the blogosphere and media. For instance, here are just a few alleged Dunning-Krugerisms from the past few weeks:

So what does this mean in business? Well, it’s all over the place. Even the title of Dunning and Kruger’s paper, the part about inflated self-assessments, reminds me of a truism that was pointed out by a supervisor early in my career: The best employees will invariably be the hardest on themselves in self-evaluations, while the lowest performers can be counted on to think they are doing excellent work…

Heidi Montag and Spencer Pratt are great examples of the Dunning-Kruger effect. A whole industry of assholes are making a living off of encouraging two attractive yet untalented people they are actually genius auteurs. The bubble around them is so thick, they may never escape it. At this point, all of America (at least those who know who they are), is in on the joke – yet the two people in the center of this tragedy are completely unaware…

Not so fast there — the Dunning-Kruger effect comes into play here. People in the United States do not have a high level of understanding of evolution, and this survey did not measure actual competence. I’ve found that the people most likely to declare that they have a thorough knowledge of evolution are the creationists…but that a brief conversation is always sufficient to discover that all they’ve really got is a confused welter of misinformation…

As you can see, the findings reported by Kruger and Dunning are often interpreted to suggest that the less competent people are, the more competent they think they are. People who perform worst at a task tend to think they’re god’s gift to said task, and the people who can actually do said task often display excessive modesty. I suspect we find this sort of explanation compelling because it appeals to our implicit just-world theories: we’d like to believe that people who obnoxiously proclaim their excellence at X, Y, and Z must really not be so very good at X, Y, and Z at all, and must be (over)compensating for some actual deficiency; it’s much less pleasant to imagine that people who go around shoving their (alleged) superiority in our faces might really be better than us at what they do.

Unfortunately, Kruger and Dunning never actually provided any support for this type of just-world view; their studies categorically didn’t show that incompetent people are more confident or arrogant than competent people. What they did show is this:

This is one of the key figures from Kruger and Dunning’s 1999 paper (and the basic effect has been replicated many times since). The critical point to note is that there’s a clear positive correlation between actual performance (gray line) and perceived performance (black line): the people in the top quartile for actual performance think they perform better than the people in the second quartile, who in turn think they perform better than the people in the third quartile, and so on. So the bias is definitively not that incompetent people think they’re better than competent people. Rather, it’s that incompetent people think they’re much better than they actually are. But they typically still don’t think they’re quite as good as people who, you know, actually are good. (It’s important to note that Dunning and Kruger never claimed to show that the unskilled think they’re better than the skilled; that’s just the way the finding is often interpreted by others.)

That said, it’s clear that there is a very large discrepancy between the way incompetent people actually perform and the way they perceive their own performance level, whereas the discrepancy is much smaller for highly competent individuals. So the big question is why. Kruger and Dunning’s explanation, as I mentioned above, is that incompetent people lack the skills they’d need in order to know they’re incompetent. For example, if you’re not very good at learning languages, it might be hard for you to tell that you’re not very good, because the very skills that you’d need in order to distinguish someone who’s good from someone who’s not are the ones you lack. If you can’t hear the distinction between two different phonemes, how could you ever know who has native-like pronunciation ability and who doesn’t? If you don’t understand very many words in another language, how can you evaluate the size of your own vocabulary in relation to other people’s?

This appeal to people’s meta-cognitive abilities (i.e., their knowledge about their knowledge) has some intuitive plausibility, and Kruger, Dunning and their colleagues have provided quite a bit of evidence for it over the past decade. That said, it’s by no means the only explanation around; over the past few years, a fairly sizeable literature criticizing or extending Kruger and Dunning’s work has developed. I’ll mention just three plausible (and mutually compatible) alternative accounts people have proposed (but there are others!).

1. Regression toward the mean. Probably the most common criticism of the Dunning-Kruger effect is that it simply reflects regression to the mean–that is, it’s a statistical artifact. Regression to the mean refers to the fact that any time you select a group of individuals based on some criterion, and then measure the standing of those individuals on some other dimension, performance levels will tend to shift (or regress) toward the mean level. It’s a notoriously underappreciated problem, and probably explains many, many phenomena that people have tried to interpret substantively. For instance, in placebo-controlled clinical trials of SSRIs, depressed people tend to get better in both the drug and placebo conditions. Some of this is undoubtedly due to the placebo effect, but much of it is probably also due to what’s often referred to as “natural history”. Depression, like most things, tends to be cyclical: people get better or worse over time, often for no apparent rhyme or reason. But since people tend to seek help (and sign up for drug trials) primarily when they’re doing particularly badly, it follows that most people would get better to some extent even without any treatment. That’s regression to the mean (the Wikipedia entry has other nice examples–for example, the famous Sports Illustrated Cover Jinx).

In the context of the Dunning-Kruger effect, the argument is that incompetent people simply regress toward the mean when you ask them to evaluate their own performance. Since perceived performance is influenced not only by actual performance, but also by many other factors (e.g., one’s personality, meta-cognitive ability, measurement error, etc.), it follows that, on average, people with extreme levels of actual performance won’t be quite as extreme in terms of their perception of their performance. So, much of the Dunning-Kruger effect arguably doesn’t need to be explained at all, and in fact, it would be quite surprising if you didn’t see a pattern of results that looks at least somewhat like the figure above.
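To make the regression argument concrete, here’s a minimal simulation sketch (my toy numbers, not anything from the published studies): perceived skill is modeled as nothing more than a noisy correlate of actual skill, with no self-enhancement bias and no metacognitive deficit anywhere in the model, and a Dunning-Kruger-like pattern falls out anyway.

```python
# A minimal sketch: regression to the mean alone produces a DK-like pattern.
# Perceived skill is just a noisy function of actual skill (r ~ .45), with no
# bias and no metacognitive deficit for poor performers.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

actual = rng.normal(size=n)                    # true ability (z-scores)
perceived = 0.5 * actual + rng.normal(size=n)  # noisy self-assessment

# Convert both to percentile ranks, as in Kruger and Dunning's plots
actual_pct = 100 * (actual.argsort().argsort() + 0.5) / n
perceived_pct = 100 * (perceived.argsort().argsort() + 0.5) / n

for q in range(4):
    mask = (actual_pct >= 25 * q) & (actual_pct < 25 * (q + 1))
    print(f"Actual quartile {q + 1}: "
          f"mean actual percentile = {actual_pct[mask].mean():5.1f}, "
          f"mean perceived percentile = {perceived_pct[mask].mean():5.1f}")
# The bottom quartile's mean perceived percentile lands somewhere in the 30s,
# the top quartile's in the 60s: "overestimation" at the bottom and
# "underestimation" at the top, purely from noise.
```

Note, though, that regression to the mean by itself produces a roughly symmetric pattern; the asymmetry in the real data is what the next account is meant to handle.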

2. Regression to the mean plus better-than-average. Having said that, it’s clear that regression to the mean can’t explain everything about the Dunning-Kruger effect. One problem is that it doesn’t explain why the effect is greater at the low end than at the high end. That is, incompetent people tend to overestimate their performance to a much greater extent than competent people underestimate their performance. This asymmetry can’t be explained solely by regression to the mean. It can, however, be explained by a combination of RTM and a “better-than-average” (or self-enhancement) heuristic which says that, in general, most people have a tendency to view themselves excessively positively. This two-pronged explanation was proposed in a 2002 study by Krueger and Mueller (note that Krueger and Kruger are different people!), who argued that poor performers suffer from a double whammy: not only do their perceptions of their own performance regress toward the mean, but those perceptions are also further inflated by the self-enhancement bias. In contrast, for high performers, these two effects largely balance each other out: regression to the mean causes high performers to underestimate their performance, but to some extent that underestimation is offset by the self-enhancement bias. As a result, it looks as though high performers make more accurate judgments than low performers, when in reality the high performers are just lucky that, given where they sit in the distribution, the two opposing biases happen to cancel out.
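Here’s how that double whammy plays out in a sketch (again my toy numbers, not Krueger and Mueller’s actual model): take the same kind of noisy self-assessment as in the previous sketch and add a constant “better-than-average” bump to everyone’s perceived percentile.

```python
# Sketch of the Krueger & Mueller-style account: a noisy self-assessment plus
# a constant self-enhancement bias. At the low end the bias adds to the
# regression artifact; at the high end it partly cancels it.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

actual = rng.normal(size=n)
perceived = 0.5 * actual + rng.normal(size=n)

actual_pct = 100 * (actual.argsort().argsort() + 0.5) / n
perceived_pct = 100 * (perceived.argsort().argsort() + 0.5) / n
perceived_pct = np.clip(perceived_pct + 15, 0, 100)  # everyone adds ~15 points

for q in range(4):
    mask = (actual_pct >= 25 * q) & (actual_pct < 25 * (q + 1))
    gap = perceived_pct[mask].mean() - actual_pct[mask].mean()
    print(f"Actual quartile {q + 1}: mean (perceived - actual) = {gap:+5.1f} points")
# Roughly +35 points of overestimation in the bottom quartile versus a few
# points of underestimation in the top quartile: the asymmetry, with no
# metacognitive deficit in sight.
```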

3. The instrumental role of task difficulty. Consistent with the notion that the Dunning-Kruger effect is at least partly a statistical artifact, some studies have shown that the asymmetry reported by Kruger and Dunning (i.e., the smaller discrepancy for high performers than for low performers) actually goes away, and even reverses, when the ability tests given to participants are very difficult. For instance, Burson and colleagues (2006), writing in JPSP, showed that when University of Chicago undergraduates were asked moderately difficult trivia questions about their university, the subjects who performed best were just as poorly calibrated as the people who performed worst, in the sense that their estimates of how well they did relative to other people were wildly inaccurate. Here’s what that looks like:

Notice that this finding wasn’t anomalous with respect to the Kruger and Dunning findings; when participants were given easier trivia (the diamond-studded line), Burson et al observed the standard pattern, with poor performers seemingly showing worse calibration. Simply knocking about 10% off the accuracy rate on the trivia questions was enough to induce a large shift in the relative mismatch between perceptions of ability and actual ability. Burson et al then went on to replicate this pattern in two additional studies involving a number of different judgments and tasks, so this result isn’t specific to trivia questions. In fact, in the later studies, Burson et al showed that when the task was really difficult, poor performers were actually considerably better calibrated than high performers.

Looking at the figure above, it’s not hard to see why this would be. Since the slope of the perceived-performance line tends to be pretty constant in these types of experiments, what task difficulty mostly does is shift the whole line up or down (i.e., change its intercept on the y-axis). Lower the line, as happens when the task is hard, and the biggest gap between actual and perceived performance shows up at the high end; raise the line, as happens when the task is easy, and the biggest gap shows up at the low end.

To get an intuitive sense of what’s happening here, just think of it this way: if you’re performing a very difficult task, you’re probably going to find the experience subjectively demanding even if you’re at the high end relative to other people. Since people’s judgments about their own relative standing depend to a substantial extent on their subjective perception of their own performance (i.e., you use your sense of how easy a task was as a proxy for how good you must be at it), high performers are going to end up systematically underestimating how well they did. When a task is difficult, most people assume they must have done relatively poorly compared to other people. Conversely, when a task is relatively easy (and the tasks Dunning and Kruger studied were on the easier side), most people assume they must be pretty good compared to others. As a result, it’s going to look like the people who perform well are well-calibrated when the task is easy and poorly-calibrated when the task is difficult; less competent people are going to show exactly the opposite pattern. And note that this doesn’t require us to assume any relationship between actual performance and perceived performance. You would expect to get the Dunning-Kruger effect for easy tasks even if there was exactly zero correlation between how good people actually are at something and how good they think they are.
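Here’s a sketch of that last point (toy numbers again, in the spirit of the “noisy estimates plus overall bias” account rather than a reproduction of Burson et al’s analyses): perceived standing is pure noise around a mean set only by task difficulty, completely uncorrelated with actual standing, and the easy task still yields the classic Dunning-Kruger picture while the hard task reverses it.

```python
# Sketch: perceived percentile is driven entirely by an overall bias tied to
# task difficulty plus noise, and is uncorrelated with actual percentile.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
actual_pct = rng.uniform(0, 100, size=n)  # actual percentile rank

for label, mean_perceived in [("easy task (positive bias)", 65),
                              ("hard task (negative bias)", 35)]:
    perceived_pct = np.clip(rng.normal(mean_perceived, 15, size=n), 0, 100)
    print(label)
    for q in range(4):
        mask = (actual_pct >= 25 * q) & (actual_pct < 25 * (q + 1))
        gap = perceived_pct[mask].mean() - actual_pct[mask].mean()
        print(f"  actual quartile {q + 1}: mean (perceived - actual) = {gap:+5.1f}")
# On the easy task the bottom quartile looks wildly overconfident while the
# top quartile looks nearly calibrated; on the hard task the pattern flips.
```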

Here’s how Burson et al summarized their findings:

Our studies replicate, eliminate, or reverse the association between task performance and judgment accuracy reported by Kruger and Dunning (1999) as a function of task difficulty. On easy tasks, where there is a positive bias, the best performers are also the most accurate in estimating their standing, but on difficult tasks, where there is a negative bias, the worst performers are the most accurate. This pattern is consistent with a combination of noisy estimates and overall bias, with no need to invoke differences in metacognitive abilities. In this  regard, our findings support Krueger and Mueller’s (2002) reinterpretation of Kruger and Dunning’s (1999) findings. An association between task-related skills and metacognitive insight may indeed exist, and later we offer some suggestions for ways to test for it. However, our analyses indicate that the primary drivers of errors in judging relative standing are general inaccuracy and overall biases tied to task difficulty. Thus, it is important to know more about those sources of error in order to better understand and ameliorate them.

What should we conclude from these (and other) studies? I think the jury’s still out to some extent, but at minimum, I think it’s clear that much of the Dunning-Kruger effect reflects either statistical artifact (regression to the mean), or much more general cognitive biases (the tendency to self-enhance and/or to use one’s subjective experience as a guide to one’s standing in relation to others). This doesn’t mean that the meta-cognitive explanation preferred by Dunning, Kruger and colleagues can’t hold in some situations; it very well may be that in some cases, and to some extent, people’s lack of skill is really what prevents them from accurately determining their standing in relation to others. But I think our default position should be to prefer the alternative explanations I’ve discussed above, because they’re (a) simpler, (b) more general (they explain lots of other phenomena), and (c) necessary (frankly, it’d be amazing if regression to the mean didn’t explain at least part of the effect!).

We should also try to be aware of another very powerful cognitive bias whenever we use the Dunning-Kruger effect to explain the people or situations around us–namely, confirmation bias. If you believe that incompetent people don’t know enough to know they’re incompetent, it’s not hard to find anecdotal evidence for that; after all, we all know people who are both arrogant and not very good at what they do. But if you stop to look for it, it’s probably also not hard to find disconfirming evidence. After all, there are clearly plenty of people who are good at what they do, but not nearly as good as they think they are (i.e., they’re above average, and still totally miscalibrated in the positive direction). Just like there are plenty of people who are lousy at what they do and recognize their limitations (e.g., I don’t need to be a great runner in order to be able to tell that I’m not a great runner–I’m perfectly well aware that I have terrible endurance, precisely because I can’t finish runs that most other runners find trivial!). But the plural of anecdote is not data, and the data appear to be equivocal. Next time you’re inclined to chalk your obnoxious co-worker’s delusions of grandeur down to the Dunning-Kruger effect, consider the possibility that your co-worker’s simply a jerk–no meta-cognitive incompetence necessary.

Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121-1134. PMID: 10626367
Krueger, J., & Mueller, R. A. (2002). Unskilled, unaware, or both? The better-than-average heuristic and statistical regression predict errors in estimates of own performance. Journal of Personality and Social Psychology, 82(2), 180-188. PMID: 11831408
Burson, K. A., Larrick, R. P., & Klayman, J. (2006). Skilled or unskilled, but still unaware of it: How perceptions of difficulty drive miscalibration in relative comparisons. Journal of Personality and Social Psychology, 90(1), 60-77. PMID: 16448310

will trade two Methods sections for twenty-two subjects’ worth of data

The excellent and ever-candid Candid Engineer in Academia has an interesting post discussing the love-hate relationship many scientists who work in wet labs have with benchwork. She compares two very different perspectives:

She [a current student] then went on to say that, despite wanting to go to grad school, she is pretty sure she doesn’t want to continue in academia beyond the Ph.D. because she just loves doing the science so much and she can’t imagine ever not being at the bench.

Being young and into the benchwork, I remember once asking my grad advisor if he missed doing experiments. His response: “Hell no.” I didn’t understand it at the time, but now I do. So I wonder if my student will always feel the way she does now- possessing of that unbridled passion for the pipet, that unquenchable thirst for the cell culture hood.

Wet labs are pretty much nonexistent in psychology–I’ve never had to put on gloves or goggles to do anything that I’d consider an “experiment”, and I’ve certainly never run the risk of spilling dangerous chemicals all over myself–so I have no opinion at all about benchwork. Maybe I’d love it, maybe I’d hate it; I couldn’t tell you. But Candid Engineer’s post did get me thinking about opinions surrounding the psychological equivalent of benchwork–namely, collecting data from human subjects. My sense is that there’s somewhat more consensus among psychologists, in that most of us don’t seem to like data collection very much. But there are plenty of exceptions, and there certainly are strong feelings on both sides.

More generally, I’m perpetually amazed at the wide range of opinions people can hold about the various elements of scientific research, even when the people doing the different-opinion-holding all work in very similar domains. For instance, my favorite aspect of the research I do, hands down, is data analysis. I’d be ecstatic if I could analyze data all day and never have to worry about actually communicating the results to anyone (though I enjoy doing that too). After that, there are activities like writing and software development, which I spend a lot of time doing, and occasionally enjoy, but also frequently find very frustrating. And then, at the other end, there are aspects of research that I find have little redeeming value save for their instrumental value in supporting other, more pleasant, activities–nasty, evil activities like writing IRB proposals and, yes, collecting data.

To me, collecting data is something you do because you’re fundamentally interested in some deep (or maybe not so deep) question about how the mind works, and the only way to get an answer is to actually interrogate people while they do stuff in a controlled environment. It isn’t something I do for fun. Yet I know people who genuinely seem to love collecting data–or, for that matter, writing Methods sections or designing new experiments–even as they loathe perfectly pleasant activities like, say, sitting down to analyze the data they’ve collected, or writing a few lines of code that could save them hours’ worth of manual data entry. On a personal level, I find this almost incomprehensible: how could anyone possibly enjoy collecting data more than actually crunching the numbers and learning new things? But I know these people exist, because I’ve talked to them. And I recognize that, from their perspective, I’m the guy with the strange views. They’re sitting there thinking: what kind of joker actually likes to turn his data inside out several dozen times? What’s wrong with just running a simple t-test and writing up the results as fast as possible, so you can get back to the pleasure of designing and running new experiments?

This of course leads us directly to the care bears fucking tea party moment where I tell you how wonderful it is that we all have these different likes and dislikes. I’m not being sarcastic; it really is great. Ultimately, it works to everyone’s advantage that we enjoy different things, because it means we get to collaborate on projects and take advantage of complementary strengths and interests, instead of all having to fight over who gets to write the same part of the Methods section. It’s good that there are some people who love benchwork and some people who hate it, and it’s good that there are people who’re happy to write software that other people who hate writing software can use. We don’t all have to pretend we understand each other; it’s enough just to nod and smile and say “but of course you can write the Methods for that paper; I really don’t mind. And yes, I guess I can run some additional analyses for you, really, it’s not too much trouble at all.”

a possible link between pesticides and ADHD

A forthcoming article in the journal Pediatrics that’s been getting a lot of press attention suggests that exposure to common pesticides may be associated with a substantially elevated risk of ADHD. More precisely, what the study found was that elevated urinary concentrations of organophosphate metabolites were associated with an increased likelihood of meeting criteria for an ADHD diagnosis. One of the nice things about this study is that the authors used archival data from the (very large) National Health and Nutrition Examination Survey (NHANES), so they were able to control for a relatively broad range of potential confounds (e.g., gender, age, SES, etc.). The primary finding is, of course, still based on observational data, so you wouldn’t necessarily want to conclude that exposure to pesticides causes ADHD. But it’s a finding that converges with previous work in animal models demonstrating that high exposure to organophosphate pesticides causes neurodevelopmental changes, so it’s by no means a crazy hypothesis.

I think it’s really pleasantly surprising to see how responsibly the popular press has covered this story (e.g., this, this, and this). Despite the obvious potential for alarmism, very few articles have led with a headline implying a causal link between pesticides and ADHD. They all say things like “associated with”, “tied to”, or “linked to”, which is exactly right. And many even explicitly mention the size of the effect in question–namely, approximately a 50% increase in risk of ADHD per 10-fold increase in concentration of pesticide metabolites. Given that most of the articles contain cautionary quotes from the study’s authors, I’m guessing the authors really emphasized the study’s limitations when dealing with the press, which is great. In any case, because the basic details of the study have already been amply described elsewhere (I thought this short CBS article was particularly good), I’ll just mention a few random thoughts here:

  • Often, epidemiological studies suffer from a gaping flaw in the sense that the more interesting causal story (and the one that prompts media attention) is far less plausible than other potential explanations (a nice example of this is the recent work on the social contagion of everything from obesity to loneliness). That doesn’t seem to be the case here. Obviously, there are plenty of other reasons you might get a correlation between pesticide metabolites and ADHD risk–for instance, ADHD is substantially heritable, so it could be that parents with a disposition to ADHD also have systematically different dietary habits (i.e., parental dispositions are a common cause of both urinary metabolites and ADHD status in children). But given the aforementioned experimental evidence, it’s not obvious that alternative explanations for the correlation are much more plausible than the causal story linking pesticide exposure to ADHD, so in that sense this is potentially a very important finding.
  • The use of a dichotomous dependent variable (i.e., children either meet criteria for ADHD or don’t; there are no shades of ADHD gray here) is a real problem in this kind of study, because it can make the resulting effects seem deceptively large. The intuitive way we think about the members of a category is to think in terms of prototypes, so that when you think about “ADHD” and “Not-ADHD” categories, you’re probably mentally representing an extremely hyperactive, inattentive child for the former, and a quiet, conscientious kid for the latter. If that’s your mental model, and someone comes along and tells you that pesticide exposure increases the risk of ADHD by 50%, you’re understandably going to freak out, because it’ll seem quite natural to interpret that as a statement that pesticides have a 50% chance of turning average kids into hyperactive ones. But that’s not the right way to think about it. In all likelihood, pesticides aren’t causing a small proportion of kids to go from perfectly average to completely hyperactive; instead, what’s probably happening is that the entire distribution is shifting over slightly. In other words, most kids who are exposed to pesticides (if we assume for the sake of argument that there really is a causal link) are becoming slightly more hyperactive and/or inattentive.
  • Put differently, what happens when you have a strict cut-off for diagnosis is that even small increases in underlying symptoms can result in a qualitative shift in category membership (see the sketch after this list). If ADHD symptoms were measured on a continuous scale (which they actually probably were, before being dichotomized to make things simple and more consistent with previous work), these findings might have been reported as something like “a 10-fold increase in pesticide exposures is associated with a 2-point increase on a 30-point symptom scale,” which would have made it much clearer that, at worst, pesticides are only one of many contributing factors to ADHD, and almost certainly not nearly as big a factor as some others. That’s not to say we shouldn’t be concerned if subsequent work supports a causal link, but just that we should retain perspective on what’s involved. No one’s suggesting that you’re going to feed your child an unwashed pear or two and end up with a prescription for Ritalin; the more accurate view would be that you might have a minority of kids who are already at risk for ADHD, and this would be just one more precipitating factor.
  • It’s also worth keeping in mind that the relatively large increase in ADHD risk is associated with a ten-fold increase in pesticide metabolites. As the authors note, that corresponds to the difference between the 25th and 75th percentiles in the sample. Although we don’t know exactly what that means in terms of real-world exposure to pesticides (because the authors didn’t have any data on grocery shopping or eating habits), it’s almost certainly a very sizable difference (I won’t get into the reasons why, except to note that the rank-order of pesticide metabolites must be relatively stable among children, or else there wouldn’t be any association with a temporally-extended phenotype like ADHD). So the point is, it’s probably not so easy to go from the 25th to the 75th percentile just by eating a few more fruits and vegetables here and there. So while it’s certainly advisable to try and eat better, and potentially to buy organic produce (if you can afford it), you shouldn’t assume that you can halve your child’s risk of ADHD simply by changing his or her diet slightly. These are, at the end of the day, small effects.
  • The authors report that fully 12% of children in this nationally representative sample met criteria for ADHD (mostly of the inattentive subtype). This, frankly, says a lot more about how silly the diagnostic criteria for ADHD are than about the state of the nation’s children. It’s just not plausible to suppose that 1 in 8 children really suffer from what is, in theory at least, a severe, potentially disabling disorder. I’m not trying to trivialize ADHD or argue that there’s no such thing, but simply to point out the dangers of medicalization. Once you’ve reached the point where 1 in every 8 people meet criteria for a serious disorder, the label is in danger of losing all meaning.
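To put some rough numbers on the dichotomization point above (my own illustrative assumptions, not the study’s actual symptom distribution: a normally distributed symptom score and a cutoff placed so that about 12% of children qualify at baseline), a fairly modest shift in the underlying continuum is all it takes to produce something like a 50% relative increase in “risk”:

```python
# Sketch: how a small shift in a continuous symptom score translates into a
# large relative increase in the proportion crossing a diagnostic cutoff.
from scipy.stats import norm

baseline_prevalence = 0.12                  # roughly the rate reported in the sample
cutoff = norm.ppf(1 - baseline_prevalence)  # symptom z-score needed for a diagnosis

for shift in (0.10, 0.20, 0.25, 0.30):      # hypothetical mean shifts, in SD units
    prevalence = 1 - norm.cdf(cutoff - shift)
    rel_increase = prevalence / baseline_prevalence - 1
    print(f"shift of {shift:.2f} SD -> prevalence {prevalence:.1%} "
          f"({rel_increase:+.0%} vs. baseline)")
# A shift of about a quarter of a standard deviation already gets you close to
# a 50% relative increase, even though the average child's symptom level has
# barely moved.
```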

Bouchard, M., Bellinger, D., Wright, R., & Weisskopf, M. (2010). Attention-Deficit/Hyperactivity Disorder and Urinary Metabolites of Organophosphate Pesticides. Pediatrics. DOI: 10.1542/peds.2009-3058

in defense of three of my favorite sayings

Seth Roberts takes issue with three popular maxims that (he argues) people use “to push away data that contradicts this or that approved view of the world”. He terms this preventive stupidity. I’m a frequent user of all three sayings, so I suppose that might make me preventively stupid; but I do feel like I have good reasons for using these sayings, and I confess to not really seeing Roberts’ point.

Here’s what Roberts has to say about the three sayings in question:

1. Absence of evidence is not evidence of absence. Øyhus explains why this is wrong. That such an Orwellian saying is popular in discussions of data suggests there are many ways we push away inconvenient data.

In my own experience, by far the biggest reason this saying is popular in discussions of data (and the primary reason I use it when reviewing papers) is that many people have a very strong tendency to interpret null results as an absence of any meaningful effect. That’s a very big problem, because the majority of studies in psychology tend to have relatively little power to detect small to moderate-sized effects. For instance, as I’ve discussed here, most whole-brain analyses in typical fMRI samples (of say, 15 – 20 subjects) have very little power to detect anything but massive effects. And yet people routinely interpret a failure to detect hypothesized effects as an indication that they must not exist at all. The simplest and most direct counter to this type of mistake is to note that one shouldn’t accept the null hypothesis unless one has very good reasons to think that power is very high and effect size estimates are consequently quite accurate. Which is just another way of saying that absence of evidence is not evidence of absence.
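To illustrate the power point with a generic example (a simple two-group comparison using sample sizes in the ballpark mentioned above, not an actual fMRI analysis): with 20 subjects per group and a genuinely medium-sized true effect, a null result is the most likely single outcome, which is exactly why it can’t be read as evidence that the effect is absent.

```python
# Sketch: power of a two-sample t-test with n = 20 per group and a true
# effect of Cohen's d = 0.5, estimated by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d, n_sims = 20, 0.5, 10_000

p_values = np.empty(n_sims)
for i in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    p_values[i] = stats.ttest_ind(treatment, control).pvalue

print(f"Power at alpha = .05: {np.mean(p_values < 0.05):.2f}")
# Typically comes out around 0.3-0.35: roughly two times out of three, a real,
# medium-sized effect produces a "null" result at this sample size.
```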

2. Correlation does not equal causation. In practice, this is used to mean that correlation is not evidence for causation. At UC Berkeley, a job candidate for a faculty position in psychology said this to me. I said, “Isn’t zero correlation evidence against causation?” She looked puzzled.

Again, Roberts’ experience clearly differs from mine; I’ve far more often seen this saying used as a way of suggesting that a researcher may be drawing overly strong causal conclusions from the data, not as a way of simply dismissing a correlation outright. A good example of this is found in the developmental literature, where many researchers have observed strong correlations between parents’ behavior and their children’s subsequent behavior. It is, of course, quite plausible to suppose that parenting behavior exerts a direct causal influence on children’s behavior, so that the children of negligent or abusive parents are more likely to exhibit delinquent behavior and grow up to perpetuate the “cycle of violence”. But this line of reasoning is substantially weakened by behavioral genetic studies indicating that very little of the correlation between parents’ and children’s personalities is explained by shared environmental factors, and that the vast majority reflects heritable influences and/or unique environmental influences. Given such findings, it’s a perfectly appropriate rebuttal to much of the developmental literature to note that correlation doesn’t imply causation.

It’s also worth pointing out that the anecdote Roberts provides isn’t exactly a refutation of the maxim; it’s actually an affirmation of the consequent. The fact that an absence of any correlation could potentially be strong evidence against causation (under the right circumstances) doesn’t mean that the presence of a correlation is strong evidence for causation. It may or may not be, but that’s something to be weighed on a case-by-case basis. There certainly are plenty of cases where it’s perfectly appropriate (and even called for) to remind someone that correlation doesn’t imply causation.

3. The plural of anecdote is not data. How dare you try to learn from stories you are told or what you yourself observe!

I suspect this is something of a sore spot for Roberts, who’s been an avid proponent of self-experimentation and case studies. I imagine people often dismiss his work as mere anecdote rather than valuable data. Personally, I happen to think there’s tremendous value to self-experimentation (at least when done in as controlled a manner as possible), so I don’t doubt there are many cases where this saying is unfairly applied. That said, I think Roberts fails to appreciate that people who do his kind of research constitute a tiny fraction of the population. Most of the time, when someone says that “the plural of anecdote is not data,” they’re not talking to someone who does rigorous self-experimentation, but to people who, say, don’t believe they should give up smoking seeing as how their grandmother smoked till she was 88 and died in a bungee-jumping accident, or who are convinced that texting while driving is perfectly acceptable because they don’t personally know anyone who’s gotten in an accident. In such cases, it’s not only legitimate but arguably desirable to point out that personal anecdote is no substitute for hard data.

Orwell was right. People use these sayings — especially #1 and #3 — to push away data that contradicts this or that approved view of the world. Without any data at all, the world would be simpler: We would simply believe what authorities tell us. Data complicates things. These sayings help those who say them ignore data, thus restoring comforting certainty.

Maybe there should be a term (antiscientific method?) to describe the many ways people push away data. Or maybe preventive stupidity will do.

I’d like to be charitable here, since there very clearly are cases where Roberts’ point holds true: sometimes people do toss out these sayings as a way of not really contending with data they don’t like. But frankly, the general claim that these sayings are antiscientific and constitute an act of stupidity just seems silly. All three sayings are clearly applicable in a large number of situations; to deny that, you’d have to believe that (a) it’s always fine to accept the null hypothesis, (b) correlation is always a good indicator of a causal relationship, and (c) personal anecdotes are just as good as large, well-controlled studies. I take it that no one, including Roberts, really believes that. So then it becomes a matter of when to apply these sayings, and not whether or not to use them. After all, it’d be silly to think that the people who use these sayings are always on the side of darkness, and the people who wield null results, correlations, and anecdotes with reckless abandon are always on the side of light.

My own experience, for what it’s worth, is that the use of these sayings is justified far more often than not, and I don’t have any reservation applying them myself when I think they’re warranted (which is relatively often–particularly the first one). But I grant that that’s just my own personal experience talking, and no matter how many experiences I’ve had of people using these sayings appropriately, I’m well aware that the plural of anecdote…

de Waal and Ferrari on cognition in humans and animals

Humans do many things that most animals can’t. That much no one would dispute. The more interesting and controversial question is just how many things we can do that most animals can’t, and just how many animal species can or can’t do the things we do. That question is at the center of a nice opinion piece in Trends in Cognitive Sciences by Frans de Waal and Pier Francesco Ferrari.

De Waal and Ferrari argue for what they term a bottom-up approach to human and animal cognition. The fundamental idea–which isn’t new, and in fact owes much to decades of de Waal’s own work with primates–is that most of our cognitive abilities, including many that are often characterized as uniquely human, are in fact largely continuous with abilities found in other species. De Waal and Ferrari highlight a number of putatively “special” functions like imitation and empathy that turn out to have relatively frequent primate (and in some cases non-primate) analogs. They push for a bottom-up scientific approach that seeks to characterize the basic mechanisms that complex functionality might have arisen out of, rather than (what they see as) “the overwhelming tendency outside of biology to give human cognition special treatment.”

Although I agree pretty strongly with the thesis of the paper, its scope is also, in some ways, quite limited: De Waal and Ferrari clearly believe that many complex functions depend on homologous mechanisms in both humans and non-human primates, but they don’t actually say very much about what these mechanisms might be, save for some brief allusions to relatively broad neural circuits (e.g., the oft-criticized mirror neuron system, which Ferrari played a central role in identifying and characterizing). To some extent that’s understandable given the brevity of TICS articles, but given how much de Waal has written about primate cognition, it would have been nice to see a more detailed example of the types of cognitive representations de Waal thinks underlie, say, the homologous abilities of humans and capuchin monkeys to empathize with conspecifics.

Also, despite its categorization as an “Opinion” piece (these are supposed to stir up debate), I don’t think many people (at least, the kind of people who read TICS articles) are going to take issue with the basic continuity hypothesis advanced by de Waal and Ferrari. I suspect many more people would agree than disagree with the notion that most complex cognitive abilities displayed by humans share a closely intertwined evolutionary history with seemingly less sophisticated capacities displayed by primates and other mammalian species. So in that sense, de Waal and Ferrari might be accused of constructing something of a straw man. But it’s important to recognize that de Waal’s own work is a very large part of the reason why the continuity hypothesis is so widely accepted these days. So in that sense, even if you already agree with its premise, the TICS paper is worth reading simply as an elegant summary of a long-standing and important line of research.

more on the absence of brain training effects

A little while ago I blogged about the recent Owen et al Nature study on the (null) effects of cognitive training. My take on the study, which found essentially no effect of cognitive training on generalized cognitive performance, was largely positive. In response, Martin Walker, founder of Mind Sparke, maker of Brain Fitness Pro software, left this comment:

I’ve done regular aerobic training for pretty much my entire life, but I’ve never had the kind of mental boost from exercise that I have had from dual n-back training. I’ve also found that n-back training helps my mood.

There was a foundational problem with the BBC study in that it didn’t provide anywhere near the intensity of training that would be required to show transfer. The null hypothesis was a forgone conclusion. It seems hard to believe that the scientists didn’t know this before they began and were setting out to debunk the populist brain game hype.

I think there are a couple of points worth making. One is the standard rejoinder that one anecdotal report doesn’t count for very much. That’s not meant as a jibe at Walker in particular, but simply as a general observation about the fallibility of human judgment. Many people are perfectly convinced that homeopathic solutions have dramatically improved their quality of life, but that doesn’t mean we should take homeopathy seriously. Of course, I’m not suggesting that cognitive training programs are as ineffectual as homeopathy–in my post, I suggested they may well have some effect–but simply that personal testimonials are no substitute for controlled studies.

With respect to the (also anecdotal) claim that aerobic exercise hasn’t worked for Walker, it’s worth noting that the effects of aerobic exercise on cognitive performance take time to develop. No one expects a single brisk 20-minute jog to dramatically improve cognitive performance. If you’ve been exercising regularly your whole life, the question isn’t whether exercise will improve your cognitive function–it’s whether not doing any exercise for a month or two would lead to poorer performance. That is, if Walker stopped exercising, would his cognitive performance suffer? It would be a decidedly unhealthy hypothesis to test, of course, but that would really be the more reasonable prediction. I don’t think anyone thinks that a person in excellent physical condition would benefit further from physical exercise; the point is precisely that most people aren’t in excellent physical shape. In any event, as I noted in my post, the benefits of aerobic exercise are clearly largest for older adults who were previously relatively sedentary. There’s much less evidence for large effects of aerobic exercise on cognitive performance in young or middle-aged adults.

The more substantive question Walker raises has to do with whether the tasks Owen et al used were too easy to support meaningful improvement. I think this is a reasonable question, but I don’t think the answer is as straightforward as Walker suggests. For one thing, participants in the Owen et al study did show substantial gains in performance on the training tasks (just not the untrained tasks), so it’s not like they were at ceiling. That is, the training tasks clearly weren’t easy. Second, participants varied widely in the number of training sessions they performed, and yet, as the authors note, the correlation between amount of training and cognitive improvement was negligible. So if you extrapolate from the observed pattern, it doesn’t look particularly favorable. Third, Owen et al used 12 different training tasks that spanned a broad range of cognitive abilities. While one can quibble with any individual task, it’s hard to reconcile the overall pattern of null results with the notion that cognitive training produces robust effects. Surely at least some of these measures should have led to a noticeable overall effect if they successfully produced transfer. But they didn’t.

To reiterate what I said in my earlier post, I’m not saying that cognitive training has absolutely no effect. No study is perfect, and it’s conceivable that more robust effects might be observed given a different design. But the Owen et al study is, to put it bluntly, the largest study of cognitive training conducted to date by about two orders of magnitude, and that counts for a lot in an area of research dominated by relatively small studies that have generally produced mixed findings. So, in the absence of contradictory evidence from another large training study, I don’t see any reason to second-guess Owen et al’s conclusions.

Lastly, I don’t think Walker is in any position to cast aspersions on people’s motivations (“It seems hard to believe that the scientists didn’t know this before they began and were setting out to debunk the populist brain game hype”). While I don’t think that his financial stake in brain training programs necessarily impugns his evaluation of the Owen et al study, it can’t exactly promote impartiality either. And for what it’s worth, I dug around the Mind Sparke website and couldn’t find any “scientific proof” that the software works (which is what the website claims)–just some vague allusions to customer testimonials and some citations of other researchers’ published work (none of which, as far as I can tell, used Brain Fitness Pro for training).

undergraduates are WEIRD

This month’s issue of Nature Neuroscience contains an editorial lambasting the excessive reliance of psychologists on undergraduate college samples, which, it turns out, are pretty unrepresentative of humanity at large. The impetus for the editorial is a mammoth in-press review of cross-cultural studies by Joseph Henrich and colleagues, which, the authors suggest, collectively indicate that “samples drawn from Western, Educated, Industrialized, Rich and Democratic (WEIRD) societies … are among the least representative populations one could find for generalizing about humans.” I’ve only skimmed the article, but aside from the clever acronym, you could do a lot worse than these (rather graphic) opening paragraphs:

In the tropical forests of New Guinea the Etoro believe that for a boy to achieve manhood he must ingest the semen of his elders. This is accomplished through ritualized rites of passage that require young male initiates to fellate a senior member (Herdt, 1984; Kelley, 1980). In contrast, the nearby Kaluli maintain that  male initiation is only properly done by ritually delivering the semen through the initiate’s anus, not his mouth. The Etoro revile these Kaluli practices, finding them disgusting. To become a man in these societies, and eventually take a wife, every boy undergoes these initiations. Such boy-inseminating practices, which  are enmeshed in rich systems of meaning and imbued with local cultural values, were not uncommon among the traditional societies of Melanesia and Aboriginal Australia (Herdt, 1993), as well as in Ancient Greece and Tokugawa Japan.

Such in-depth studies of seemingly “exotic” societies, historically the province of anthropology, are crucial for understanding human behavioral and psychological variation. However, this paper is not about these peoples. It’s about a truly unusual group: people from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies. In particular, it’s about the Western, and more specifically American, undergraduates who form the bulk of the database in the experimental branches of psychology, cognitive science, and economics, as well as allied fields (hereafter collectively labeled the “behavioral sciences”). Given that scientific knowledge about human psychology is largely based on findings from this subpopulation, we ask just how representative are these typical subjects in light of the available comparative database. How justified are researchers in assuming a species-level generality for their findings? Here, we review the evidence regarding how WEIRD people compare to other populations.

Anyway, it looks like a good paper. Based on a cursory read, the conclusions the authors draw seem pretty reasonable, if a bit strong. I think most researchers do already recognize that our dependence on undergraduates is unhealthy in many respects; it’s just that it’s difficult to break the habit, because the alternative is to spend a lot more time and money chasing down participants (and there are limits to that too; it just isn’t feasible for most researchers to conduct research with Etoro populations in New Guinea). Then again, just because it’s hard to do science the right way doesn’t really make it OK to do it the wrong way. So, to the extent that we care about our results generalizing across the entire human species (which, in many cases, we don’t), we should probably be investing more energy in weaning ourselves off undergraduates and trying to recruit more diverse samples.

cognitive training doesn’t work (much, if at all)

There’s a beautiful paper in Nature this week by Adrian Owen and colleagues that provides what’s probably as close to definitive evidence as you can get in any single study that “brain training” programs don’t work. Or at least, to the extent that they do work, the effects are so weak they’re probably not worth caring about.

Owen et al used a very clever approach to demonstrate their point. Rather than spending their time running small-sample studies that require people to come into the lab over multiple sessions (an expensive and very time-intensive effort that’s ultimately still usually underpowered), they teamed up with the BBC program ‘Bang Goes The Theory‘. Participants were recruited via the tv show, and were directed to an experimental website where they created accounts, engaged in “pre-training” cognitive testing, and then could repeatedly log on over the course of six weeks to perform a series of cognitive tasks supposedly capable of training executive abilities. After the training period, participants again performed the same battery of cognitive tests, enabling the researchers to compare performance pre- and post-training.

Of course, you expect robust practice effects with this kind of thing (i.e., participants would almost certainly do better on the post-training battery than on the pre-training battery solely because they’d been exposed to the tasks and had some practice). So Owen et al randomly assigned participants logging on to the website to two different training programs (involving different types of training tasks) or to a control condition in which participants answered obscure trivia questions rather than doing any sort of intensive cognitive training per se. The beauty of doing this all online was that the authors were able to obtain gargantuan sample sizes (several thousand in each condition), ensuring that statistical power wasn’t going to be an issue. Indeed, Owen et al focus almost exclusively on effect sizes rather than p values, because, as they point out, once you have several thousand participants in each group, almost everything is going to be statistically significant, so it’s really the effect sizes that matter.
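To see why with a quick sketch (simulated data with made-up numbers, not Owen et al’s): once you have thousands of participants per group, even a trivially small true difference comes out “statistically significant” nearly every time, so the p value by itself tells you almost nothing about whether the effect is worth caring about.

```python
# Sketch: with several thousand participants per group, a tiny true effect
# (Cohen's d = 0.1) is "significant" in nearly every simulated study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, true_d, n_sims = 4000, 0.1, 1000

n_significant = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    training = rng.normal(true_d, 1.0, n_per_group)
    n_significant += stats.ttest_ind(training, control).pvalue < 0.05

print(f"Fraction of simulated studies with p < .05: {n_significant / n_sims:.2f}")
# Comes out near 1.0, even though d = 0.1 is negligible by any practical
# standard, which is why the effect sizes, not the p values, are the story.
```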

The critical comparison was whether the experimental groups showed greater improvements in performance post-training than the control group did. And the answer, generally speaking, was no. Across four different tasks, the differences in training-related gains in the experimental group relative to the control group were always either very small (no larger than about a fifth of a standard deviation), or even nonexistent (to the extent that for some comparisons, the control group improved more than the experimental groups!). So the upshot is that if there is any benefit of cognitive training (and it’s not at all clear that there is, based on the data), it’s so small that it’s probably not worth caring about. Here’s the key figure:

[Figure from Owen et al (2010): pre-training (light bars) vs. post-training (dark bars) performance for the two experimental groups and the control group.]

You could argue that the fact the y-axis spans the full range of possible values (rather than fitting the range of observed variation) is a bit misleading, since it’s only going to make any effects seem even smaller. But even so, it’s pretty clear these are not exactly large effects (and note that the key comparison is not the difference between light and dark bars, but the relative change from light to dark across the different groups).

Now, people who are invested (either intellectually or financially) in the efficacy of cognitive training programs might disagree, arguing that an effect of one-fifth of a standard deviation isn’t actually a tiny effect, and that there are arguably many situations in which that would be a meaningful boost in performance. But that’s the best possible estimate, and probably overstates the actual benefit. And there’s also the opportunity cost to consider: the average participant completed 20 – 30 training sessions, which, even at just 20 minutes a session (an estimate based on the description of the length of each of the training tasks), would take about 8 – 10 hours to complete (and some participants no doubt spent many more hours in training).  That’s a lot of time that could have been invested in other much more pleasant things, some of which might also conceivably improve cognitive ability (e.g., doing Sudoku puzzles, which many people actually seem to enjoy). Owen et al put it nicely:

To illustrate the size of the transfer effects observed in this study, consider the following representative example from the data. The increase in the number of digits that could be remembered following training on tests designed, at least in part, to improve memory (for example, in experimental group 2) was three-hundredth of a digit. Assuming a linear relationship between time spent training and improvement, it would take almost four years of training to remember one extra digit. Moreover, the control group improved by two-tenths of a digit, with no formal memory training at all.
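As a rough check on that extrapolation (assuming, as I read it, that the 0.03-digit gain accrued over the full six-week training period):

```python
# Back-of-the-envelope check of the "almost four years" figure in the quote,
# assuming the 0.03-digit improvement reflects six weeks of training.
weeks_per_digit = 6 / 0.03  # weeks of training per one extra digit
print(f"{weeks_per_digit / 52:.1f} years per extra digit")  # ~3.8 years
```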

If someone asked you if you wanted to spend six weeks doing a “brain training” program that would provide those kinds of returns, you’d probably politely (or impolitely) refuse. Especially since it’s not like most of us spend much of our time doing digit span tasks anyway; odds are that the kinds of real-world problems we’d like to perform a little better at (say, something trivial like figuring out what to buy or not to buy at the grocery store) are even further removed from the tasks Owen et al (and other groups) have used to test for transfer, so any observable benefits in the real world would presumably be even smaller.

Of course, no study is perfect, and there are three potential concerns I can see. The first is that it’s possible that there are subgroups within the tested population who do benefit much more from the cognitive training. That is, the minuscule overall effect could be masking heterogeneity within the sample, such that some people (say, maybe men above 60 with poor diets who don’t like intellectual activities) benefit much more. The trouble with this line of reasoning, though, is that the overall effects in the entire sample are so small that you’re pretty much forced to conclude that either (a) any group that benefits substantially from the training is a very small proportion of the total sample, or (b) there are actually some people who suffer as a result of cognitive training, effectively balancing out the gains seen by other people. Neither of these possibilities seems particularly attractive.

The second concern is that it’s conceivable that the control group isn’t perfectly matched to the experimental group, because, by the authors’ own admission, the retention rate was much lower in the control group. Participants were randomly assigned to the three groups, but only about two-thirds as many control participants completed the study. The higher drop-out rate was apparently due to the fact that the obscure trivia questions used as a control task were pretty boring. The reason that’s a potential problem is that attrition wasn’t random, so there may be a systematic difference between participants in the experimental conditions and those in the control conditions. In particular, it’s possible that the remaining control participants had a higher tolerance for boredom and/or were somewhat smarter or more intellectual on average (answering obscure trivia questions clearly isn’t everyone’s cup of tea). If that were true, the lack of any difference between experimental and control conditions might be due to participant differences rather than an absence of a true training effect. Unfortunately, it’s hard to determine whether this might be true, because (as far as I can tell) Owen et al don’t provide the raw mean performance scores on the pre- and post-training testing for each group, but only report the changes in performance. What you’d want to know is that the control participants didn’t do substantially better or worse on the pre-training testing than the experimental participants (due to selective attrition of low-performing subjects), which might make changes in performance difficult to interpret. But at face value, it doesn’t seem very plausible that this would be a serious issue.

Lastly, Owen et al do report a small positive correlation between the number of training sessions performed (which was under participants’ control) and gains in performance on the post-training test. Now, this effect was, as the authors note, very small (a maximal Spearman’s rho of .06), so it’s not likely to have any practical implications either. Still, it does suggest that performance increases as a function of practice. So if we’re being pedantic, we should say that intensive cognitive training may improve cognitive performance in a generalized way, but that the effect is really minuscule and probably not worth the time and effort required to do the training in the first place. Which isn’t exactly the type of careful and measured claim that the people who sell brain training programs are generally interested in making.

At any rate, setting aside the debate over whether cognitive training works or not, one thing that’s perplexed me for a long time about the training literature is why people focus to such an extent on cognitive training rather than other training regimens that produce demonstrably larger transfer effects. I’m thinking in particular of aerobic exercise, which produces much more robust and replicable effects on cognitive performance. There’s a nice meta-analysis by Colcombe and colleagues that found effect sizes on the order of half a standard deviation and up for physical exercise in older adults–and effects were particularly large for the most heavily g-loaded tasks. Now, even if you allow for publication bias and other manifestations of the fudge factor, it’s almost certain that the true effect of physical exercise on cognitive performance is substantially larger than the (very small) effects of cognitive training as reported by Owen et al and others.

The bottom line is that, based on everything we know at the moment, the evidence seems to pretty strongly suggest that if your goal is to improve cognitive function, you’re more likely to see meaningful results by jogging or swimming regularly than by doing crossword puzzles or N-back tasks–particularly if you’re older. And of course, a pleasant side effect is that exercise also improves your health and (for at least some people) mood, which I don’t think N-back tasks do. Actually, many of the participants I’ve tested will tell you that doing the N-back is a distinctly dysphoric experience.

On a completely unrelated note, it’s kind of neat to see a journal like Nature publish what is essentially a null result. It goes to show that people do care about replication failures in some cases–namely, in those cases when the replication failure contradicts a relatively large existing literature, and is sufficiently highly powered to actually say something interesting about the likely effect sizes in question.

Owen, A. M., Hampshire, A., Grahn, J. A., Stenton, R., Dajani, S., Burns, A. S., Howard, R. J., & Ballard, C. G. (2010). Putting brain training to the test. Nature. PMID: 20407435

what’s adaptive about depression?

Jonah Lehrer has an interesting article in the NYT magazine about a recent Psych Review article by Paul Andrews and J. Anderson Thomson. The basic claim Andrews and Thomson make in their paper is that depression is “an adaptation that evolved as a response to complex problems and whose function is to minimize disruption of rumination and sustain analysis of complex problems”. Lehrer’s article is, as always, engaging, and he goes out of his way to obtain some critical perspectives from other researchers not affiliated with Andrews & Thomson’s work. It’s definitely worth a read.

In reading Lehrer’s article and the original paper, two things struck me. One is that I think Lehrer slightly exaggerates the novelty of Andrews and Thomson’s contribution. The novel suggestion of their paper isn’t that depression can be adaptive under the right circumstances (I think most people already believe that, and as Lehrer notes, the idea traces back a long way); it’s that the specific adaptive purpose of depression is to facilitate solving of complex problems. I think Andrews and Thomson’s paper received a somewhat critical reception (which Lehrer discusses) not so much because people found the suggestion that depression might be adaptive objectionable, but because there are arguably more plausible things depression could have been selected for. Lehrer mentions a few:

Other scientists, including Randolph Nesse at the University of Michigan, say that complex psychiatric disorders like depression rarely have simple evolutionary explanations. In fact, the analytic-rumination hypothesis is merely the latest attempt to explain the prevalence of depression. There is, for example, the “plea for help” theory, which suggests that depression is a way of eliciting assistance from loved ones. There’s also the “signal of defeat” hypothesis, which argues that feelings of despair after a loss in social status help prevent unnecessary attacks; we’re too busy sulking to fight back. And then there’s “depressive realism”: several studies have found that people with depression have a more accurate view of reality and are better at predicting future outcomes. While each of these speculations has scientific support, none are sufficient to explain an illness that afflicts so many people. The moral, Nesse says, is that sadness, like happiness, has many functions.

Personally, I find these other suggestions more plausible than the Andrews and Thomson story (if still not terribly compelling). There are a variety of reasons for this (see Jerry Coyne’s twin posts for some of them, along with the many excellent comments), but one pretty big one is that they’re all at least somewhat more consistent with a continuity hypothesis under which many of the selection pressures that influenced the structure of the human mind have been at work in our lineage for millions of years. That is to say, if you believe in a “signal of defeat” account, you don’t have to come up with complex explanations for why human depression is adaptive (the problem being that other mammals don’t seem to show an affinity for ruminating over complex analytical problems); you can just attribute depression to much more general selection pressures found in other animals as well.

One hypothesis I particularly like in this respect, related to the signal-of-defeat account, is that depression is essentially just a human manifestation of a general tendency toward low self-confidence and low aggression. The value of low self-confidence is pretty obvious: you don’t challenge the alpha male, so you don’t get into fights; you only chase prey you think you can safely catch; and so on. Now suppose humans inherited this basic architecture from our ancestral apes. In human societies there’s still a clear potential benefit to being subservient and non-confrontational; it’s a low-risk, low-reward strategy. If you don’t bother anyone, you’re probably not going to impress the opposite sex very much, but at least you won’t get clubbed over the head by a competitor very often. So there’s a sensible argument to be made for frequency-dependent selection for depression-related traits (the reason it’s likely to be frequency-dependent is that if you ever had a population made up entirely of self-doubting, non-aggressive individuals, being more aggressive would probably become highly advantageous, so at some point, you’d achieve a stable equilibrium).
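
For readers who like to see the frequency-dependence intuition made concrete, the standard way to formalize it is the textbook hawk-dove game under replicator dynamics. The little sketch below uses made-up payoff values and isn’t drawn from anything in the papers discussed here; it just shows how a population that starts out almost entirely non-aggressive settles at a stable mix of aggressive and non-aggressive types:

```python
# Textbook hawk-dove game under replicator dynamics: aggressive types ("hawks")
# do well when rare and poorly when common, so the population converges on a
# stable mixture. Payoff values V and C are arbitrary choices for illustration.

V = 2.0   # value of the contested resource
C = 5.0   # cost of losing an escalated fight (C > V, so pure aggression can't take over)

def expected_payoffs(p_hawk):
    """Expected payoff to each strategy, given the current fraction of hawks."""
    hawk = p_hawk * (V - C) / 2 + (1 - p_hawk) * V
    dove = (1 - p_hawk) * V / 2
    return hawk, dove

p = 0.01  # start with aggressive types very rare
for _ in range(500):
    hawk, dove = expected_payoffs(p)
    mean_fitness = p * hawk + (1 - p) * dove
    p += 0.05 * p * (hawk - mean_fitness)   # discrete replicator update

print(f"Equilibrium fraction of aggressive types: {p:.2f} (analytic value V/C = {V / C:.2f})")
```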

So where does rumination–the main focus of the Andrews and Thomson paper–come into the picture? Well, I don’t know for sure, but here’s a pretty plausible just-so story: once you evolve the capacity to reason intelligently about yourself, you now have a higher cognitive system that’s naturally going to want to understand why it feels the way it does so often. If you’re someone who feels pretty upset about things much of the time, you’re going to think about those things a lot. So… you ruminate. And that’s really all you need! Saying that depression is adaptive doesn’t require you to think of every aspect of depression (e.g., rumination) as a complex and human-specific adaptation; it seems more parsimonious to see depressive rumination as a non-adaptive by-product of a more general and (potentially) adaptive disposition to experience negative affect.  On this type of account, ruminating isn’t actually helping a depressed person solve any problems at all. In fact, you could even argue that rumination shouldn’t make you feel better, or it would defeat the very purpose of having a depressive nature in the first place. In other words, it’s entirely consistent with the basic argument that depression is adaptive under some circumstances that the very purpose of rumination might be to keep depressed people in a depressed state. I don’t have any direct evidence for this, of course; it’s a just-so story. But it’s one that is, in my opinion (a) more plausible and (b) more consistent with indirect evidence (e.g., that rumination generally doesn’t seem to make people feel better!) than the Andrews and Thomson view.

The other thing that struck me about the Andrews and Thomson paper, and to a lesser extent, Lehrer’s article, is that the focus is (intentionally) squarely on whether and why depression is adaptive from an evolutionary standpoint. But it’s not clear that the average person suffering from depression really cares, or should care, about whether their depression exists for some distant evolutionary reason. What’s much more germane to someone suffering from depression is whether their depression is actually increasing their quality of life, and in that respect, it’s pretty difficult to make a positive case. The argument that rumination is adaptive because it helps you solve complex analytical problems is only compelling if you think that those problems are really worth mulling over deeply in the first place. For most of the things that depressed people tend to ruminate over (most of which aren’t life-changing decisions, but trivial things like whether your co-workers hate you because of the unfashionable shirt you wore to work yesterday), that just doesn’t seem to be the case. So the argument becomes circular: rumination helps you solve problems that a happier person probably wouldn’t have been bothered by in the first place. Now, that isn’t to say that there aren’t some very specific environments in which depression might still be adaptive today; it’s just that there don’t seem to be very many of them. If you look at the data, it’s quite clear that, on average, depression has very negative effects. People lose friends, jobs, and the joy of life because of their depression; it’s hard to see what monumental problem-solving insight could possibly compensate for that in most cases. By way of analogy, saying that depression is adaptive because it promotes rumination seems kind of like saying that cigarettes serve an adaptive purpose because they make nicotine withdrawal go away. Well, maybe. But wouldn’t you rather not have the withdrawal symptoms to begin with?

To be clear, I’m not suggesting that we should view depression solely in pathological terms, or that we should entirely write off the possibility that there are some potentially adaptive aspects to depression (or personality traits that go along with it). Rather, the point is that, if you’re suffering from depression, it’s not clear what good it’ll do you to learn that some of your ancestors may have benefited from their depressive natures. (By the same token, you wouldn’t expect a person suffering from sickle-cell anemia to gain much comfort from learning that they carry two copies of a mutation that, in a heterozygous carrier, would confer strong resistance to malaria.) Conversely, there’s a very real danger here, in the sense that, if Andrews and Thomson are wrong about rumination being adaptive, they might be telling people it’s OK to ruminate when in fact excessive rumination could be encouraging further depression. My sense is that this is actually the received wisdom right now (i.e., much of cognitive-behavioral therapy is focused on getting depressed individuals to recognize their ruminative cycles and break out of them). So the concern is that too much publicity might be a bad thing in this case, and, far from heralding the arrival of a new perspective on the conceptualization and treatment of depression, may actually be hurting some people. Ultimately, of course, it’s an empirical matter, and certainly not one I have any conclusive answers to. But what I can quite confidently assert in the meantime is that the Lehrer article is an enjoyable read, so long as you read it with a healthy dose of skepticism.

Andrews, P., & Thomson, J. (2009). The bright side of being blue: Depression as an adaptation for analyzing complex problems. Psychological Review, 116(3), 620-654. DOI: 10.1037/a0016242

internet use causes depression! or not.

I have a policy of not saying negative things about people (or places, or things) on this blog, and I think I’ve generally been pretty good about adhering to that policy. But I also think it’s important for scientists to speak up in cases where journalists or other scientists misrepresent scientific research in a way that could have a potentially large impact on people’s behavior, and this is one of those cases. All day long, media outlets have been full of reports about a new study that purportedly reveals that the internet–that most faithful of friends, always just a click away with its soothing, warm embrace–has a dark side: using it makes you depressed!

In fairness, most of the stories have been careful to note that the study only “links” heavy internet use to depression, without necessarily implying that internet use causes depression. And the authors acknowledge that point themselves:

“While many of us use the Internet to pay bills, shop and send emails, there is a small subset of the population who find it hard to control how much time they spend online, to the point where it interferes with their daily activities,” said researcher Dr. Catriona Morrison, of the University of Leeds, in a statement. “Our research indicates that excessive Internet use is associated with depression, but what we don’t know is which comes first. Are depressed people drawn to the Internet or does the Internet cause depression?”

So you might think all’s well in the world of science and science journalism. But in other places, the study’s authors weren’t nearly so circumspect. For example, the authors suggest that 1.2% of the population can be considered addicted to the internet–a rate they claim is double that of compulsive gambling; and they suggest that their results “feed the public speculation that overengagement in websites that serve/replace a social function might be linked to maladaptive psychological functioning,” and “add weight to the recent suggestion that IA should be taken seriously as a distinct psychiatric construct.”

These are pretty strong claims; if the study’s findings are to be believed, we should at least be seriously considering the possibility that using the internet is making some of us depressed. At worst, we should be diagnosing people with internet addiction and doing… well, presumably something to treat them.

The trouble is that it’s not at all clear that the study’s findings should be believed. Or at least, it’s not clear that they really support any of the statements made above.

Let’s start with what the study (note: restricted access) actually shows. The authors, Catriona Morrison and Helen Gore (M&G), surveyed 1,319 subjects via UK-based social networking sites. They had participants fill out three self-report measures: the Internet Addiction Test (IAT), which measures dissatisfaction with one’s internet usage; the Internet Function Questionnaire, which asks respondents to indicate the relative proportion of time they spend on different internet activities (e.g., e-mail, social networking, porn, etc.); and the Beck Depression Inventory (BDI), a very widely-used measure of depression.

M&G identify a number of findings, three of which appear to support most of their conclusions. First, they report a very strong positive correlation (r = .49) between internet addiction and depression scores; second, they identify a small group of 18 subjects (1.2%) who they argue qualify as internet addicts (IA group) based on their scores on the IAT; and third, they suggest that people who used the internet more heavily “spent proportionately more time on online gaming sites, sexually gratifying websites, browsing, online communities and chat sites.”

These findings may sound compelling, but there are a number of methodological shortcomings of the study that make them very difficult to interpret in any meaningful way. As far as I can tell, none of these concerns are addressed in the paper:

First, participants were recruited online, via social networking sites. This introduces a huge selection bias: you can’t expect to obtain accurate estimates of how much, and how adaptively, people use the internet by sampling only from the population of internet users! It’s the equivalent of trying to establish cell phone usage patterns by randomly dialing only land-line numbers. Not a very good idea. And note that, not only could the study not reach people who don’t use the internet, but it was presumably also more likely to oversample from heavy internet users. The more time a person spends online, the greater the chance they’d happen to run into the authors’ recruitment ad. People who only check their email a couple of times a week would be very unlikely to participate in the study. So the bottom line is, the 1.2% figure the authors arrive at is almost certainly a gross overestimate. The true proportion of people who meet the authors’ criteria for internet addiction is probably much lower. It’s hard to believe the authors weren’t aware of the issue of selection bias, and the massive problem it presents for their estimates, yet they failed to mention it anywhere in their paper.
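
To see how badly online-only recruitment can distort a prevalence estimate, here’s a toy simulation. The distribution of time online, the participation probabilities, and the “heavy use” threshold are all invented purely for illustration; this is not a model of M&G’s actual sampling procedure:

```python
# Toy illustration of the selection-bias problem: if the chance of ever seeing
# the recruitment ad scales with time spent online, heavy users are oversampled
# and any prevalence estimate based on the recruited sample is inflated.
# All distributions and thresholds are invented; this is not a model of M&G's study.
import random

random.seed(1)

# Hypothetical population: hours online per week, exponentially distributed (mean = 10).
population = [random.expovariate(1 / 10) for _ in range(100_000)]
threshold = 40  # arbitrary definition of "very heavy use", in hours per week

true_prevalence = sum(h > threshold for h in population) / len(population)

# Online recruitment: probability of participating grows with hours spent online.
recruited = [h for h in population if random.random() < min(h / 80, 1.0)]
sample_prevalence = sum(h > threshold for h in recruited) / len(recruited)

print(f"True prevalence of heavy use:   {true_prevalence:.3f}")
print(f"Prevalence in recruited sample: {sample_prevalence:.3f}")  # several times higher
```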

Second, the cut-off score for being placed in the IA group appears to be completely arbitrary. The Internet Addiction Test itself was developed by Kimberly Young in a 1998 book entitled “Caught in the Net: How to Recognize the Signs of Internet Addiction–and a Winning Strategy to Recovery”. The test was introduced, as far as I can tell (I haven’t read the entire book, just skimmed it in Google Books), with no real psychometric validation. The cut-off of 80 points out of a maximum possible score of 100 as a threshold for addiction appears to be entirely arbitrary (in fact, in Young’s book, she defines the cut-off as 70; for reasons that are unclear, M&G adopted a cut-off of 80). That is, it’s not like Young conducted extensive empirical analysis and determined that people with scores of X or above were functionally impaired in a way that people with scores below X weren’t; by all appearances, she simply picked numerically convenient cut-offs (20 – 39 is average; 40 – 69 indicates frequent problems; and 70+ basically means the internet is destroying your life). Any small change in the numerical cut-off would have translated into a large change in the proportion of people in M&G’s sample who met criteria for internet addiction, making the 1.2% figure seem even more arbitrary.
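
And to see how much a headline prevalence figure depends on where you draw the line, here’s a quick sketch. The score distribution is fabricated (M&G don’t report theirs in enough detail to reuse), so treat the specific percentages as purely illustrative; the point is just how quickly the estimate moves as the cut-off shifts:

```python
# How sensitive is a prevalence estimate to the choice of cut-off score?
# The score distribution below is fabricated (right-skewed, roughly 20-100)
# just to make the point; it is not M&G's or Young's actual data.
import random

random.seed(1)

scores = [min(100, 20 + random.gammavariate(2.0, 10.0)) for _ in range(10_000)]

for cutoff in (70, 75, 80, 85):
    prevalence = sum(s >= cutoff for s in scores) / len(scores)
    print(f"cut-off {cutoff}: {prevalence:.1%} of the sample classified as 'addicted'")
```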

Third, M&G claim that the Internet Function Questionnaire they used asks respondents to indicate the proportion of time on the internet that they spend on each of several different activities. For example, given the question “How much of your time online do you spend on e-mail?”, your options would be 0-20%, 21-40%, and so on. You would presume that all the different activities should sum to 100%; after all, you can’t really spend 80% of your online time gaming, and then another 80% looking at porn–unless you’re either a very talented gamer, or have an interesting taste in “games”. Yet, when M&G report absolute numbers for the different activities in tables, they’re not given in percentages at all. Instead, one of the table captions indicates that the values are actually coded on a 6-point Likert scale ranging from “rarely/never” to “very frequently”. Hopefully you can see why this is a problem: if you claim (as M&G do) that your results reflect the relative proportion of time that people spend on different activities, you shouldn’t be allowing people to essentially say anything they like for each activity. Given that people with high IA scores report spending more time overall than they’d like online, is it any surprise if they also report spending more time on individual online activities? The claim that high-IA scorers spend “proportionately more” time on some activities just doesn’t seem to be true–at least, not based on the data M&G report. This might also explain how it could be that IA scores correlated positively with nearly all individual activities. That simply couldn’t be true for real proportions (if you spend proportionately more time on e-mail, you must be spending proportionately less time somewhere else), but it makes perfect sense if the response scale is actually anchored with vague terms like “rarely” and “frequently”.
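
The constraint at work here is worth spelling out: if the activity measures really were proportions summing to 100%, their covariances with any other variable (IAT scores, say) would have to sum to exactly zero, so they couldn’t all be positive. Here’s a quick sanity check with randomly generated data (all numbers invented for illustration):

```python
# Sanity check of the compositional-data point: if the activity measures were
# genuine proportions summing to 100%, their covariances with any other score
# would have to sum to zero, so they could not all be positive.
# All data below are randomly generated for illustration.
import random

random.seed(1)

n_respondents, n_activities = 1_000, 6

proportions = []
for _ in range(n_respondents):
    raw = [random.random() for _ in range(n_activities)]
    total = sum(raw)
    proportions.append([r / total for r in raw])   # each row sums to 1

iat = [random.gauss(50, 15) for _ in range(n_respondents)]
iat_mean = sum(iat) / n_respondents

covariances = []
for j in range(n_activities):
    col = [row[j] for row in proportions]
    col_mean = sum(col) / n_respondents
    cov = sum((x - iat_mean) * (p - col_mean) for x, p in zip(iat, col)) / n_respondents
    covariances.append(cov)

print("Covariance of IAT score with each activity proportion:",
      ["%+.4f" % c for c in covariances])
print("Sum of covariances (necessarily ~0): %+.6f" % sum(covariances))
```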

Fourth, M&G consider two possibilities for the positive correlation between IAT and depression scores: (a) increased internet use causes depression, and (b) depression causes increased internet use. But there’s a third, and to my mind far more plausible, explanation: people who are depressed tend to have more negative self-perceptions, and are much more likely to endorse virtually any question that asks about dissatisfaction with one’s own behavior. Here are a couple of examples of questions on the IAT: “How often do you fear that life without the Internet would be boring, empty, and joyless?” “How often do you try to cut down the amount of time you spend on-line and fail?” Notice that there are really two components to these kinds of questions. One component is internet-specific: to what extent are people specifically concerned about their behavior online, versus in other domains? The other component is a general hedonic one, and has to do with how dissatisfied you are with stuff in general. Now, is there any doubt that, other things being equal, someone who’s depressed is going to be more likely to endorse an item that asks how often they fail at something? Or how often their life feels empty and joyless–irrespective of cause? No, of course not. Depressive people tend to ruminate and worry about all sorts of things. No doubt internet usage is one of those things, but that hardly makes it special or interesting. I’d be willing to bet money that if you created a Shoelace Tying Questionnaire that had questions like “How often do you worry about your ability to tie your shoelaces securely?” and “How often do you try to keep your shoelaces from coming undone and fail?”, you’d also get a positive correlation with BDI scores. Basically, depression and trait negative affect tend to correlate positively with virtually every measure that has a major evaluative component. That’s not news. To the contrary, given the types of questions on the IAT, it would have been astonishing if there wasn’t a robust positive correlation with depression.
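
Here’s a toy simulation of that artifact: a single latent “negative affect” trait feeds into both questionnaires, with no internet-specific link between them at all, and a substantial IAT-BDI correlation falls out anyway. The loadings and noise levels are made up; the point is only that the correlation doesn’t require any real connection between internet use and depression:

```python
# Toy demonstration that a shared "negative affect" component can produce a
# sizeable IAT-BDI correlation even when there is no internet-specific link.
# Loadings and noise levels are invented purely for illustration.
import random

random.seed(1)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n = 5_000
negative_affect = [random.gauss(0, 1) for _ in range(n)]   # latent trait

# Each questionnaire loads on general negative affect plus its own noise;
# the two measures share nothing else.
bdi = [0.8 * trait + random.gauss(0, 1) for trait in negative_affect]
iat = [0.8 * trait + random.gauss(0, 1) for trait in negative_affect]

print(f"Simulated IAT-BDI correlation: {pearson(bdi, iat):.2f}")   # ~.4, despite no real link
```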

Fifth, and related to the previous point, no evidence is ever actually provided that people with high IAT scores differ in their objective behavior from those with low scores. Remember, this is all based on self-report. And not just self-report, but vague self-report. As far as I can tell, M&G never asked respondents to estimate how much time they spent online in a given week. So it’s entirely possible that people who report spending too much time online don’t actually spend much more time online than anyone else; they just feel that way (again, possibly because of a generally negative disposition). There’s actually some support for this idea: a 2004 study that sought to validate the IAT psychometrically found only a .22 correlation between IAT scores and self-reported time spent online. Now, a .22 correlation is perfectly meaningful, and it suggests that people who feel they spend too much time online also estimate that they really do spend more time online (though, again, bias is a possibility here too). But it’s a much smaller correlation than the one between IAT scores and depression, which fits with the above idea that there may not be any real “link” between internet use and depression above and beyond the fact that depressed individuals are more likely to endorse negatively-worded items.

Finally, even if you ignore the above considerations, and decide to conclude that there is in fact a non-artifactual correlation between depression and internet use, there’s really no reason you would conclude that that’s a bad thing (which M&G hedge on, and many of the news articles haven’t hesitated to play up). It’s entirely plausible that the reason depressed individuals might spend more time online is because it’s an effective form of self-medication. If you’re someone who has trouble mustering up the energy to engage with the outside world, or someone who’s socially inhibited, online communities might provide you with a way to fulfill your social needs in a way that you would otherwise not have been able to. So it’s quite conceivable that heavy internet use makes people less depressed, not more; it’s just that the people who are more likely to use the internet heavily are more depressed to begin with. I’m not suggesting that this is in fact true (I find the artifactual explanation for the IAT-BDI correlation suggested above much more plausible), but just that the so-called “dark side” of the internet could actually be a very good thing.

In sum, what can we learn from M&G’s paper? Not that much. To be fair, I don’t necessarily think it’s a terrible paper; it has its limitations, but every paper does. The problem isn’t so much that the paper is bad; it’s that the findings it contains were blown entirely out of proportion, and twisted to support headlines (most of them involving the phrase “The Dark Side”) that they couldn’t possibly support. The internet may or may not cause depression (probably not), but you’re not going to get much traction on that question by polling a sample of internet respondents, using measures that have a conceptual overlap with depression, and defining groups based on arbitrary cut-offs. The jury is still out, of course, but these findings by themselves don’t really give us any reason to reconsider or try to change our online behavior.

Morrison, C., & Gore, H. (2010). The relationship between excessive internet use and depression: A questionnaire-based study of 1,319 young people and adults. Psychopathology, 43(2), 121-126. DOI: 10.1159/000277001