the New York Times blows it big time on brain imaging

The New York Times has a terrible, terrible Op-Ed piece today by Martin Lindstrom (who I’m not going to link to, because I don’t want to throw any more bones his way). If you believe Lindstrom, you don’t just like your iPhone a lot; you love it. Literally. And the reason you love it, shockingly, is your brain:

Earlier this year, I carried out an fMRI experiment to find out whether iPhones were really, truly addictive, no less so than alcohol, cocaine, shopping or video games. In conjunction with the San Diego-based firm MindSign Neuromarketing, I enlisted eight men and eight women between the ages of 18 and 25. Our 16 subjects were exposed separately to audio and to video of a ringing and vibrating iPhone.

But most striking of all was the flurry of activation in the insular cortex of the brain, which is associated with feelings of love and compassion. The subjects’ brains responded to the sound of their phones as they would respond to the presence or proximity of a girlfriend, boyfriend or family member.

In short, the subjects didn’t demonstrate the classic brain-based signs of addiction. Instead, they loved their iPhones.

There’s so much wrong with just these three short paragraphs (to say nothing of the rest of the article, which features plenty of other whoppers) that it’s hard to know where to begin. But let’s try. Take first the central premise–that an fMRI experiment could help determine whether iPhones are no less addictive than alcohol or cocaine. The tacit assumption here is that all the behavioral evidence you could muster–say, from people’s reports about how they use their iPhones, or clinicians’ observations about how iPhones affect their users–isn’t sufficient to make that determination; to “really, truly” know if something’s addictive, you need to look at what the brain is doing when people think about their iPhones. This idea is absurd inasmuch as addiction is defined on the basis of its behavioral consequences, not (right now, anyway) by the presence or absence of some biomarker. What makes someone an alcoholic is the fact that they’re dependent on alcohol, have trouble going without it, find that their alcohol use interferes with multiple aspects of their day-to-day life, and generally suffer functional impairment because of it–not the fact that their brain lights up when they look at pictures of Johnnie Walker Red. If someone couldn’t stop drinking–to the point where they lost their job, family, and friends–but their brain failed to display a putative biomarker for addiction, it would be strange indeed to say “well, you show all the signs, but I guess you’re not really addicted to alcohol after all.”

Now, there may come a day (and it will be a great one) when we have biomarkers sufficiently accurate that they can stand in for the much more tedious process of diagnosing someone’s addiction the conventional way. But that day is, to put it gently, a long way off. Right now, if you want to know if iPhones are addictive, the best way to do that is to, well, spend some time observing and interviewing iPhone users (and some quantitative analysis would be helpful).

Of course, it’s not clear what Lindstrom thinks an appropriate biomarker for addiction would be in any case. Presumably it would have something to do with the reward system; but what? Suppose Lindstrom had seen robust activation in the ventral striatum–a critical component of the brain’s reward system–when participants gazed upon the iPhone: what then? Would this have implied people are addicted to iPhones? But people also show striatal activity when gazing on food, money, beautiful faces, and any number of other stimuli. Does that mean the average person is addicted to all of the above? A marker of pleasure or reward, maybe (though even that’s not certain), but addiction? How could a single fMRI experiment with 16 subjects viewing pictures of iPhones confirm or disconfirm the presence of addiction? Lindstrom doesn’t say. I suppose he has good reason not to say: if he really did have access to an accurate fMRI-based biomarker for addiction, he’d be in a position to make millions (billions?) off the technology. To date, no one else has come close to identifying a clinically accurate fMRI biomarker for any kind of addiction (for more technical readers, I’m talking here about cross-validated methods that have both sensitivity and specificity comparable to traditional approaches when applied to new subjects–not individual studies that claim 90% within-sample classification accuracy based on simple regression models). So we should, to put it mildly, be very skeptical that Lindstrom’s study was ever in a position to do what he says it was designed to do.

We should also ask all sorts of salient and important questions about who the people are who are supposedly in love with their iPhones. Who’s the “You” in the “You Love Your iPhone” of the title? We don’t know, because we don’t know who the participants in Lindstrom’s sample were, aside from the fact that they were eight men and eight women aged 18 to 25. But we’d like to know some other important things. For instance, were they selected for specific characteristics? Were they, say, already avid iPhone users? Did they report loving, or being addicted to, their iPhones? If so, would it surprise us that people chosen for their close attachment to their iPhones also showed brain activity patterns typical of close attachment? (Which, incidentally, they actually don’t–but more on that below.) And if not, are we to believe that the average person pulled off the street–who probably has limited experience with iPhones–really responds to the sound of their phones “as they would respond to the presence or proximity of a girlfriend, boyfriend or family member”? Is the takeaway message of Lindstrom’s Op-Ed that iPhones are actually people, as far as our brains are concerned?

In fairness, space in the Times is limited, so maybe it’s not fair to demand this level of detail in the Op-Ed itself. But the bigger problem is that we have no way of evaluating Lindstrom’s claims, period, because (as far as I can tell) his study hasn’t been published or peer-reviewed anywhere. Presumably, it’s proprietary information that belongs to the neuromarketing firm in question. Which is to say, the NYT is basically giving Lindstrom license to talk freely about scientific-sounding findings that can’t actually be independently confirmed, disputed, or critiqued by members of the scientific community with expertise in the very methods Lindstrom is applying (expertise which, one might add, he himself lacks). For all we know, he could have made everything up. To be clear, I don’t really think he did make everything up–but surely, somewhere in the editorial process someone at the NYT should have stepped in and said, “hey, these are pretty strong scientific claims; is there any way we can make your results–on which your whole article hangs–available for other experts to examine?”

This brings us to what might be the biggest whopper of all, and the real driver of the article title: the claim that “most striking of all was the flurry of activation in the insular cortex of the brain, which is associated with feelings of love and compassion”. Russ Poldrack already tore this statement to shreds earlier this morning:

Insular cortex may well be associated with feelings of love and compassion, but this hardly proves that we are in love with our iPhones.  In Tal Yarkoni’s recent paper in Nature Methods, we found that the anterior insula was one of the most highly activated part of the brain, showing activation in nearly 1/3 of all imaging studies!  Further, the well-known studies of love by Helen Fisher and colleagues don’t even show activation in the insula related to love, but instead in classic reward system areas.  So far as I can tell, this particular reverse inference was simply fabricated from whole cloth.  I would have hoped that the NY Times would have learned its lesson from the last episode.

But you don’t have to take Russ’s word for it; if you surf for a few terms on our Neurosynth website, making sure to select “forward inference” under image type, you’ll notice that the insula shows up for almost everything. That’s not an accident; it’s because the insula (or at least the anterior part of the insula) plays a very broad role in goal-directed cognition. It really is activated when you’re doing almost anything that involves, say, following instructions an experimenter gave you, or attending to external stimuli, or mulling over something salient in the environment. You can see this pretty clearly in this modified figure from our Nature Methods paper (I’ve circled the right insula):

[Figure: proportion of studies reporting activation at each voxel (modified from our Nature Methods paper), with the right insula circled]

The insula is one of a few ‘hotspots’ where activation is reported very frequently in neuroimaging articles (the other major one being the dorsal medial frontal cortex). So, by definition, there can’t be all that much specificity to what the insula is doing, since it pops up so often. To put it differently, as Russ and others have repeatedly pointed out, the fact that a given region activates when people are in a particular psychological state (e.g., love) doesn’t give you license to conclude that that state is present just because you see activity in the region in question. If language, working memory, physical pain, anger, visual perception, motor sequencing, and memory retrieval all activate the insula, then knowing that the insula is active is of very little diagnostic value. That’s not to say that some psychological states might not be more strongly associated with insula activity (again, you can see this on Neurosynth if you switch the image type to ‘reverse inference’ and browse around); it’s just that, probabilistically speaking, the mere fact that the insula is active gives you very little basis for saying anything concrete about what people are experiencing.
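
To put a rough number on that intuition, here is a toy reverse-inference calculation. Every value except the roughly one-in-three base rate of insula activation mentioned above is invented purely for illustration; the point is only that when a region activates for nearly everything, seeing it activate tells you very little about any particular state.

```python
# Toy reverse inference: P(love | insula active), with made-up numbers.
p_insula_given_love = 0.50   # hypothetical: half of tasks/studies involving love activate the insula
p_love = 0.03                # hypothetical: prior probability that a given task involves love
p_insula = 0.33              # approximate base rate of anterior insula activation across studies

p_love_given_insula = p_insula_given_love * p_love / p_insula
print(f"P(love | insula active) = {p_love_given_insula:.2f}")   # about 0.05, hardly a smoking gun
```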

In fact, to account for Lindstrom’s findings, you don’t have to appeal to love or addiction at all. There’s a much simpler way to explain why seeing or hearing an iPhone might elicit insula activation. For most people, the onset of visual or auditory stimulation is a salient event that causes redirection of attention to the stimulated channel. I’d be pretty surprised, actually, if you could present any picture or sound to participants in an fMRI scanner and not elicit robust insula activity. Orienting and sustaining attention to salient things seems to be a big part of what the anterior insula is doing (whether or not that’s ultimately its ‘core’ function). So the most appropriate conclusion to draw from the fact that viewing iPhone pictures produces increased insula activity is something vague like “people are paying more attention to iPhones”, or “iPhones are particularly salient and interesting objects to humans living in 2011.” Not something like “no, really, you love your iPhone!”

In sum, the NYT screwed up. Lindstrom appears to have a habit of making overblown claims about neuroimaging evidence, so it’s not surprising he would write this type of piece; but the NYT editorial staff is supposedly there to filter out precisely this kind of pseudoscientific advertorial. And they screwed up. It’s a particularly big screw-up given that (a) as of right now, Lindstrom’s Op-Ed is the single most emailed article on the NYT site, and (b) this incident almost perfectly recapitulates another NYT article 4 years ago in which some neuroscientists and neuromarketers wrote a grossly overblown Op-Ed claiming to be able to infer, in detail, people’s opinions about presidential candidates. That time, Russ Poldrack and a bunch of other big names in cognitive neuroscience wrote a concise rebuttal that appeared in the NYT (but unfortunately, isn’t linked to from the original Op-Ed, so anyone who stumbles across the original now has no way of knowing how ridiculous it is). One hopes the NYT follows up in similar fashion this time around. They certainly owe it to their readers–some of whom, if you believe Lindstrom, are now in danger of dumping their current partners for their iPhones.

h/t: Molly Crockett

does functional specialization exist in the language system?

One of the central questions in cognitive neuroscience–according to some people, at least–is how selective different chunks of cortex are for specific cognitive functions. The paradigmatic examples of functional selectivity are pretty much all located in sensory cortical regions or adjacent association cortices. For instance, the fusiform face area (FFA) is so named because it (allegedly) responds selectively to faces but not to other stimuli. Other regions with varying selectivity profiles are similarly named: the visual word form area (VWFA), parahippocampal place area (PPA), extrastriate body area (EBA), and so on.

In a recent review paper, Fedorenko and Kanwisher (2009) sought to apply insights from the study of functionally selective visual regions to the study of language. They posed the following question with respect to the neuroimaging of language in the title of their paper: Why hasn’t a clearer picture emerged? And they gave the following answer: it’s because brains differ from one another, stupid.

Admittedly, I’m paraphrasing; they don’t use exactly those words. But the basic point they make is that it’s difficult to identify functionally selective regions when you’re averaging over a bunch of very different brains. And the solution they propose–again, imported from the study of visual areas–is to identify potentially selective language regions-of-interest (ROIs) on a subject-specific basis rather than relying on group-level analyses.

The Fedorenko and Kanwisher paper apparently didn’t please Greg Hickok of Talking Brains, who’s done a lot of very elegant work on the neurobiology of language.  A summary of Hickok’s take:

What I found a bit on the irritating side though was the extremely dim and distressingly myopic view of progress in the field of the neural basis of language.

He objects to Fedorenko and Kanwisher on several grounds, and the post is well worth reading. But since I’m very tired, I’ll just summarize his points as follows:

  • There’s more functional specialization in the language system than F&K give the field credit for
  • The use of subject-specific analyses in the domain of language isn’t new, and many researchers (including Hickok) have used procedures similar to those F&K recommend in the past
  • Functional selectivity is not necessarily a criterion we should care about all that much anyway

As you might expect, F&K disagree with Hickok on these points, and Hickok was kind enough to post their response. He then responded to their response in the comments (which are also worth reading), which in turn spawned a back-and-forth with F&K, a cameo by Brad Buchsbaum (who posted his own excellent thoughts on the matter here), and eventually, an intervention by a team of professional arbitrators. Okay, I made that last bit up; it was a very civil disagreement, and is exactly what scientific debates on the internet should look like, in my opinion.

Anyway, rather than revisit the entire thread, which you can read for yourself, I’ll just summarize my thoughts:

  • On the whole, I think my view lines up pretty closely with Hickok’s and Buchsbaum’s. Although I’m very far from an expert on the neurobiology of language (is there a word in English for someone who’s the diametric opposite of an expert–i.e., someone who consistently and confidently asserts exactly the wrong thing? Cause that’s what I am), I agree with Hickok’s argument that the temporal poles show a response profile that looks suspiciously like sentence- or narrative-specific processing (I have a paper on the neural mechanisms of narrative comprehension that supports that claim to some extent), and think F&K’s review of the literature is probably not as balanced as it could have been.
  • More generally, I agree with Hickok that demonstrating functional specialization isn’t necessarily that important to the study of language (or most other domains). This seems to be a major point of contention for F&K, but I don’t think they make a very strong case for their view. They suggest that they “are not sure what other goals (besides understanding a region’s computations) could drive studies aimed at understanding how functionally specialized a region is,” which I think is reasonable, but affirms the consequent. Hickok isn’t saying there’s no reason to search for functional specialization in the F&K sense; as I read him, he’s simply saying that you can study the nature of neural computation in lots of interesting ways that don’t require you to demonstrate functional specialization to the degree F&K seem to require. Seems hard to disagree with that.
  • Buchsbaum points out that it’s questionable whether there are any brain regions that meet the criteria F&K set out for functional specialization–namely that “A brain region R is specialized for cognitive function x if this region (i) is engaged in tasks that rely on cognitive function x, and (ii) is not engaged in tasks that do not rely on cognitive function x.” Buchsbaum and Hickok both point out that the two examples F&K give of putatively specialized regions (the FFA and the temporo-parietal junction, which some people believe is selectively involved in theory of mind) are hardly uncontroversial. Plenty of people have argued that the FFA isn’t really selective to faces, and even more people have argued that the TPJ isn’t selective to theory of mind. As far as I can tell, F&K don’t really address this issue in the comments. They do refer to a recent paper of Kanwisher’s that discusses the evidence for functional specificity in the FFA, but I’m not sure the argument made in that paper is itself uncontroversial, and in any case, Kanwisher does concede that there’s good evidence for at least some representation of non-preferred stimuli (i.e., non-faces in the FFA). In any case, the central question here is whether or not F&K really unequivocally believe that FFA and TPJ aren’t engaged by any tasks that don’t involve face or theory of mind processing. If not, then it’s unfair to demand or expect the same of regions implicated in language.
  • Although I think there’s a good deal to be said for subject-specific analyses, I’m not as sanguine as F&K that a subject-specific approach offers a remedy to the problems that they perceive afflict the study of the neural mechanisms of language. While there’s no denying that group analyses suffer from a number of limitations, subject-specific analyses have their own weaknesses, which F&K don’t really mention in their paper. One is that such analyses typically require the assumption that two clusters located in slightly different places for different subjects must be carrying out the same cognitive operations if they respond similarly to a localizer task. That’s a very strong assumption for which there’s very little evidence (at least in the language domain)–especially because the localizer task F&K promote in this paper involves a rather strong manipulation that may confound several different aspects of language processing.
    Another problem is that it’s not at all obvious how you determine which regions are the “same” (in their 2010 paper, F&K argue for an algorithmic parcellation approach, but the fact that you get sensible-looking results is no guarantee that your parcellation actually reflects meaningful functional divisions in individual subjects). And yet another is that serious statistical problems can arise in cases where one or more subjects fail to show activation in a putative region (which is generally the norm rather than the exception). Say you have 25 subjects in your sample, and 7 don’t show activation anywhere in a region that can broadly be called Broca’s area. What do you do? You can’t just throw those subjects out of the analysis, because that would grossly and misleadingly inflate your effect sizes. Conversely, you can’t just identify any old region that does activate and lump it in with the regions identified in all the other subjects. This is a very serious problem, but it’s one that group analyses, for all their weaknesses, don’t have to contend with.

Disagreements aside, I think it’s really great to see serious scientific discussion taking place in this type of forum. In principle, this is the kind of debate that should be resolved (or not) in the peer-reviewed literature; in practice, peer review is slow, writing full-blown articles takes time, and journal space is limited. So I think blogs have a really important role to play in scientific communication, and frankly, I envy Hickok and Poeppel for the excellent discussion they consistently manage to stimulate over at Talking Brains!

fourteen questions about selection bias, circularity, nonindependence, etc.

A new paper published online this week in the Journal of Cerebral Blood Flow & Metabolism discusses the infamous problem of circular analysis in fMRI research. The paper is aptly titled “Everything you never wanted to know about circular analysis, but were afraid to ask,” and is authored by several well-known biostatisticians and cognitive neuroscientists–to wit, Niko Kriegeskorte, Martin Lindquist, Tom Nichols, Russ Poldrack, and Ed Vul. The paper has an interesting format, and one that I really like: it’s set up as a series of fourteen questions related to circular analysis, and each author answers each question in 100 words or less.

I won’t bother going over the gist of the paper, because the Neuroskeptic already beat me to the punch in an excellent post a couple of days ago (actually, that’s how I found out about the paper); instead,  I’ll just give my own answers to the same set of questions raised in the paper. And since blog posts don’t have the same length constraints as NPG journals, I’m going to be characteristically long-winded and ignore the 100 word limit…

(1) Is circular analysis a problem in systems and cognitive neuroscience?

Yes, it’s a huge problem. That said, I think the term ‘circular’ is somewhat misleading here, because it has the connotation that an analysis is completely vacuous. Truly circular analyses–i.e., those where an initial analysis is performed, and the researchers then conduct a “follow-up” analysis that literally adds no new information–are relatively rare in fMRI research. Much more common are cases where there’s some dependency between two different analyses, but the second one still adds some novel information.

(2) How widespread are slight distortions and serious errors caused by circularity in the neuroscience literature?

I think Nichols sums it up nicely here:

TN: False positives due to circularity are minimal; biased estimates of effect size are common. False positives due to brushing off the multiple testing problem (e.g., ‘P<0.001 uncorrected’ and crossing your fingers) remain pervasive.

The only thing I’d add to this is that the bias in effect size estimates is not only common, but, in most cases, is probably very large.

(3) Are circular estimates useful measures of effect size?

Yes and no. They’re less useful than unbiased measures of effect size. But given that the vast majority of effects reported in whole-brain fMRI analyses (and, more generally, analyses in most fields) are likely to be inflated to some extent, the only way to ensure we don’t rely on circular estimates of effect size would be to disregard effect size estimates entirely, which doesn’t seem prudent.

(4) Should circular estimates of effect size be presented in papers and, if so, how?

Yes, because the only principled alternatives are to either (a) never report effect sizes (which seems much too drastic), or (b) report the results of every single test performed, irrespective of the result (i.e., to never give selection bias an opportunity to rear its head). Neither of these is reasonable. We should generally report effect sizes for all key effects, but they should be accompanied by appropriate confidence intervals. As Lindquist notes:

In general, it may be useful to present any effect size estimate as confidence intervals, so that readers can see for themselves how much uncertainty is related to the point estimate.

A key point I’d add is that the width of the reported CIs should match the threshold used to identify results in the first place. In other words, if you conduct a whole brain analysis at p < .001, you should report all resulting effects with 99.9% CIs, and not 95% CIs. I think this simple step would go a considerable ways towards conveying the true uncertainty surrounding most point estimates in fMRI studies.
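
Here is a minimal sketch of what that suggestion amounts to for a simple brain-behavior correlation. The numbers are hypothetical (r = .68 with 20 subjects is roughly what it takes to clear p < .001), and the Fisher z interval is just one convenient way to get a CI for a correlation; the thing to notice is how much wider the threshold-matched interval is.

```python
# Sketch: report a CI whose width matches the selection threshold.
# If a voxel was selected at p < .001 (two-tailed), report a 99.9% CI rather than a 95% one.
import numpy as np
from scipy import stats

def correlation_ci(r, n, alpha):
    """CI for a Pearson correlation via the Fisher z transform."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

r, n = 0.68, 20   # hypothetical correlation that just clears p < .001 with 20 subjects
lo95, hi95 = correlation_ci(r, n, 0.05)
lo999, hi999 = correlation_ci(r, n, 0.001)
print(f"95% CI:   ({lo95:.2f}, {hi95:.2f})")
print(f"99.9% CI: ({lo999:.2f}, {hi999:.2f})")   # the lower bound hovers near zero
```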

(5) Are effect size estimates important/useful for neuroscience research, and why?

I think my view here is closest to Ed Vul’s:

Yes, very much so. Null-hypothesis testing is insufficient for most goals of neuroscience because it can only indicate that a brain region is involved to some nonzero degree in some task contrast. This is likely to be true of most combinations of task contrasts and brain regions when measured with sufficient power.

I’d go further than Ed does though, and say that in a sense, effect size estimates are the only things that matter. As Ed notes, there are few if any cases where it’s plausible to suppose that the effect of some manipulation on brain activation is really zero. The brain is a very dense causal system–almost any change in one variable is going to have downstream effects on many, and perhaps most, others. So the real question we care about is almost never “is there or isn’t there an effect,” it’s whether there’s an effect that’s big enough to actually care about. (This problem isn’t specific to fMRI research, of course; it’s been a persistent source of criticism of null hypothesis significance testing for many decades.)

People sometimes try to deflect this concern by saying that they’re not trying to make any claims about how big an effect is, but only about whether or not one can reject the null–i.e., whether any kind of effect is present or not. I’ve never found this argument convincing, because whether or not you own up to it, you’re always making an effect size claim whenever you conduct a hypothesis test. Testing against a null of zero is equivalent to saying that you care about any effect that isn’t exactly zero, which is simply false. No one in fMRI research cares about r or d values of 0.0001, yet we routinely conduct tests whose results could be consistent with those types of effect sizes.

Since we’re always making implicit claims about effect sizes when we conduct hypothesis tests, we may as well make them explicit so that they can be evaluated properly. If you only care about correlations greater than 0.1, there’s no sense in hiding that fact; why not explicitly test against a null range of -0.1 to 0.1, instead of a meaningless null of zero?
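
For what it's worth, testing against a non-nil null doesn't require exotic machinery. Here is a minimal sketch using the Fisher z transform to ask whether a correlation exceeds a smallest effect of interest of .1; the observed r and sample size are invented, and for simplicity it only handles the one-sided, positive-correlation case.

```python
# Sketch: test against a "smallest effect size of interest" instead of a nil null.
# H0 here is rho <= 0.10 rather than rho = 0; assumes a positive observed r.
import numpy as np
from scipy import stats

def min_effect_test(r, n, rho0=0.10):
    """One-sided test of rho > rho0 via the Fisher z transform."""
    z_obs = (np.arctanh(r) - np.arctanh(rho0)) * np.sqrt(n - 3)
    return stats.norm.sf(z_obs)   # p-value for H1: rho > rho0

r, n = 0.30, 100   # hypothetical voxel-wise brain-behavior correlation
print(f"p vs. rho = 0:    {min_effect_test(r, n, 0.0):.4f}")
print(f"p vs. rho = 0.10: {min_effect_test(r, n, 0.10):.4f}")   # a much harder claim to reject
```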

(6) What is the best way to accurately estimate effect sizes from imaging data?

Use large samples, conduct multivariate analyses, report results comprehensively, use meta-analysis… I don’t think there’s any single way to ensure accurate effect size estimates, but plenty of things help. Maybe the most general recommendation is to ensure adequate power (see below), which will naturally minimize effect size inflation.
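
To see how selection and power interact, here is a quick simulation; the particular numbers (a true effect of d = 0.5, 20 subjects per test, a p < .001 threshold) are arbitrary but not atypical, and the qualitative result, that the effects surviving selection are inflated well beyond their true size, holds quite generally.

```python
# Sketch: why low power inflates the effect sizes that survive selection.
# Many simulated one-sample t-tests with the same modest true effect; only "significant" ones are kept.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, n_tests, alpha = 0.5, 20, 10000, 0.001

samples = rng.normal(true_d, 1.0, size=(n_tests, n))
t, p = stats.ttest_1samp(samples, 0.0, axis=1)
observed_d = samples.mean(axis=1) / samples.std(axis=1, ddof=1)

selected = p < alpha
print(f"power at p < {alpha}: {selected.mean():.2f}")
print(f"true d = {true_d}, mean observed d among selected effects = {observed_d[selected].mean():.2f}")
```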

(7) What makes data sets independent? Are different sets of subjects required?

Most of the authors think (as I do too) that different sets of subjects are indeed required in order to ensure independence. Here’s Nichols:

Only data sets collected on distinct individuals can be assured to be independent. Splitting an individual’s data (e.g., using run 1 and run 2 to create two data sets) does not yield independence at the group level, as each subject’s true random effect will correlate the data sets.

Put differently, splitting data within subjects only eliminates measurement error, and not sampling error. You could in theory measure activation perfectly reliably (in which case the two halves of subjects’ data would be perfectly correlated) and still have grossly inflated effects, simply because the multivariate distribution of scores in your sample doesn’t accurately reflect the distribution in the population. So, as Nichols points out, you always need new subjects if you want to be absolutely certain your analyses are independent. But since this generally isn’t feasible, I’d argue we should worry less about whether or not our data sets are completely independent, and more about reporting results in a way that makes the presence of any bias as clear as possible.
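
Here is a small simulation of that point; all of the numbers (20 subjects, 5,000 voxels, equal between-subject and within-subject variance) are arbitrary. There is no true effect anywhere, yet voxels selected in one half of the data still look impressive in the other half from the same subjects, because each subject's random effect is shared across halves; only a genuinely new sample of subjects brings the estimate back down to zero.

```python
# Sketch: splitting runs within subjects does not make a group analysis independent.
import numpy as np

rng = np.random.default_rng(1)
n_sub, n_vox, tau, sigma = 20, 5000, 1.0, 1.0   # between-subject SD and within-subject noise SD
true_mean = 0.0                                  # no real effect at any voxel

subject_effects = rng.normal(true_mean, tau, size=(n_sub, n_vox))
half1 = subject_effects + rng.normal(0, sigma, size=(n_sub, n_vox))
half2 = subject_effects + rng.normal(0, sigma, size=(n_sub, n_vox))
new_sample = rng.normal(true_mean, tau, size=(n_sub, n_vox)) + rng.normal(0, sigma, size=(n_sub, n_vox))

# Select voxels with the largest group effects in half 1, then "validate" them elsewhere.
group1 = half1.mean(axis=0)
top = np.argsort(group1)[-50:]
print(f"half-1 estimate (selected voxels):  {group1[top].mean():.2f}")
print(f"half-2 estimate (same subjects):    {half2.mean(axis=0)[top].mean():.2f}")       # still inflated
print(f"estimate in new subjects:           {new_sample.mean(axis=0)[top].mean():.2f}")  # back near zero
```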

(8) What information can one glean from data selected for a certain effect?

I think this is kind of a moot question, since virtually all data are susceptible to some form of selection bias (scientists generally don’t write papers detailing all the analyses they conducted that didn’t pan out!). As I note above, I think it’s a bad idea to disregard effect sizes entirely; they’re actually what we should be focusing most of our attention on. Better to report confidence intervals that accurately reflect the selection procedure and make the uncertainty around the point estimate clear.

(9) Are visualizations of nonindependent data helpful to illustrate the claims of a paper?

Not in cases where there’s an extremely strong dependency between the selection criteria and the effect size estimate. In cases of weak to moderate dependency, visualization is fine so long as confidence bands are plotted alongside the best fit. Again, the key is to always be explicit about the limitations of the analysis and provide some indication of the uncertainty involved.

(10) Should data exploration be discouraged in favor of valid confirmatory analyses?

No. I agree with Poldrack’s sentiment here:

Our understanding of brain function remains incredibly crude, and limiting research to the current set of models and methods would virtually guarantee scientific failure. Exploration of new approaches is thus critical, but the findings must be confirmed using new samples and convergent methods.

(11) Is a confirmatory analysis safer than an exploratory analysis in terms of drawing neuroscientific conclusions?

In principle, sure, but in practice, it’s virtually impossible to determine which reported analyses really started out their lives as confirmatory analyses and which started life out as exploratory analyses and then mysteriously evolved into “a priori” predictions once the paper was written. I’m not saying there’s anything wrong with this–everyone reports results strategically to some extent–just that I don’t know that the distinction between confirmatory and exploratory analyses is all that meaningful in practice. Also, as the previous point makes clear, safety isn’t the only criterion we care about; we also want to discover new and unexpected findings, which requires exploration.

(12) What makes a whole-brain mapping analysis valid? What constitutes sufficient adjustment for multiple testing?

From a hypothesis testing standpoint, you need to ensure adequate control of the family-wise error (FWE) rate or false discovery rate (FDR). But as I suggested above, I think this only ensures validity in a limited sense; it doesn’t ensure that the results are actually going to be worth caring about. If you want to feel confident that any effects that survive are meaningfully large, you need to do the extra work up front and define what constitutes a meaningful effect size (and then test against that).
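
Just to make the mechanics concrete, here is a bare-bones Benjamini-Hochberg sketch over a vector of simulated voxel-wise p-values; real packages implement this (and cluster-level FWE corrections) for you, and the simulated p-value mix below is arbitrary. Note that surviving FDR correction still says nothing about whether the surviving effects are large enough to care about.

```python
# Sketch: Benjamini-Hochberg FDR control over a map of voxel-wise p-values (illustrative only).
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Return a boolean mask of p-values surviving BH FDR control at level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    ranks = np.arange(1, p.size + 1)
    below = p[order] <= q * ranks / p.size
    survives = np.zeros(p.size, dtype=bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])       # largest k satisfying the BH inequality
        survives[order[:cutoff + 1]] = True
    return survives

rng = np.random.default_rng(2)
pvals = np.concatenate([rng.uniform(size=9000), rng.beta(0.1, 10, size=1000)])  # mostly null voxels
print(f"{fdr_bh(pvals).sum()} of {pvals.size} tests survive FDR correction at q = .05")
```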

(13) How much power should a brain-mapping analysis have to be useful?

As much as possible! Concretely, the conventional target of 80% seems like a good place to start. But as I’ve argued before (e.g., here), that would require more than doubling conventional sample sizes in most cases. The reality is that fMRI studies are expensive, so we’re probably stuck with underpowered analyses for the foreseeable future. So we need to find other ways to compensate for that (e.g., relying more heavily on meta-analytic effect size estimates).
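
For a rough sense of the arithmetic, here is a back-of-the-envelope power calculation for a one-sample group contrast, assuming a medium true effect (d = 0.5) and a whole-brain threshold of p < .001; both numbers are illustrative rather than estimates from any particular study, but under these assumptions the required sample is several times larger than the 16 or so subjects typical of fMRI studies.

```python
# Sketch: sample size needed for 80% power in a one-sample (group-level) contrast.
import numpy as np
from scipy import stats

def power_one_sample_t(n, d, alpha=0.001):
    """Power of a two-sided one-sample t-test at significance level alpha."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    nc = d * np.sqrt(n)                       # noncentrality parameter under the true effect
    return stats.nct.sf(t_crit, df, nc)       # ignores the (negligible) lower rejection tail

d, alpha = 0.5, 0.001   # hypothetical effect size and a typical whole-brain threshold
n_needed = next(n for n in range(5, 500) if power_one_sample_t(n, d, alpha) >= 0.80)
print(f"n needed for 80% power at p < {alpha} with d = {d}: {n_needed}")
print(f"power of a typical n = 16 study under the same assumptions: {power_one_sample_t(16, d, alpha):.2f}")
```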

(14) In which circumstances are nonindependent selective analyses acceptable for scientific publication?

It depends on exactly what’s problematic about the analysis. Analyses that are truly circular and provide no new information should never be reported, but those constitute only a small fraction of all analyses. More commonly, the nonindependence simply amounts to selection bias: researchers tend to report only those results that achieve statistical significance, thereby inflating apparent effect sizes. I think the solution to this is to still report all key effect sizes, but to ensure they’re accompanied by confidence intervals and appropriate qualifiers.

Kriegeskorte N, Lindquist MA, Nichols TE, Poldrack RA, & Vul E (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of Cerebral Blood Flow & Metabolism. PMID: 20571517

time-on-task effects in fMRI research: why you should care

There’s a ubiquitous problem in experimental psychology studies that use behavioral measures that require participants to make speeded responses. The problem is that, in general, the longer people take to do something, the more likely they are to do it correctly. If I have you do a visual search task and ask you to tell me whether or not a display full of letters contains a red ‘X’, I’m not going to be very impressed that you can give me the right answer if I let you stare at the screen for five minutes before responding. In most experimental situations, the only way we can learn something meaningful about people’s capacity to perform a task is by imposing some restriction on how long people can take to respond. And the problem that then presents is that any changes we observe in the resulting variable we care about (say, the proportion of times you successfully detect the red ‘X’) are going to be confounded with the time people took to respond. Raise the response deadline and performance goes up; shorten it and performance goes down.

This fundamental fact about human performance is commonly referred to as the speed-accuracy tradeoff. The speed-accuracy tradeoff isn’t a law in any sense; it allows for violations, and there certainly are situations in which responding quickly can actually promote accuracy. But as a general rule, when researchers run psychology experiments involving response deadlines, they usually work hard to rule out the speed-accuracy tradeoff as an explanation for any observed results. For instance, if I have a group of adolescents with ADHD do a task requiring inhibitory control, and compare their performance to a group of adolescents without ADHD, I may very well find that the ADHD group performs more poorly, as reflected by lower accuracy rates. But the interpretation of that result depends heavily on whether or not there are also any differences in reaction times (RT). If the ADHD group took about as long on average to respond as the non-ADHD group, it might be reasonable to conclude that the ADHD group suffers a deficit in inhibitory control: they take as long as the control group to do the task, but they still do worse. On the other hand, if the ADHD group responded much faster than the control group on average, the interpretation would become more complicated. For instance, one possibility would be that the accuracy difference reflects differences in motivation rather than capacity per se. That is, maybe the ADHD group just doesn’t care as much about being accurate as about responding quickly. Maybe if you motivated the ADHD group appropriately (e.g., by giving them a task that was intrinsically interesting), you’d find that performance was actually equivalent across groups. Without explicitly considering the role of reaction time–and ideally, controlling for it statistically–the types of inferences you can draw about underlying cognitive processes are somewhat limited.

An important point to note about the speed-accuracy tradeoff is that it isn’t just a tradeoff between speed and accuracy; in principle, any variable that bears some systematic relation to how long people take to respond is going to be confounded with reaction time. In the world of behavioral studies, there aren’t that many other variables we need to worry about. But when we move to the realm of brain imaging, the game changes considerably. Nearly all fMRI studies measure something known as the blood-oxygen-level-dependent (BOLD) signal. I’m not going to bother explaining exactly what the BOLD signal is (there are plenty of other excellent explanations at varying levels of technical detail, e.g., here, here, or here); for present purposes, we can just pretend that the BOLD signal is basically a proxy for the amount of neural activity going on in different parts of the brain (that’s actually a pretty reasonable assumption, as emerging studies continue to demonstrate). In other words, a simplistic but not terribly inaccurate model is that when neurons in region X increase their firing rate, blood flow in region X also increases, and so in turn does the BOLD signal that fMRI scanners detect.

A critical question that naturally arises is just how strong the temporal relation is between the BOLD signal and underlying neuronal processes. From a modeling perspective, what we’d really like is a system that’s completely linear and time-invariant–meaning that if you double the duration of a stimulus presented to the brain, the BOLD response elicited by that stimulus also doubles, and it doesn’t matter when the stimulus is presented (i.e., there aren’t any funny interactions between different phases of the response, or with the responses to other stimuli). As it turns out, the BOLD response isn’t perfectly linear, but it’s pretty close. In a seminal series of studies in the mid-90s, Randy Buckner, Anders Dale and others showed that, at least for stimuli that aren’t presented extremely rapidly (i.e., a minimum of 1 – 2 seconds apart), we can reasonably pretend that the BOLD response sums linearly over time without suffering any serious ill effects. And that’s extremely fortunate, because it makes modeling brain activation with fMRI much easier to do. In fact, the vast majority of fMRI studies, which employ what are known as rapid event-related designs, implicitly assume linearity. If the hemodynamic response wasn’t approximately linear, we would have to throw out a very large chunk of the existing literature–or at least seriously question its conclusions.
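
To make the modeling assumption explicit, here is a small sketch of the linear time-invariant model that underlies standard event-related analyses. The double-gamma HRF parameters below are a common but by no means canonical choice (exact values differ across software packages), and since convolution is linear by construction, the check at the end simply spells out what the assumption buys you in the model rather than testing whether the brain actually obeys it.

```python
# Sketch: the linear time-invariant model behind most event-related fMRI analyses.
import numpy as np
from scipy import stats

dt = 0.1
t = np.arange(0, 32, dt)
# A common double-gamma parameterization (peak near 5 s, undershoot later); values are placeholders.
hrf = stats.gamma.pdf(t, 6) - 0.35 * stats.gamma.pdf(t, 12)
hrf /= hrf.max()

time = np.arange(0, 60, dt)
stim_a = np.zeros_like(time); stim_a[(time >= 10) & (time < 11)] = 1   # 1 s event at 10 s
stim_b = np.zeros_like(time); stim_b[(time >= 14) & (time < 15)] = 1   # 1 s event at 14 s

def bold(s):
    """Predicted BOLD response: stimulus train convolved with the HRF."""
    return np.convolve(s, hrf)[:time.size] * dt

both = bold(stim_a + stim_b)
summed = bold(stim_a) + bold(stim_b)
print(f"max |response to both - sum of separate responses| = {np.abs(both - summed).max():.2e}")  # ~0
```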

Aside from the fact that it lets us model things nicely, the assumption of linearity has another critical, but underappreciated, ramification for the way we do fMRI research. Which is this: if the BOLD response sums approximately linearly over time, it follows that two neural responses that have the same amplitude but differ in duration will produce BOLD responses with different amplitudes. To characterize that visually, here’s a figure from a paper I published with Deanna Barch, Jeremy Gray, Tom Conturo, and Todd Braver last year:

[Figure 1 from Yarkoni et al. (2009), PLoS ONE]

Each of these panels shows you the firing rates and durations of two hypothetical populations of neurons (on the left), along with the (observable) BOLD response that would result (on the right). Focus your attention on panel C first. What this panel shows you is what, I would argue, most people intuitively think of when they come across a difference in activation between two conditions. When you see time courses that clearly differ in their amplitude, it’s very natural to attribute a similar difference to the underlying neuronal mechanisms, and suppose that there must just be more firing going on in one condition than the other–where ‘more’ is taken to mean something like “firing at a higher rate”.

The problem, though, is that this inference isn’t justified. If you look at panel B, you can see that you get exactly the same pattern of observed differences in the BOLD response even when the amplitude of neuronal activation is identical, simply because there’s a difference in duration. In other words, if someone shows you a plot of two BOLD time courses for different experimental conditions, and one has a higher amplitude than the other, you don’t know whether that’s because there’s more neuronal activation in one condition than the other, or if processing is identical in both conditions but simply lasts longer in one than in the other. (As a technical aside, this equivalence only holds for short trials, when the BOLD response doesn’t have time to saturate. If you’re using longer trials–say 4 seconds or more–then it becomes fairly easy to tell apart changes in duration from changes in amplitude. But the vast majority of fMRI studies use much shorter trials, in which case the problem I describe holds.)
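
Here is a quick numerical illustration of that ambiguity, using the same toy double-gamma HRF as in the sketch above (again, the exact parameters are placeholders): a 1-second neural event at twice the firing rate and a 2-second event at the baseline rate produce BOLD responses that are essentially indistinguishable.

```python
# Sketch: for brief events, amplitude and duration trade off almost perfectly in the BOLD response.
import numpy as np
from scipy import stats

dt = 0.1
t = np.arange(0, 32, dt)
hrf = stats.gamma.pdf(t, 6) - 0.35 * stats.gamma.pdf(t, 12)   # same toy HRF as above
hrf /= hrf.max()

time = np.arange(0, 40, dt)
short_strong = np.where(time < 1, 2.0, 0.0)   # 1 s of "neural" activity at amplitude 2
long_weak = np.where(time < 2, 1.0, 0.0)      # 2 s of "neural" activity at amplitude 1

def bold(s):
    return np.convolve(s, hrf)[:time.size] * dt

b1, b2 = bold(short_strong), bold(long_weak)
print(f"peak amplitudes: {b1.max():.3f} vs {b2.max():.3f}")                   # very close
print(f"correlation of the two time courses: {np.corrcoef(b1, b2)[0, 1]:.4f}")  # effectively 1
```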

Now, functionally, this has some potentially very serious implications for the inferences we can draw about psychological processes based on observed differences in the BOLD response. What we would usually like to conclude when we report “more” activation for condition X than condition Y is that there’s some fundamental difference in the nature of the processes involved in the two conditions that’s reflected at the neuronal level. If it turns out that the reason we see more activation in one condition than the other is simply that people took longer to respond in one condition than in the other, and so were sustaining attention for longer, that can potentially undermine that conclusion.

For instance, if you’re contrasting a feature search condition with a conjunction search condition, you’re quite likely to observe greater activation in regions known to support visual attention. But since a central feature of conjunction search is that it takes longer than a feature search, it could theoretically be that the same general regions support both types of search, and what we’re seeing is purely a time-on-task effect: visual attention regions are activated for longer because it takes longer to complete the conjunction search, but these regions aren’t doing anything fundamentally different in the two conditions (at least at the level we can see with fMRI). So this raises an issue similar to the speed-accuracy tradeoff we started with. Other things being equal, the longer it takes you to respond, the more activation you’ll tend to see in a given region. Unless you explicitly control for differences in reaction time, your ability to draw conclusions about underlying neuronal processes on the basis of observed BOLD differences may be severely hampered.

It turns out that very few fMRI studies actually control for differences in RT. In an elegant 2008 study discussing different ways of modeling time-varying signals, Jack Grinband and colleagues reviewed a random sample of 170 studies and found that, “Although response times were recorded in 82% of event-related studies with a decision component, only 9% actually used this information to construct a regression model for detecting brain activity”. Here’s what that looks like (Panel C), along with some other interesting information about the procedures used in fMRI studies:

[Figure from Grinband et al. (2008), NeuroImage]
So only one in ten studies made any effort to control for RT differences; and Grinband et al argue in their paper that most of those papers didn’t model RT the right way anyway (personally I’m not sure I agree; I think there are tradeoffs associated with every approach to modeling RT–but that’s a topic for another post).

The relative lack of attention to RT differences is particularly striking when you consider what cognitive neuroscientists do care a lot about: differences in response accuracy. The majority of researchers nowadays make a habit of discarding all trials on which participants made errors. The justification we give for this approach–which is an entirely reasonable one–is that if we analyzed correct and incorrect trials together, we’d be confounding the processes we care about (e.g., differences between conditions) with activation that simply reflects error-related processes. So we drop trials with errors, and that gives us cleaner results.

I suspect that the reasons for our concern with accuracy effects but not RT effects in fMRI research are largely historical. In the mid-90s, when a lot of formative cognitive neuroscience was being done, people (most of them then located in Pittsburgh, working in Jonathan Cohen‘s group) discovered that the brain doesn’t like to make errors. When people make mistakes during task performance, they tend to recognize that fact; on a neural level, frontoparietal regions implicated in goal-directed processing–and particularly the anterior cingulate cortex–ramp up activation substantially. The interpretation of this basic finding has been a source of much contention among cognitive neuroscientists for the past 15 years, and remains a hot area of investigation. For present purposes though, we don’t really care why error-related activation arises; the point is simply that it does arise, and so we do the obvious thing and try to eliminate it as a source of error from our analyses. I suspect we don’t do the same for RT not because we lack principled reasons to, but because there haven’t historically been clear-cut demonstrations of the effects of RT differences on brain activation.

The goal of the 2009 study I mentioned earlier was precisely to try to quantify those effects. The hypothesis my co-authors and I tested was straightforward: if brain activity scales approximately linearly with RT (as standard assumptions would seem to entail), we should see a strong “time-on-task” effect in brain areas that are associated with the general capacity to engage in goal-directed processing. In other words, on trials when people take longer to respond, activation in frontal and parietal regions implicated in goal-directed processing and cognitive control should increase. These regions are often collectively referred to as the “task-positive” network (Fox et al., 2005), in reference to the fact that they tend to show activation increases any time people are engaging in goal-directed processing, irrespective of the precise demands of the task. We figured that identifying a time-on-task effect in the task-positive network would provide a nice demonstration of the relation between RT differences and the BOLD response, since it would underscore the generality of the problem.

Concretely, what we did was take five datasets that were lying around from previous studies, and do a multi-study analysis focusing specifically on RT-related activation. We deliberately selected studies that employed very different tasks, designs, and even scanners, with the aim of ensuring the generalizability of the results. Then, we identified regions in each study in which activation covaried with RT on a trial-by-trial basis. When we put all of the resulting maps together and picked out only those regions that showed an association with RT in all five studies, here’s the map we got:

[Figure 2 from Yarkoni et al. (2009), PLoS ONE]

There’s a lot of stuff going on here, but in the interest of keeping this post slightly less excruciatingly long, I’ll stick to the frontal areas. What we found, when we looked at the timecourse of activation in those regions, was the predicted time-on-task effect. Here’s a plot of the timecourses from all five studies for selected regions:

[Figure 4 from Yarkoni et al. (2009), PLoS ONE]

If you focus on the left time course plot for the medial frontal cortex (labeled R1, in row B), you can see that increases in RT are associated with increased activation in medial frontal cortex in all five studies (the way RT effects are plotted here is not completely intuitive, so you may want to read the paper for a clearer explanation). It’s worth pointing out that while these regions were all defined based on the presence of an RT effect in all five studies, the precise shape of that RT effect wasn’t constrained; in principle, RT could have exerted very different effects across the five studies (e.g., positive in some, negative in others; early in some, later in others; etc.). So the fact that the timecourses look very similar in all five studies isn’t entailed by the analysis, and it’s an independent indicator that there’s something important going on here.

The clear-cut implication of these findings is that a good deal of BOLD activation in most studies can be explained simply as a time-on-task effect. The longer you spend sustaining goal-directed attention to an on-screen stimulus, the more activation you’ll show in frontal regions. It doesn’t much matter what it is that you’re doing; these are ubiquitous effects (since this study, I’ve analyzed many other datasets in the same way, and never fail to find the same basic relationship). And it’s worth keeping in mind that these are just the regions that show common RT-related activation across multiple studies; what you’re not seeing are regions that covary with RT only within one (or for that matter, four) studies. I’d argue that most regions that show involvement in a task are probably going to show variations with RT. After all, that’s just what falls out of the assumption of linearity–an assumption we all depend on in order to do our analyses in the first place.

Exactly what proportion of results can be explained away as time-on-task effects? That’s impossible to determine, unfortunately. I suspect that if you could go back through the entire fMRI literature and magically control for trial-by-trial RT differences in every study, a very large number of published differences between experimental conditions would disappear. That doesn’t mean those findings were wrong or unimportant, I hasten to note; there are many cases in which it’s perfectly appropriate to argue that differences between conditions should reflect a difference in quantity rather than quality. Still, it’s clear that in many cases that isn’t the preferred interpretation, and controlling for RT differences probably would have changed the conclusions. As just one example, much of what we think of as a “conflict” effect in the medial frontal cortex/anterior cingulate could simply reflect prolonged attention on high-conflict trials. When you’re experiencing cognitive difficulty or conflict, you tend to slow down and take longer to respond, which is naturally going to produce BOLD increases that scale with reaction time. The question as to what remains of the putative conflict signal after you control for RT differences is one that hasn’t really been adequately addressed yet.

The practical question, of course, is what we should do about this. How can we minimize the impact of the time-on-task effect on our results, and, in turn, on the conclusions we draw? I think the most general suggestion is to always control for reaction time differences. That’s really the only way to rule out the possibility that any observed differences between conditions simply reflect differences in how long it took people to respond. This leaves aside the question of exactly how one should model out the effect of RT, which is a topic for another time (though I discuss it at length in the paper, and the Grinband paper goes into even more detail). Unfortunately, there isn’t any perfect solution; as with most things, there are tradeoffs inherent in pretty much any choice you make. But my personal feeling is that almost any approach one could take to modeling RT explicitly is a big step in the right direction.

A second, and nearly as important, suggestion is to not only control for RT differences, but to do it both ways. Meaning, you should run your model both with and without an RT covariate, and carefully inspect both sets of results. Comparing the results across the two models is what really lets you draw the strongest conclusions about whether activation differences between two conditions reflect a difference of quality or quantity. This point applies regardless of which hypothesis you favor: if you think two conditions draw on very similar neural processes that differ only in degree, your prediction is that controlling for RT should make effects disappear. Conversely, if you think that a difference in activation reflects the recruitment of qualitatively different processes, you’re making the prediction that the difference will remain largely unchanged after controlling for RT. Either way, you gain important information by comparing the two models.
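
Here is a toy end-to-end version of those two suggestions. Everything below is simulated and every number is arbitrary: the 'true' signal has the same neural amplitude in both conditions and differs only in how long the activity lasts (a pure time-on-task effect), condition B simply has longer RTs, and the same GLM is fit once without and once with a mean-centered trial-by-trial RT regressor. This is not anyone's actual pipeline, just a sketch of the logic.

```python
# Sketch: a simulated voxel in which a "condition difference" is purely a time-on-task effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
dt, n_trials = 0.5, 60
t_hrf = np.arange(0, 32, dt)
hrf = stats.gamma.pdf(t_hrf, 6) - 0.35 * stats.gamma.pdf(t_hrf, 12)   # toy double-gamma HRF

onsets = np.arange(n_trials) * 16.0                    # slow event-related design, 16 s apart
condition = np.tile([0, 1], n_trials // 2)             # A = 0, B = 1, alternating
rt = np.where(condition == 1, 1.6, 0.8) + rng.uniform(0, 0.4, n_trials)   # B is simply slower

n_scans = int((onsets[-1] + 30) / dt)
time = np.arange(n_scans) * dt

def regressor(durations, amplitudes=1.0):
    """Trial-specific boxcars (given durations/amplitudes) convolved with the HRF."""
    s = np.zeros(n_scans)
    amplitudes = np.broadcast_to(amplitudes, (n_trials,))
    for on, dur, amp in zip(onsets, durations, amplitudes):
        s[(time >= on) & (time < on + dur)] += amp
    return np.convolve(s, hrf)[:n_scans] * dt

# "True" neural signal: same amplitude in both conditions, but activity lasts as long as the RT.
y = regressor(rt) + rng.normal(0, 0.05, n_scans)

x_a = regressor(np.full(n_trials, 1.0), condition == 0)     # condition A events, fixed 1 s duration
x_b = regressor(np.full(n_trials, 1.0), condition == 1)     # condition B events, fixed 1 s duration
x_rt = regressor(np.full(n_trials, 1.0), rt - rt.mean())    # mean-centered trial-by-trial RT modulator

def fit(columns):
    X = np.column_stack([np.ones(n_scans)] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_no_rt = fit([x_a, x_b])
b_rt = fit([x_a, x_b, x_rt])
print(f"B - A contrast, no RT covariate:   {b_no_rt[2] - b_no_rt[1]:.2f}")   # looks like a condition effect
print(f"B - A contrast, with RT covariate: {b_rt[2] - b_rt[1]:.2f}")          # shrinks toward zero
```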

The last suggestion I have to offer is probably obvious, and not very helpful, but for what it’s worth: be cautious about how you interpret differences in activation any time there are sizable differences in task difficulty and/or mean response time. It’s tempting to think that if you always analyze only trials with correct responses and follow the suggestions above to explicitly model RT, you’ve done all you need in order to perfectly control for the various tradeoffs and relationships between speed, accuracy, and cognitive effort. It really would be nice if we could all sleep well knowing that our data have unambiguous interpretations. But the truth is that all of these techniques for “controlling” for confounds like difficulty and reaction time are imperfect, and in some cases have known deficiencies (for instance, it’s not really true that throwing out error trials eliminates all error-related activation from analysis–sometimes when people don’t know the answer, they guess right!). That’s not to say we should stop using the tools we have–which offer an incredibly powerful way to peer inside our gourds–just that we should use them carefully.


Yarkoni T, Barch DM, Gray JR, Conturo TE, & Braver TS (2009). BOLD correlates of trial-by-trial reaction time variability in gray and white matter: a multi-study fMRI analysis. PLoS ONE, 4(1). PMID: 19165335

Grinband J, Wager TD, Lindquist M, Ferrera VP, & Hirsch J (2008). Detection of time-varying signals in event-related fMRI designs. NeuroImage, 43(3), 509-520. PMID: 18775784

elsewhere on the net

Some neat links from the past few weeks:

  • You Are Not So Smart: A celebration of self-delusion. An excellent blog by journalist David McRaney that deconstructs common myths about the way the mind works.
  • NPR has a great story by Jon Hamilton about the famous saga of Einstein’s brain and what it’s helped teach us about brain function. [via Carl Zimmer]
  • The Neuroskeptic has a characteristically excellent 1,000 word explanation of how fMRI works.
  • David Rock has an interesting post on some recent work from Baumeister’s group purportedly showing that it’s good to believe in free will (whether or not it exists). My own feeling about this is that Baumeister’s not really studying people’s philosophical views about free will, but rather a construct closely related to self-efficacy and locus of control. But it’s certainly an interesting line of research.
  • The Prodigal Academic is a great new blog about all things academic. I’ve found it particularly interesting since several of the posts so far have been about job searches and job-seeking–something I’ll be experiencing my fill of over the next few months.
  • Prof-like Substance has a great 5-part series (1, 2, 3, 4, 5) on how blogging helps him as an academic. My own (much less eloquent) thoughts on that are here.
  • Cameron Neylon makes a nice case for the development of social webs for data mining.
  • Speaking of data mining, Michael Driscoll of Dataspora has an interesting pair of posts extolling the virtues of Big Data.
  • And just to balance things out, there’s this article in the New York Times by John Allen Paulos that offers some cautionary words about the challenges of using empirical data to support policy decisions.
  • On a totally science-less note, some nifty drawings (or is that photos?) by Ben Heine (via Crooked Brains).

fMRI, not coming to a courtroom near you so soon after all

That’s a terribly constructed title, I know, but bear with me. A couple of weeks ago I blogged about a courtroom case in Tennessee where the defense was trying to introduce fMRI to the courtroom as a way of proving the defendant’s innocence (his brain, apparently, showed no signs of guilt). The judge’s verdict is now in, and…. fMRI is out. In United States v. Lorne Semrau, Judge Pham recommended that the government’s motion to exclude fMRI scans from consideration be granted. That’s the outcome I think most respectable cognitive neuroscientists were hoping for; as many people associated with the case or interviewed about it have noted (and as the judge recognized), there just isn’t a shred of evidence to suggest that fMRI has any utility as a lie detector in real-world situations.

The judge’s decision, which you can download in PDF form here (hat-tip: Thomas Nadelhoffer), is really quite elegant, and worth reading (or at least skimming through). He even manages some subtle snark in places. For instance (my italics):

Regarding the existence and maintenance of standards, Dr. Laken testified as to the protocols and controlling standards that he uses for his own exams. Because the use of fMRI-based lie detection is still in its early stages of development, standards controlling the real-life application have not yet been established. Without such standards, a court cannot adequately evaluate the reliability of a particular lie detection examination. Cordoba, 194 F.3d at 1061. Assuming, arguendo, that the standards testified to by Dr. Laken could satisfy Daubert, it appears that Dr. Laken violated his own protocols when he re-scanned Dr. Semrau on the AIMS tests SIQs, after Dr. Semrau was found “deceptive” on the first AIMS tests scan. None of the studies cited by Dr. Laken involved the subject taking a second exam after being found to have been deceptive on the first exam. His decision to conduct a third test begs the question whether a fourth scan would have revealed Dr. Semrau to be deceptive again.

The absence of real-life error rates, lack of controlling standards in the industry for real-life exams, and Dr. Laken’s apparent deviation from his own protocols are negative factors in the analysis of whether fMRI-based lie detection is scientifically valid. See Bonds, 12 F.3d at 560.

The reference here is to the fact that Laken and his company scanned Semrau (the defendant) on three separate occasions. The first two scans were planned ahead of time, but the third apparently wasn’t:

From the first scan, which included SIQs relating to defrauding the government, the results showed that Dr. Semrau was “not deceptive.” However, from the second scan, which included SIQs relating to AIMS tests, the results showed that Dr. Semrau was “being deceptive.” According to Dr. Laken, “testing indicates that a positive test result in a person purporting to tell the truth is accurate only 6% of the time.” Dr. Laken also believed that the second scan may have been affected by Dr. Semrau’s fatigue. Based on his findings on the second test, Dr. Laken suggested that Dr. Semrau be administered another fMRI test on the AIMS tests topic, but this time with shorter questions and conducted later in the day to reduce the effects of fatigue. … The third scan was conducted on January 12, 2010 at around 7:00 p.m., and according to Dr. Laken, Dr. Semrau tolerated it well and did not express any fatigue. Dr. Laken reviewed this data on January 18, 2010, and concluded that Dr. Semrau was not deceptive. He further stated that based on his prior studies, “a finding such as this is 100% accurate in determining truthfulness from a truthful person.”

I may very well be misunderstanding something here (and so might the judge), but if the positive predictive value of the test is only 6%, I’m guessing that the probability that the test is seriously miscalibrated is somewhat higher than 6%. Especially since the base rate for lying among people who are accused of committing serious fraud is probably reasonably high (this matters, because when base rates are very low, low positive predictive values are not unexpected). But then, no one really knows how to calibrate these tests properly, because the data you’d need to do that simply don’t exist. Serious validation of fMRI as a tool for lie detection would require assembling a large set of brain scans from defendants accused of various crimes (real crimes, not simulated ones) and using that data to predict whether those defendants were ultimately found guilty or not. There really isn’t any substitute for doing a serious study of that sort, but as far as I know, no one’s done it yet. Fortunately, the few judges who’ve had to rule on the courtroom use of fMRI seem to recognize that.
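
To make the base-rate point concrete, here’s the standard Bayes arithmetic with hypothetical (and fairly generous) sensitivity and specificity values of .90 each. The numbers are made up, but they show why a low positive predictive value is unremarkable when hardly anyone in the tested population is lying, and much harder to explain when the base rate of lying is plausibly high:

    # toy Bayes calculation with assumed sensitivity/specificity of 0.90
    def ppv(sensitivity, specificity, prevalence):
        true_pos = sensitivity * prevalence
        false_pos = (1 - specificity) * (1 - prevalence)
        return true_pos / (true_pos + false_pos)

    for prev in (0.01, 0.10, 0.50, 0.80):
        print(f"base rate of lying = {prev:.2f} -> PPV = {ppv(0.90, 0.90, prev):.2f}")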


fMRI: coming soon to a courtroom near you?

Science magazine has a series of three (1, 2, 3) articles by Greg Miller over the past few days covering an interesting trial in Tennessee. The case itself seems like garden variety fraud, but the novel twist is that the defense is trying to introduce fMRI scans into the courtroom in order to establish the defendant’s innocence. As far as I can tell from Miller’s articles, the only scientists defending the use of fMRI as a lie detector are those employed by Cephos (the company that provides the scanning service); the other expert witnesses (including Marc Raichle!) seem pretty adamant that admitting fMRI scans as evidence would be a colossal mistake. Personally, I think there are several good reasons why it’d be a terrible, terrible idea to let fMRI scans into the courtroom. In one way or another, they all boil down to the fact that there just isn’t a shred of evidence to support the use of fMRI as a lie detector in real-world (i.e., non-contrived) situations. Greg Miller has a quote from Martha Farah (who’s a spectator at the trial) that sums it up eloquently:

Farah sounds like she would have liked to chime in at this point about some things that weren’t getting enough attention. “No one asked me, but the thing we have not a drop of data on is [the situation] where people have their liberty at stake and have been living with a lie for a long time,” she says. She notes that the only published studies on fMRI lie detection involve people telling trivial lies with no threat of consequences. No peer-reviewed studies exist on real world situations like the case before the Tennessee court. Moreover, subjects in the published studies typically had their brains scanned within a few days of lying about a fake crime, whereas Semrau’s alleged crimes began nearly 10 years before he was scanned.

I’d go even further than this, and point out that even if there were studies that looked at ecologically valid lying, it’s unlikely that we’d be able to make any reasonable determination as to whether or not a particular individual was lying about a particular event. For one thing, most studies deal with group averages and not single-subject prediction; you might think that a highly statistically significant difference between two conditions (e.g., lying and not lying) necessarily implies a reasonable ability to make predictions at the single-subject level, but you’d be surprised. Prediction intervals for individual observations are typically extremely wide even when there’s a clear pattern at the group level. It’s just easier to make general statements about differences between conditions or groups than it is about what state a particular person is likely to be in given a certain set of conditions.
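
If you doubt that, here’s a two-minute simulation (the effect size is arbitrary, and the data have nothing to do with any real lie-detection study) showing how a group effect can be overwhelmingly significant while single-subject classification based on the very same signal hovers around 70%:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_subjects, true_d = 100, 0.5                       # assumed effect size in SD units
    diff = rng.normal(true_d, 1.0, n_subjects)          # lie-minus-truth signal per subject

    t, p = stats.ttest_1samp(diff, 0.0)
    accuracy = np.mean(diff > 0)                        # call someone a "liar" if their difference > 0
    print(f"group t = {t:.1f}, p = {p:.1e}")            # comfortably significant
    print(f"single-subject accuracy = {accuracy:.0%}")  # ~70%: far from courtroom-grade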

There is, admittedly, an emerging body of literature that uses pattern classification to make predictions about mental states at the level of individual subjects, and accuracy in these types of applications can sometimes be quite high. But these studies invariably operate on relatively restrictive sets of stimuli within well-characterized domains (e.g., predicting which word out of a set of 60 a subject is looking at). This really isn’t “mind reading” in the sense that most people (including most judges and jurors) tend to think of it. And of course, even if you could make individual-level predictions reasonably accurately, it’s not clear that that’s good enough for the courtroom. As a scientist, I might be thrilled if I could predict which of 10 words you’re looking at with 80% accuracy (which, to be clear, is currently a pipe dream in the context of studies of ecologically valid lying). But as a lawyer, I’d probably be very skeptical of another lawyer who claimed my predictions vindicated their client. The fact that increased anterior cingulate activation tends to accompany lying on average isn’t a good reason to convict someone unless you can be reasonably certain that increased ACC activation accompanies lying for that person in that context when presented with that bit of information. At the moment, that’s a pretty hard sell.

As an aside, the thing I find perhaps most curious about the whole movement to use fMRI scanners as lie detectors is that there are very few studies that directly pit fMRI against more conventional lie detection techniques–namely, the polygraph. You can say what you like about the polygraph–and many people don’t think polygraph evidence should be admissible in court either–but at least it’s been around for a long time, and people know more or less what to expect from it. It’s easy to forget that it only makes sense to introduce fMRI scans (which are decidedly costly) as evidence if they do substantially better than polygraphs. Otherwise you’re just wasting a lot of money for a fancy brain image, and you could have gotten just as much information by simply measuring someone’s arousal level as you yell at them about that bloodstained Cadillac that was found parked in their driveway on the night of January 7th. But then, maybe that’s the whole point of trying to introduce fMRI to the courtroom; maybe lawyers know that the polygraph has a tainted reputation, and are hoping that fancy new brain scanning techniques that come with pretty pictures don’t carry the same baggage. I hope that’s not true, but I’ve learned to be cynical about these things.

At any rate, the Science articles are well worth a read, and since the judge hasn’t yet decided whether or not to allow the fMRI evidence, the next couple of weeks should be interesting…

[hat-tip: Thomas Nadelhoffer]

green chile muffins and brains in a truck: weekend in albuquerque

I spent the better part of last week in Albuquerque for the Mind Research Network fMRI course. It’s a really well-organized 3-day course, and while it’s geared toward people without much background in fMRI, I found a lot of the lectures really helpful. It’s hard, if not impossible, to get everything right when you run an fMRI study; the magnet is very fickle and doesn’t like to do what you ask it to–and that assumes you’re asking it to do the right thing, which is also not so common. So I find I learn something interesting from almost every fMRI talk I attend, even when it’s stuff I thought I already knew.

Of course, since I know very little, there’s also almost always stuff that’s completely new to me. In this case, it was a series of lectures on independent components analysis (ICA) of fMRI data, focusing on Vince Calhoun’s group’s implementation of ICA in the GIFT toolbox. It’s a beautifully implemented set of tools that offer a really powerful alternative to standard univariate analysis, and I’m pretty sure I’ll be using it regularly from now on. So the ICA lectures alone were worth the price of admission. (In the interest of full disclosure, I should note that my post-doc mentor, Tor Wager, is one of the organizers of the MRN course, and I wasn’t paying the $700 tab out of pocket. But I’m not getting any kickbacks to say nice things about the course, I promise.)
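
For anyone curious about what ICA is doing conceptually, here’s a bare-bones illustration using scikit-learn’s FastICA on simulated signals. This is emphatically not GIFT (a MATLAB toolbox with far more machinery for group ICA of real fMRI data), just the core idea of recovering independent component time courses from mixed “voxel” signals, with invented sources and mixing weights:

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(2)
    t = np.linspace(0, 200, 1000)
    sources = np.column_stack([np.sin(0.2 * t),            # slow "network" fluctuation
                               np.sign(np.sin(0.9 * t))])   # a second, faster component
    mixing = rng.normal(size=(2, 50))                        # 50 fake voxels
    voxels = sources @ mixing + 0.2 * rng.normal(size=(1000, 50))

    ica = FastICA(n_components=2, random_state=0)
    recovered = ica.fit_transform(voxels)                    # estimated component time courses
    print(recovered.shape)                                   # (1000, 2)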

Between the lectures and the green chile corn muffins, I didn’t get to see much of Albuquerque (except from the air, where the urban sprawl makes the city seem much larger than its actual population of 800k people would suggest), so I’ll reserve judgment for another time. But the MRN itself is a pretty spectacular facility. Aside from a 3T Siemens Trio magnet, they also have a 1.5T mobile scanner built into a truck. It’s mostly used to scan inmates in the New Mexico prison system (you’ll probably be surprised to learn that they don’t let hardened criminals out of jail to participate in scientific experiments–so the scanner has to go to jail instead). We got a brief tour of the mobile scanner and it was pretty awesome. Which is to say, it beats the pants off my Honda.

There are also some parts of the course I don’t remember so well. Here’s a (blurry) summary of those parts, courtesy of Alex Shackman:

[Blurry photo: Scott, Tor, and me in Albuquerque. Caption: BlurryScott, BlurryTor, and BlurryTal: The Boulder branch of the lab, Albuquerque 2010 edition]

fMRI becomes big, big science

There are probably lots of criteria you could use to determine the relative importance of different scientific disciplines, but the one I like best is the Largest Number of Authors on a Paper. Physicists have long had their hundred-authored papers (see for example this individual here; be sure to click on the “show all authors/affiliations” link), and with the initial sequencing and analysis of the human genome, which involved contributions from 452 different persons, molecular geneticists also joined the ranks of Officially Big Science. Meanwhile, us cognitive neuroscientists have long had to content ourselves with silly little papers that have only four to seven authors (maybe a dozen on a really good day). Which means, despite the pretty pictures we get to put in our papers, we’ve long had this inferiority complex about our work, and a nagging suspicion that it doesn’t really qualify as big science (full disclosure: so when I say “we”, I probably just mean “I”).

UNTIL NOW.

Thanks to the efforts of Bharat Biswal and 53 collaborators (yes, I counted) reported in a recent paper in PNAS, fMRI is now officially Big, Big Science. Granted, 54 authors is still small potatoes in physics-and-biology-land. And for all I know, there could be other fMRI papers with even larger author lists out there that I’ve missed.  BUT THAT’S NOT THE POINT. The point is, people like me now get to run around and say we do something important.

You might think I’m being insincere here, and that I’m really poking fun at ridiculously long author lists that couldn’t possibly reflect meaningful contributions from that many people. Well, I’m not. While I’m not seriously suggesting that the mark of good science is how many authors are on the paper, I really do think that the prevalence of long author lists in a discipline is an important sign of that discipline’s maturity, and that the fact that you can get several dozen contributors to a single paper means you’re seeing a level of collaboration across different labs that previously didn’t exist.

The importance of large-scale collaboration is one of the central elements of the new PNAS article, which is appropriately entitled Toward discovery science of human brain function. What Biswal et al have done is compile the largest publicly-accessible fMRI dataset on the planet, consisting of over 1,400 scans from 35 different centers. All of the data, along with some tools for analysis, are freely available for download from NITRC. Be warned though: you’re probably going to need a couple of terabytes of free space if you want to download the entire dataset.

You might be wondering why no one’s assembled an fMRI dataset of this scope until now; after all, fMRI isn’t that new a technique, having been around for about 20 years now. The answer (or at least, one answer) is that it’s not so easy–and often flatly impossible–to combine raw fMRI datasets in any straightforward way. The problem is that the results of any given fMRI study only really make sense in the context of a particular experimental design. Functional MRI typically measures the change in signal associated with some particular task, which means that you can’t really go about combining the results of studies of phonological processing with those of thermal pain and obtain anything meaningful (actually, this isn’t entirely true; there’s a movement afoot to create image-based centralized databases that will afford meta-analyses on an even more massive scale,  but that’s a post for another time). You need to ensure that the tasks people performed across different sites are at least roughly in the same ballpark.

What allowed Biswal et al  to consolidate datasets to such a degree is that they focused exclusively on one particular kind of cognitive task. Or rather, they focused on a non-task: all 1400+ scans in the 1000 Functional Connectomes Project (as they’re calling it) are from participants being scanned during the “resting state”. The resting state is just what it sounds like: participants are scanned while they’re just resting; usually they’re given no specific instructions other than to lie still, relax, and not fall asleep. The typical finding is that, when you contrast this resting state with activation during virtually any kind of goal-directed processing, you get widespread activation increases in a network that’s come to be referred to as the “default” or “task-negative” network (in reference to the fact that it’s maximally active when people are in their “default” state).

One of the main (and increasingly important) applications of resting state fMRI data is in functional connectivity analyses, which aim to identify patterns of coactivation across different regions rather than mean-level changes associated with some task. The fundamental idea is that you can get a lot of traction on how the brain operates by studying how different brain regions interact with one another spontaneously over time, without having to impose an external task set. The newly released data is ideal for this kind of exploration, since you have a simply massive dataset that includes participants from all over the world scanned in a range of different settings using different scanners. So if you want to explore the functional architecture of the human brain during the resting state, this should really be your one-stop shop. (In fact, I’m tempted to say that there’s going to be much less incentive for people to collect resting-state data from now on, since there really isn’t much you’re going to learn from one sample of 20 – 30 people that you can’t learn from 1,400 people from 35+ combined samples).
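
In case the term is unfamiliar, here’s roughly what the simplest seed-based version of this looks like: pick a seed region’s time course and correlate it with every other region’s time course. The data and region labels below are simulated and made up; real analyses involve a lot more preprocessing (nuisance regression, filtering, and so on):

    import numpy as np

    rng = np.random.default_rng(3)
    n_timepoints, n_regions = 300, 20
    shared = rng.normal(size=(n_timepoints, 1))                # a common slow fluctuation
    data = 0.6 * shared + rng.normal(size=(n_timepoints, n_regions))

    seed = data[:, 0]                                          # pretend region 0 is the seed
    connectivity = np.array([np.corrcoef(seed, data[:, i])[0, 1]
                             for i in range(n_regions)])
    print(np.round(connectivity, 2))                           # each region's correlation with the seed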

Aside from introducing the dataset to the literature, Biswal et al also report a number of new findings. One neat finding is that functional parcellation of the brain using seed-based connectivity (i.e., identifying brain regions that coactivate with a particular “seed” or target region) shows marked consistency across different sites, revealing what Biswal et al call a “universal architecture”. This type of approach by itself isn’t particularly novel, as similar techniques have been used before. But no one’s done it on anything approaching this scale. Here’s what the results look like:

You can see that different seeds produce different functional parcellations across the brain (the brighter areas denote ostensive boundaries).

Another interesting finding is the presence of gender and age differences in functional connectivity:

What this image shows is differences in functional connectivity with specific seed regions (the black dots) as a function of age (left) or gender (right). (The three rows reflect different techniques for producing the maps, with the upshot being that the results are very similar regardless of exactly how you do the analysis.) It isn’t often you get to see scatterplots with 1,400+ points in cognitive neuroscience, so this is a welcome sight. Although it’s also worth pointing out the inevitable downside of having huge sample sizes, which is that even tiny effects attain statistical significance. Which is to say, while the above findings are undoubtedly more representative of gender and age differences in functional connectivity than anything else you’re going to see for a long time, notice that they’re very small effects (e.g., in the right panels, you can see that the differences between men and women are only a fraction of a standard deviation in size, despite the fact that these regions are probably selected because they show some of the “strongest” effects). That’s not meant as a criticism; it’s actually a very good thing, in that these modest effects are probably much closer to the truth than what previous studies have reported. Such findings should serve as an important reminder that most of the effects identified by fMRI studies are almost certainly massively inflated by small sample size (as I’ve discussed before here and in this paper).
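
To put a number on the “even tiny effects attain statistical significance” point, here’s the p-value attached to the same modest correlation at a few different sample sizes, using the standard t-transformation of r (the r value itself is arbitrary):

    import numpy as np
    from scipy import stats

    def corr_pvalue(r, n):
        # two-tailed p for a Pearson r with n observations
        t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
        return 2 * stats.t.sf(abs(t), df=n - 2)

    for n in (30, 200, 1400):
        print(f"n = {n:4d}: r = 0.10 -> p = {corr_pvalue(0.10, n):.4f}")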

Anyway, the bottom line is that if you’ve ever thought to yourself, “gee, I wish I could do cutting-edge fMRI research, but I really don’t want to leave my house to get a PhD; it’s almost lunchtime,” this is your big chance. You can download the data, rejoice in the magic that is the resting state, and bathe yourself freely in functional connectivity. The Biswal et al paper bills itself as “a watershed event in functional imaging,” and it’s hard to argue otherwise. Researchers now have a definitive data set to use for analyses of functional connectivity and the resting state, as well as a model for what other similar data sets might look like in the future.

More importantly, with 54 authors on the paper, fMRI is now officially big science. Prepare to suck it, Human Genome Project!

Biswal, B., Mennes, M., Zuo, X., Gohel, S., Kelly, C., Smith, S., Beckmann, C., Adelstein, J., Buckner, R., Colcombe, S., Dogonowski, A., Ernst, M., Fair, D., Hampson, M., Hoptman, M., Hyde, J., Kiviniemi, V., Kotter, R., Li, S., Lin, C., Lowe, M., Mackay, C., Madden, D., Madsen, K., Margulies, D., Mayberg, H., McMahon, K., Monk, C., Mostofsky, S., Nagel, B., Pekar, J., Peltier, S., Petersen, S., Riedl, V., Rombouts, S., Rypma, B., Schlaggar, B., Schmidt, S., Seidler, R., Siegle, G., Sorg, C., Teng, G., Veijola, J., Villringer, A., Walter, M., Wang, L., Weng, X., Whitfield-Gabrieli, S., Williamson, P., Windischberger, C., Zang, Y., Zhang, H., Castellanos, F., & Milham, M. (2010). Toward discovery science of human brain function. Proceedings of the National Academy of Sciences, 107 (10), 4734-4739. DOI: 10.1073/pnas.0911855107

functional MRI and the many varieties of reliability

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself with adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.
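
If you do want to quantify it yourself for a test-retest design, a garden-variety intraclass correlation will do. Here’s a minimal ICC(1,1) computed from the one-way random-effects ANOVA mean squares; the data below are simulated (equal signal and noise variance, so the true value is .5), and in practice you’d feed in your own per-subject activation estimates from each session:

    import numpy as np

    def icc_1_1(x):
        # x: array of shape (n_subjects, k_sessions)
        n, k = x.shape
        subject_means = x.mean(axis=1)
        grand_mean = x.mean()
        ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
        ms_within = np.sum((x - subject_means[:, None]) ** 2) / (n * (k - 1))
        return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    rng = np.random.default_rng(4)
    trait = rng.normal(0, 1, (30, 1))                   # stable subject-level signal
    scans = trait + rng.normal(0, 1, (30, 2))           # two sessions with independent noise
    print(f"test-retest ICC ~ {icc_1_1(scans):.2f}")    # ~0.5 with these simulated variances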

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations” paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the detectable effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.
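
Here’s the shoelaces-and-blue example as a quick simulation, just to show the dissociation numerically: Cronbach’s alpha for the two-item “scale” comes out near zero, while the test-retest reliability of the summed score is close to perfect (all variances are made up, chosen so each item is measured with very little error):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 5000
    shoelace = rng.normal(0, 1, n)            # true shoelace-tying ability
    blue = rng.normal(0, 1, n)                # true liking of the color blue

    def noisy(signal):
        # add a little measurement error
        return signal + rng.normal(0, 0.2, n)

    def alpha(items):
        # Cronbach's alpha for a set of items
        items = np.column_stack(items)
        k = items.shape[1]
        return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                              / items.sum(axis=1).var(ddof=1))

    time1 = [noisy(shoelace), noisy(blue)]
    time2 = [noisy(shoelace), noisy(blue)]
    print(f"Cronbach's alpha: {alpha(time1):.2f}")               # ~0: no internal consistency
    retest = np.corrcoef(sum(time1), sum(time2))[0, 1]
    print(f"test-retest r of the summed score: {retest:.2f}")    # ~0.96: highly reliable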

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.
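
Here’s the DLPFC example as a simulation, with invented variance components: both sessions show a nearly perfect activation-performance correlation, because activation and performance are driven by the same session-specific state, yet the test-retest correlation of activation sits down around .2:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200
    stable = rng.normal(0, 0.5, n)                    # small stable trait component

    def session(stable):
        # both activation and performance track a session-specific state (fatigue, motivation, ...)
        state = rng.normal(0, 1.0, n)
        activation = stable + state + rng.normal(0, 0.2, n)
        performance = stable + state + rng.normal(0, 0.2, n)
        return activation, performance

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    act1, perf1 = session(stable)
    act2, perf2 = session(stable)
    print(f"activation-performance r, session 1: {corr(act1, perf1):.2f}")  # high: the effect is real
    print(f"activation-performance r, session 2: {corr(act2, perf2):.2f}")  # high again
    print(f"test-retest r of activation:         {corr(act1, act2):.2f}")   # low (~0.2)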

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actually the ideal situations for each of those two types of studies.
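
A quick simulation of that trade-off, with invented numbers: a “strong” manipulation (big, uniform condition effect, little between-subject spread) produces a monster group t but lousy test-retest reliability of the contrast, while a “weak” manipulation does the opposite:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n = 40

    def simulate(group_effect, between_sd, error_sd=0.5):
        subject = rng.normal(0, between_sd, n)                     # stable individual differences
        s1 = group_effect + subject + rng.normal(0, error_sd, n)   # contrast estimate, session 1
        s2 = group_effect + subject + rng.normal(0, error_sd, n)   # contrast estimate, session 2
        t, p = stats.ttest_1samp(s1, 0)
        retest = np.corrcoef(s1, s2)[0, 1]
        return t, retest

    for label, mu, sd in [("strong manipulation", 2.0, 0.2), ("weak manipulation", 0.1, 1.0)]:
        t, retest = simulate(mu, sd)
        print(f"{label}: group t = {t:5.1f}, test-retest r = {retest:.2f}")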

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between .40 and .58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).
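
The “shriveling” here is just the classical attenuation formula: the expected observed correlation is the true correlation scaled by the square root of the product of the two measures’ reliabilities, so dropping from reliabilities of .8 down to .5 or .3 takes a real bite out of detectable effect sizes (the true r of .40 below is arbitrary):

    def attenuated_r(true_r, rel_x, rel_y):
        # expected observed correlation given the reliabilities of both measures
        return true_r * (rel_x * rel_y) ** 0.5

    for rel in (0.8, 0.5, 0.3):
        print(f"true r = 0.40, both reliabilities = {rel}: "
              f"expected observed r = {attenuated_r(0.40, rel, rel):.2f}")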

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences