Still not selective: comment on comment on comment on Lieberman & Eisenberger (2015)

In my last post, I wrote a long commentary on a recent PNAS article by Lieberman & Eisenberger claiming to find evidence that the dorsal anterior cingulate cortex is “selective for pain” using my Neurosynth framework for large-scale fMRI meta-analysis. I argued that nothing about Neurosynth supports any of L&E’s major conclusions, and that they made several major errors of inference and analysis. L&E have now responded in detail on Lieberman’s blog. If this is the first you’re hearing of this exchange, and you have a couple of hours to spare, I’d suggest proceeding in chronological order: read the original article first, then my commentary, then L&E’s response then this response to the response (if you really want to leave no stone unturned, you could also read Alex Shackman’s commentary, which focuses on anatomical issues). If you don’t have that kind of time on your hands, just read on and hope for the best, I guess.

Before I get to the substantive issues, let me say that I appreciate L&E taking the time to reply to my comments in detail. I recognize that they have other things they could be doing (as do I), and I think their willingness to engage in this format sets an excellent example as the scientific community continues to move rapidly towards more open, rapid, and interactive online scientific discussion. I would encourage readers to weigh in on the debate themselves or raise any questions they feel haven’t been addressed (either here on on Lieberman’s blog).

With that said, I have to confess that I don’t think my view is any closer to L&E’s than it previously was. I disagree with L&E’s suggestions that we actually agree on more than I thought in my original post; if anything, I think the opposite is true. However, I did find L&E’s response helpful inasmuch as it helped me better understand where their misunderstandings of Neurosynth lie.

In what follows, I provided a detailed rebuttal to L&E’s response. I’ll warn you right now that this will be a very long and fairly detail-oriented post. In a (probably fruitless) effort to minimize reader boredom, I’ve divided my response into two sections, much as L&E did. In the first section, I summarize what I see as the two most important points of disagreement. In the second part, I quote L&E’s entire response and insert my own comments in-line (essentially responding email-style). I recognize that this is a rather unusual thing to do, and it makes for a decidedly long read (the post clocks in at over 20,000 words, though much of that is quotes from L&E’s response). but I did it this way because, frankly, I think L&E badly misrepresented much of what I said in my last post. I want to make sure the context is very clear to readers, so I’m going to quote the entirety of each of L&E’s points before I respond to them, so that at the very least I can’t be accused of quoting them out of context.

The big issues: reverse inference and selectivity

With preliminaries out of the way, let me summarize what I see as the two biggest problems with L&E’s argument (though, if you make it to the second half of this post, you’ll see that there are many other statistical and interpretational issues that are pretty serious in their own right). The first concerns their fundamental misunderstanding of the statistical framework underpinning Neurosynth, and its relation to reverse inference. The second concerns their use of a definition of selectivity that violates common sense and can’t possibly support their claim that “the dACC is selective for pain”.

Misunderstandings about the statistics of reverse inference

I don’t think there’s any charitable way to say this, so I’ll just be blunt: I don’t think L&E understand the statistics behind the images Neurosynth produces. In particular, I don’t think they understand the foundational role that the notion of probability plays in reverse inference. In their reply, L&E repeatedly say that my concerns about their lack of attention to effect sizes (i.e., conditional probabilities) are irrelevant, because they aren’t trying to make an argument about effect sizes. For example:

TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.

This captures perhaps the crux of L&E’s misunderstanding about both Neurosynth and reverse inference. Their argument here is basically that they don’t care about the actual probability of a term being used conditional on a particular pattern of activation; they just want to know that there’s “support for the reverse inference”. Unfortunately, it doesn’t work that way. The z-scores produced by Neurosynth (which are just transformations of p-values) don’t provide a direct index of the support for a reverse inference. What they measure is what p-values always measure: the probability of observing a result as extreme as the one observed under the assumption that the null of no effect is true. Conceptually, we can interpret this as a claim about the population-level association between a region and a term. Roughly, we can say that as z-scores increase, we can be more confident that there’s a non-zero (positive) relationship between a term and a brain region (though some Bayesians might want to take issue with even this narrow assertion). So, if all L&E wanted to say was, “there’s good evidence that there’s a non-zero association between pain and dACC activation across the population of published fMRI studies”, they would be in good shape. But what they’re arguing for is much stronger: they want to show that the dACC is selective for pain. And z-scores are of no use here. Knowing that there’s a non-zero association between dACC activation and pain tells us nothing about the level of specificity or selectivity of that association in comparison to other terms. If the z-score for the association between dACC activation and ‘pain’ occurrence is 12.4 (hugely statistically significant!), does that mean that the probability of pain conditional on dACC activation is closer to 95%, or to 25%? Does it tell us that dACC activation is a better marker of pain than conflict, vision, or memory? We don’t know. We literally have no way to tell, unless we’re actually willing to talk about probabilities within a Bayesian framework.

To demonstrate that this isn’t just a pedantic point about what could in theory happen, and that the issue is in fact completely fundamental to understanding what Neurosynth can and can’t support, here are three different flavors of the Neurosynth maps for the “pain” map:

Neurosynth reverse inference z-scores and posterior probabilities. Top: z-scores for two-way association test. Middle: posterior probability of pain assuming an empirical prior. Bottom: posterior probability of assuming uniform prior (p(Pain) = 0.5).
Neurosynth reverse inference z-scores and posterior probabilities for the term “pain”. Top: z-scores for two-way association test. Middle: posterior probability of pain assuming an empirical prior. Bottom: posterior probability of assuming uniform prior (p(Pain) = 0.5).

The top row is the reverse inference z-score map available on the website. The values here are z-scores, and what they tell us (being simply transformations of p-values) is nothing more than what the probability would be of observing an association at least as extreme as the one we observe under the null hypothesis of no effect. The second and third maps are both posterior probability maps. They display the probability of a study using the term ‘pain’ when activation is observed at each voxel in the brain. These maps aren’t available on the website (for reasons I won’t get into here, though the crux of it is that they’re extremely easy to misinterpret, for reasons that may become clear below)—though you can easily generate them with the Neurosynth core tools if you’re so inclined.

The main feature of these two probability maps that should immediately jump out at you is how strikingly different their numbers are. In the first map (i.e., middle row), the probabilities of “pain” max out around 20%; in the second map (bottom row), they range from around 70% – 90%. And yet, here I am telling you that these are both posterior probability maps that tell us the probability of a study using the term “pain” conditional on that study observing activity at each voxel. How could this be? How could the two maps be so different, if they’re supposed to be estimates of the same thing?

The answer lies in the prior. In the natural order of things, different terms occur with wildly varying frequencies in the literature (remember that Neurosynth is based on extraction of words from abstracts, not direct measurement of anyone’s mental state!). “Pain” occurs in only about 3.5% of Neurosynth studies. By contrast, the term “memory” occurs in about 16% of studies. One implication of this is that, if we know nothing at all about the pattern of brain activity reported in a given study, we should already expect that study to be about five times more likely to involve memory than pain. Of course, knowing something about the pattern of brain activity should change our estimate. In Bayesian terminology, we can say that our prior belief about the likelihood of different terms gets updated by the activity pattern we observe, producing somewhat more informed posterior estimates. For example, if the hippocampus and left inferior frontal gyrus are active, that should presumably increase our estimate of “memory” somewhat; conversely, if the periaqueductal gray, posterior insula, and dACC are all active, that should instead increase our estimate of “pain”.

In practice, the degree to which the data modulate our Neurosynth-based beliefs is not nearly as extreme as you might expect. In the first posterior probability map above (labeled “empirical prior”), what you can see are the posterior estimates for “pain” under the assumption that pain occurs in about 3.5% of all studies—which is the actual empirical frequency observed in the Neurosynth database. Notice that the very largest probabilities we ever see—located, incidentally, in the posterior insula, and not in the dACC—max out around 15 – 20%. This is not to be scoffed at; it means that observing activation in the posterior insula implies approximately a 5-fold increase in the likelihood of “pain” being present (relative to our empirical prior of 3.5%). Yet, in absolute terms, the probability of “pain” is still very low. Based on these data, no one in their right mind should, upon observing posterior insula activation (let alone dACC, where most voxels show a probability no higher than 10%), draw the reverse inference that pain is likely to be present.

To make it even clearer why this inference would be unsupportable, here are posterior probabilities for the same voxels as above, but now plotted for several other terms, in addition to pain:

Posterior probability maps (empirical prior assumed) for selected Neurosynth terms.
Posterior probability maps (empirical prior assumed) for selected Neurosynth terms.

Notice how, in the bottom map (for ‘motor’, which occurs in about 18% of all studies in Neurosynth), the posterior probabilities in all of dACC are substantially higher for than for ‘pain’, even though z-scores in most of dACC show the opposite pattern. For ‘working memory’ and ‘reward’, the posterior probabilities are in the same ballpark as for pain (mostly around 8 – 12%). And for ‘fear’, there are no voxels with posterior probabilities above 5% anywhere, because the empirical prior is so low (only 2% of Neurosynth studies).

What this means is that, if you observe activation in dACC—a region which shows large z-scores for “pain” and much lower ones for “motor”—your single best guess as to what process might be involved (of the five candidates in the above figure) should be ‘motor’ by a landslide. You could also guess ‘reward’ or ‘working memory’ with about the same probability as ‘pain’. Of course, the more general message you should take away from this is that it’s probably a bad idea to infer any particular process on the basis of observed activity, given how low the posterior probability estimates for most terms are going to be. Put simply, it’s a giant leap to go from these results—which clearly don’t license anyone to conclude that the dACC is a marker of any single process—to concluding that “the dACC is selective for pain” and that pain represents the best psychological characterization of dACC function.

As if this isn’t bad enough, we now need to add a further complication to the picture. The analysis above assumes we have a good prior for terms like “pain” and “memory”. In reality, we have no reason to think that the empirical estimates of term frequency we get out of Neurosynth are actually good reflections of the real world. For all we know, it could be that pain processing is actually 10 times as common as it appears to be in Neurosynth (i.e., that pain is severely underrepresented in fMRI studies relative to its occurrence in real-world human brains). If we use the empirical estimates from Neurosynth as our priors—with all of their massive between-term variation—then, as you saw above, the priors will tend to overwhelm our posteriors. In other words, no amount of activation in pain-related regions would ever lead us to conclude that a study is about a low-frequency term like pain rather than a high-frequency term like memory or vision.

For this reason, when I first built Neurosynth, my colleagues and I made the deliberate decision to impose a uniform (i.e., 50/50) prior on all terms displayed on the Neurosynth website. This approach greatly facilitates qualitative comparison of different terms; but it necessarily does so by artificially masking the enormous between-term variability in base rates. What this means is that when you see a posterior probability like 85% for pain in the dACC in the third row of the pain figure above, the right interpretation of this is “if you pretend that the prior likelihood of a study using the term pain is exactly 50%, then your posterior estimate after observing dACC activation should now be 85%”. Is this a faithful representation of reality? No. It most certainly isn’t. And in all likelihood, neither is the empirical prior of 3.5%. But the problem is, we have to do something; Bayes’ rule has to have priors to work with; it can’t just conjure into existence a conditional probability for a term (i.e., P(Term|Activation)) without knowing anything about its marginal probability  (i.e., P(Term)). Unfortunately, as you can see in the above figure, the variation in the posterior that’s attributable to the choice of prior will tend to swamp the variation that’s due to observed differences in brain activity.

The upshot is, if you come into a study thinking that ‘pain’ is 90% likely to be occurring, then Neurosynth is probably not going to give you much reason to revise that belief. Conversely, if your task involves strictly visual stimuli, and you know that there’s no sensory stimulation at all—so maybe you feel comfortable setting the prior on pain at 1%—then no pattern of activity you could possibly see is going to lead you to conclude that there’s a high probability of pain. This may not be very satisfying, but hey, that’s life.

The interesting thing about all this is that, no matter what prior you choose for any given term, the Neurosynth z-score will never change. That’s because the z-score is a frequentist measure of statistical association between term occurrence and voxel activation. All it tells us is that, if the null of no effect were true, the data we observe would be very unlikely. This may or may not be interesting (I would argue that it’s not, but that’s for a different post), but it certainly doesn’t license a reverse inference like “dACC activation suggests that pain is present”. To draw the latter claim, you have to use a Bayesian framework and pick some sensible priors. No priors, no reverse inference.

Now, as I noted in my last post, it’s important to maintain a pragmatic perspective. I’m obviously not suggesting that the z-score maps on Neurosynth are worthless. If one’s goal is just to draw weak qualitative inferences about brain-cognition relationships, I think it’s reasonable to use Neurosynth reverse inference z-score maps for that purpose. For better or worse, the vast majority of claims researchers make in cognitive neuroscience are not sufficiently quantitative that it makes much difference whether the probability of a particular term occurring given some observed pattern of activation is 24% or 58%. Personally, I would argue that this is to the detriment of the field; but regardless, the fact remains that if one’s goal is simply to say something like “we think that the temporoparietal junction is associated with biological motion and theory of mind,” or “evidence suggests that the parahippocampal cortex is associated with spatial navigation,” I don’t see anything wrong with basing that claim on Neurosynth z-score maps. In marked contrast, however, Neurosynth provides no license for saying much stronger things like “the dACC is selective for pain” or suggesting that one can make concrete reverse inferences about mental processes on the basis of observed patterns of brain activity. If the question we’re asking is what are we entitled to conclude about the presence of pain when we observed significant activation in the dACC in a particular study?, the simple answer is: almost nothing.

Let’s now reconsider L&E’s statement—and by extension, their entire argument for selectivity—in this light. L&E say that their goal is not to compare effect sizes for different terms, but rather “to see whether there [is] accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.” But what could this claim possibly mean, if not something like “we want to know whether it’s safe to infer the presence of pain given the presence of dACC activation?” How could this possibly be anything other than a statement about probability? Are L&E really saying that, given a sufficiently high z-score for dACC/pain, it would make no difference to them at all if the probability of pain given dACC activation was only 5%, even if there were plenty of other terms with much higher conditional probabilities? Do they expect us to believe that, in their 2003 social pain paper—where they drew a strong reverse inference that social pain shares mechanisms with physical pain based purely on observation of dACC activation (which, ironically, wasn’t even in pain-related areas of dACC—it would have made no difference to their conclusion even if they’d known conclusively that dACC activation actually only reflects pain processing 5% of the time? Such a claim is absurd on its face.

Let me summarize this section by making the following points about Neurosynth. First, it’s possible to obtain almost any posterior probability for any term given activation in any voxel, simply by adjusting the prior probability of term occurrence. Second, a choice about the prior must be made; there is no “default” setting (well, there is on the website, but that’s only because I’ve already made the choice for you). Third, the choice of prior will tend to dominate the posterior—which is to say, if you’re convinced that there’s a high (or low) prior probability that your study involves pain, then observing different patterns of brain activity will generally not do nearly as much as you might expect to change your conclusions. Fourth, this is not a Neurosynth problem, it’s a reality problem. The fundamental fact of the matter is that we simply do not know with any reasonable certainty, in any given context, what the prior probability of a particular process occuring in our subjects’ head is. Yet, without that, we have little basis for drawing any kind of reverse inference when we observe brain activity in a given study.

If all this makes you think, “oh, this seems like it would make it almost impossible in practice to draw meaningful reverse inferences in individual studies,” well, you’re not wrong.

L&E’s PNAS paper, and their reply to my last post, suggests that they don’t appreciate any of these points. The fact of the matter is that it’s impossible to draw any reverse inference about an individual study unless one is willing to talk about probabilities. L&E don’t seem to understand this, because if they did, they wouldn’t feel comfortable saying that they don’t care about effect sizes, and that z-scores provide adequate support for reverse inference claims. In fact, they wouldn’t feel comfortable making any claim about the dACC’s selectivity for pain relative to other terms on the basis of Neurosynth data.

I want to be clear that I don’t think L&E’s confusion about these issues is unusual. The reality is that many of these core statistical concepts—both frequentist and Bayesian—are easy to misunderstand, even for researchers who rely on them on a day-to-day basis. By no means am I excluding myself from this analysis; I still occasionally catch myself making similar slips when explaining what the z-scores and conditional probabilities in Neurosynth mean—and I’ve been thinking about these exact ideas in this exact context for a pretty long time! So I’m not criticizing L&E for failing to correctly understand reverse inference and its relation to Neurosynth. What I’m criticizing L&E for is writing an entire paper making extremely strong claims about functional selectivity based entirely on Neurosynth results, without ensuring that they understand the statistical underpinnings of the framework, and without soliciting feedback from anyone who might be in a position to correct their misconceptions. Personally, if I were in their position, I would move to retract the paper. But I have no control over that. All I can say is that it’s my informed opinion—as the creator of the software framework underlying all of L&E’s analyses—that the conclusions they draw in their paper are not remotely supported by any data that I’ve ever seen come out of Neurosynth.

On ‘strong’ vs. ‘weak’ selectivity

The other major problem with L&E’s paper, from my perspective, lies in their misuse of the term ‘selective’. In their response, L&E take issue with my criticism of their claim that they’ve shown the dACC to be selective for pain. They write:

Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word, with the strong form entailing further constraints on what constitutes an effect being selective. TY writes in his blog: “it’s one thing to use Neurosynth to support a loose claim like “some parts “¨of the dACC are preferentially associated with pain“, and quite another to claim that the dACC is selective for pain, that virtually nothing else activates dACC“. The last part there gets at what TY thinks we mean by selective and what we would call the strong form of selectivity.

L&E respectively define these strong and weak forms of selectivity as follows:

Selectivitystrong: The dACC is selective for pain, if pain and only pain activates the dACC.

Selectivityweak: The dACC is selective for pain, if pain is a more reliable source of dACC activation than the other terms of interest (executive, conflict, salience).

They suggest that I accused them of claiming ‘strong’ selectivity when they were really just making the much weaker claim that dACC activation is more strongly associated with dACC activation than with other terms. I disagree with this characterization. I’ll come back to what I meant by ‘selective’ in a bit (I certainly didn’t assume anything like L&E’s strong definition). But first, let’s talk about L&E’s ‘weak’ notion of selectivity, which in my view is at odds with any common-sense understanding of what ‘selective’ means, and would have an enormously destructive effect on the field if it were to become widely used.

The fundamental problem with the suggestion that we can say dACC is pain-selective if “it’s a more reliable source of dACC activation than the other terms of interest” is that this definition provides a free pass for researchers to make selectivity claims about an extremely large class of associations, simply by deciding what is or isn’t of interest in any given instance. L&E claim to be “interested” in executive control, conflict, and salience. This seems reasonable enough; after all, these are certainly candidate functions that people have discussed at length in the literature. The problem lies with all the functions L&E don’t seem to be interested in: e.g., fear, autonomic control, or reward—three other processes that many researchers have argued the dACC is crucially involved in, and that demonstrably show robust effects in dACC in Neurosynth. If we take L&E’s definition of weak selectivity at face value, we find ourselves in the rather odd position of saying that one can use Neurosynth to claim that a region is “selective” for a particular function just as long as it’s differentiable from some other very restricted set of functions. Worse still, one apparently does not have to justify the choice of comparison functions! In their PNAS paper, L&E never explain why they chose to focus only on three particular ACC accounts that don’t show robust activation in dACC in Neurosynth, and ignored several other common accounts that do show robust activation.

If you think this is a reasonable way to define selectivity, I have some very good news for you. I’ve come up with a list of other papers that someone could easily write (and, apparently, publish in a high-profile journal) based entirely on results you can obtain from the Neurosynth websites. The titles of these papers (and you could no doubt come up with many more) include:

  • “The TPJ is selective for theory of mind”
  • “The TPJ is selective for biological motion”
  • “The anterior insula is selective for inhibition”
  • “The anterior insula is selective for orthography”
  • “The VMPFC is selective for autobiographical memory”
  • “The VMPFC is selective for valuation”
  • “The VMPFC is selective for autonomic control”
  • “The dACC is selective for fear”
  • “The dACC is selective for autonomic control”
  • “The dACC is selective for reward”

These are all interesting-sounding articles that I’m sure would drum up considerable interest and controversy. And the great thing is, as long as you’re careful about what you find “interesting” (and you don’t have to explicitly explain yourself in the paper!), Neurosynth will happily support all of these conclusions. You just need to make sure not to include any comparison terms that don’t fit with your story. So, if you’re writing a paper about the VMPFC and valuation, make sure you don’t include autobiographical memory as a control. And if you’re writing about theory of mind in the TPJ, it’s probably best to not find biological motion interesting.

Now, you might find yourself thinking, “how could it make sense to have multiple people write different papers using Neurosynth, each one claiming that a given region is ‘selective’ for a variety of different processes? Wouldn’t that sort of contradict any common-sense understanding of what the term ‘selective’ means?” My own answer would be “yes, yes it would”. But L&E’s definition of “weak selectivity”—and the procedures they use in their paper—allow for multiple such papers to co-exist without any problem. Since what counts as an “interesting” comparison condition is subjective—and, if we take L&E’s PNAS example as a model, one doesn’t even need to explicitly justify the choices one makes—there’s really nothing stopping anyone from writing any of the papers I suggested above. Following L&E’s logic, a researcher who favored a fear-based account of dACC could simply select two or three alternative processes as comparison conditions—say, sustained attention and salience—do all of the same analyses L&E did (pretending for the moment that those analyses are valid, which they aren’t), and conclude that the dACC is selective for fear. It really is that easy.

In reality, I imagine that if L&E came across an article claiming that Neurosynth shows that the dACC is selective for fear, I doubt they’d say “well, I guess the dACC is selective for fear. Good to know.” I suspect they would (quite reasonably) take umbrage at the fear paper’s failure to include pain as a comparison condition in the analysis. Yet, by their own standards, they’d have no real basis for any complaint. The fear paper’s author could simply, say, “pain’s not interesting to me,” and that would be that. No further explanation necessary.

Perhaps out of recognition that there’s something a bit odd about their definition of selectivity, L&E try to prime our intuition that their usage is consistent with the rest of the field. They point out that, in most experimental fMRI studies claiming evidence for selectivity, researchers only ever compare the target stimulus or process to a small number of candidates. For example, they cite a Haxby commentary on a paper that studied category specificity in visual cortex:

From Haxby (2006): “numerous small spots of cortex were found that respond with very high selectivity to faces. However, these spots were intermixed with spots that responded with equally high selectivity to the other three categories.“

Their point is that nobody expects ‘selective’ here to mean that the voxel in question responds to only that visual category and no other stimulus that could conceivably have been presented. In practice, people take ‘selective’ to mean “showed a greater response to the target category than to other categories that were tested”.

I agree with L&E that Haxby’s usage of the term ‘selective’ here is completely uncontroversial. The problem is, the study in question is a lousy analogy for L&E’s PNAS paper. A much better analogy would be a study that presented 10 visual categories to participants, but then made a selectivity claim in the paper’s title on the basis of a comparison between the target category and only 2 other categories, with no explanation given for excluding the other 7 categories, even though (a) some of those 7 categories were well known to also be associated with the same brain region, and (b) strong activation in response to some of those excluded categories was clearly visible in a supplementary figure. I don’t know about L&E, but I’m pretty sure that, presented with such a paper, the vast majority of cognitive neuroscientists would want to say something like, “how can you seriously be arguing that this part of visual cortex responds selectively to spheres, when you only compared spheres with faces and houses in the main text, and your supplemental figure clearly shows that the same region responds strongly to cubes and pyramids as well? Shouldn’t you maybe be arguing that this is a region specialized for geometric objects, if anything?” And I doubt anyone would be very impressed if the authors’ response to this critique was “well, it doesn’t matter what else we’re not focusing on in the paper. We said this region is sphere-selective, which just means it’s more selective than a couple of other stimulus categories people have talked about. Pyramids and cubes are basically interchangeable with spheres, right? What more do you want from us?”

I think it’s clear that there’s no basis for making a claim like “the dACC is selective for pain” when one knows full well that at least half a dozen other candidate functions all reliably activate the dACC. As I noted in my original post, the claim is particularly egregious in this case, because it’s utterly trivial to generate a ranked list of associations for over 3,000 different terms in Neurosynth. So it’s not even as if one needs to think very carefully about which conditions to include in one’s experiment, or to spend a lot of time running computationally intensive analyses. L&E were clearly aware that a bunch of other terms also activated dACC; they briefly noted as much in the Discussion of their paper. What they didn’t explain is why this observation didn’t lead them to seriously revise their framing. Given what they knew, there were at least two alternative articles they could have written that wouldn’t have violated common sense understanding of what the term ‘selective’ means. One might have been titled something like “Heterogeneous aspects of dACC are preferentially associated with pain, autonomic control, fear, reward, negative affect, and conflict monitoring”. The other might have been titled “the dACC is preferentially associated with X-related processes”—where “X” is some higher-order characterization that explains why all of these particular processes (and not others) are activated in dACC. I have no idea whether either of these papers would have made it through peer review at PNAS (or any other journal), but at the very least they wouldn’t have been flatly contradicted by Neurosynth results.

To be fair to L&E, while they didn’t justify their exlcusion of terms like fear and autonomic control in the PNAS paper, they did provide some explanation in their reply to my last post. Here’s what they say:

TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.“ And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.

Essentially, their response is: “it didn’t make sense for us to include fear or autonomic control, because these functions are compatible with the underlying role we think the dACC is playing in pain”. This is not compelling, for three reasons. First, it’s a bait-and-switch. L&E’s paper isn’t titled “the dACC is selective for a family of distress-related processes”, it’s titled “the dACC is selective for pain“. One cannot publish a paper purporting to show that the dACC is selective for pain, and arguing that pain is the single best psychological characterization of its role in cognition, and then, in a section of their Discussion that they admit is the “most speculative” part of the paper, essentially say, “just kidding–we don’t think it’s really doing pain per se, we think it’s a much more general set of functions. But we don’t have any real evidence for that.”

Second, it’s highly uncharitable for L&E to spontaneously lump alternative accounts of dACC function like fear/avoidance, autonomic control, and bodily orientation in with their general “distress-related” account, because proponents of many alternative views of dACC function have been very explicit in saying that they don’t view these functions as fundamentally affective (e.g., Vogt and colleagues view posterior dACC as a premotor region). While L&E may themselves believe that pain, fear, and autonomic control in dACC all reflect some common function, that’s an extremely strong claim that requires independent evidence, and is not something that they’re entitled to simply assume. A perfectly sensible alternative is that these are actually dissociable functions with only partially overlapping spatial representations in dACC. Since the terms themselves are distinct in Neurosynth, that should be L&E’s operating assumption until they provide evidence for their stronger claim that there’s some underlying commonality. Nothing about this conclusion simply falls out of the data in advance.

Third, let me reiterate the point I made above about L&E’s notion of ‘weak selectivity’: if we take at face value L&E’s claim that fear and autonomic control don’t need to be explicitly considered because they could be interpreted alongside pain under a common account, then they’re effectively conceding that it would have made just as much sense to publish a paper titled “the dACC is selective for fear” or “the dACC is selective for autonomic control” that relegated the analysis of the term “pain” to a supplementary figure. In the paper’s body, you would find repeated assertions that the authors  have shown that autonomic control is the “best general psychological account of dACC function”. When pressed as to whether this was a reasonable conclusion, the authors would presumably defend their decision to ignore pain as a viable candidate by saying things like, “well, sure pain also activates the dACC; everyone knows that. But that’s totally consistent with our autonomic control account, because pain produces autonomic outputs! So we don’t need to consider that explicitly.”

I confess to some skepticism that L&E would simply accept such a conclusion without any objection.

Before moving on, let me come full circle and offer a definition of selectivity that I think is much more workable than either of the ones L&E propose, and is actually compatible with the way people use the term ‘selective’ more broadly in the field:

Selectivityrealistic: A brain region can be said to be ‘selective’ for a particular function if it (i) shows a robust association with that function, (ii) shows a negligible association with all other readily available alternatives, and (iii) the authors have done due diligence in ensuring that the major candidate functions proposed in the literature are well represented in their analysis.

Personally, I’m not in love with this definition. I think it still allows researchers to make claims that are far too strong in many cases. And it still allows for a fair amount of subjectivity in determining what gets to count as a suitable control—at least in experimental studies where researchers necessarily have to choose what kinds of conditions to include. But I think this definition is more or less in line with the way most cognitive neuroscientists expect each other to use the term. It captures the fact that most people would feel justifiably annoyed if someone reported a “selective” effect in one condition while failing to acknowledge that 4 other unreported conditions showed the same effect. And it also captures the notion that researchers should be charitable to each other: if I publish a paper claiming that the so-called fusiform ‘face’ area is actually selective for houses, based on a study that completely failed to include a face condition, no one is going to take my claim of house selectivity seriously. Instead, they’re going to conclude that I wasn’t legitimately engaging with other people’s views.

In the context of Neurosynth—where one has 3,000 individual terms or several hundred latent topics at their disposal—this definition makes it very clear that researchers who want to say that a region is selective for something have an obligation to examine the database comprehensively, and not just to cherry-pick a couple of terms for analysis. That is what I meant when I said that L&E need to show that “virtually nothing else activates dACC”. I wasn’t saying that they have to show that no other conceivable process reliably activates the dACC (which would be impossible, as they observe), but simply that they need to show that no non-synonymous terms in the Neurosynth database do. I stand by this assertion. I see no reason why anyone should accept a claim of selectivity based on Neurosynth data if just a minute or two of browsing the Neurosynth website provides clear-cut evidence that plenty of other terms also reliably activate the same region.

To sum up, nothing L&E say in their paper gives us any reason to think that the dACC is selective for pain (even if we were to ignore all the problems with their understanding of reverse inference and allow them to claim selectivity based on inappropriate statistical tests). I submit that no definition of ‘selective’ that respects common sense usage of the term, and is appropriately charitable to other researchers, could possibly have allowed L&E to conclude that dACC activity is “selective” for pain when they knew full well that fear, autonomic control, and reward all also reliably activated the dACC in Neurosynth.

Everything else

Having focused on what I view as the two overarching issues raised by L&E’s reply, I now turn to comprehensively addressing each of their specific claims. As I noted at the outset, I recognize this is going to make for slow reading. But I want to make sure I address L&E’s points clearly and comprehensively, as I feel that they blatantly mischaracterized what I said in my original post in many cases. I don’t actually recommend that anyone read this entire section linearly. I’m writing it primarily as a reference—so that if you think there were some good points L&E made in their reply to my original post, you can find those points by searching for the quote, and my response will be directly below.

Okay, let’s begin.

Tal Yarkoni (hereafter, TY), the creator of Neurosynth, has now posted a blog (here (link is external)) suggesting that pretty much all of our claims are either false, trivial, or already well-known. While this response was not unexpected, it’s disappointing because we love Neurosynth and think it’s a powerful tool for drawing exactly the kinds of conclusions we’ve drawn.

I’m surprised to hear that my response was not unexpected. This would seem to imply that L&E had some reason to worry that I wouldn’t approve of the way they were using Neurosynth, which leads me to wonder why they didn’t solicit my input ahead of time.

While TY is the creator of Neurosynth, we don’t think that means he has the last word when it comes to what is possible to do with it (nor does he make this claim). In the end, we think there may actually be a fair bit of agreement between us and TY. We do think that TY has misunderstood some of our claims (section 1 below) and failed to appreciate the significance and novelty of our actual claims (sections 2 and 4). TY also thinks we should have used different statistical analyses than we did, but his critique assumes we had a different question than the one we really had (section 5).

I agree that I don’t have the last word, and I encourage readers to consider both L&E’s arguments and mine dispassionately. I don’t, however, think that there’s a fair bit of agreement between us. Nor do I think I misunderstood L&E’s claim or failed to appreciate their significance or novelty. And, as I discuss at length both above and below, the problem is not that L&E are asking a different question than I think, it’s that they don’t understand that the methods they’re using simply can’t speak to the question they say they’re asking.

1. Misunderstandings (where we sort of probably agree)

We think a lot of the heat in TY’s blog comes from two main misunderstandings of what we were trying to accomplish. The good news (and we really hope it is good news) is that ultimately, we may actually mostly agree on both of these points once we get clear on what we mean. The two issues have to do with the use of the term “selective“ and then why we chose to focus on the four categories we did (pain, executive, conflict, salience) and not others like fear and autonomic.

Misunderstanding #1: Selectivity. Regarding the term selective, I suppose we could say there’s a strong form and a weak form of the word…

I’ve already addressed this in detail at the beginning of this post, so I’ll skip the next few paragraphs and pick up here:

We mean this in the same way that Haxby and lots of others do. We never give a technical definition of selectivity in our paper, though in the abstract we do characterize our results as follows:

“Results clearly indicated that the best psychological description of dACC function was related to pain processing—not executive, conflict, or salience processing.“

Thus, the context of what comparisons our selectivity refers to is given in the same sentence, right up front in the abstract. In the end, we would have been just as happy if “selectivity“ in the title was replaced with “preferentially activated“. We think this is what the weak form of selectivity entails and it is really what we meant. We stress again, we are not familiar with researchers who use the strong form of selectivity. TY’s blog is the first time we have encountered this and was not what we meant in the paper.

I strongly dispute L&E’s suggestion that the average reader will conclude from the above sentence that they’re clearly analyzing only 4 terms. Here’s the sentence in their abstract that directly precedes the one they quote:

Using Neurosynth, an automated brainmapping database [of over 10,000 functional MRI (fMRI) studies], we performed quantitative reverse inference analyses to explore the best general psychological account of the dACC function P(Ψ processjdACC activity).

It seems quite clear to me that the vast majority of readers are going to parse the title and abstract of L&E’s paper as implying a comprehensive analysis to find the best general psychological account of dACC function, and not “the best general psychological account if you only consider these 4 very specific candidates”. Indeed, I have trouble making any sense of the use of the terms “best” and “general” in this context, if what L&E meant was “a very restricted set of possibilities”. I’ll also note that in five minutes of searching the literature, I couldn’t find any other papers with titles or abstracts that make nearly as strong a claim about anterior cingulate function as L&E’s present claims about pain. So I reject the idea that their usage is par for the course. Still, I’m happy to give them the benefit of the doubt and accept that they truly didn’t realize that their wording might lead others to misinterpret their claims. I guess the good news is that, now that they’re aware of the potential confusion claims like this can cause, they will surely be much more circumspect in the titles and abstracts of their future papers.

Before moving on, we want to note that in TY’11 (i.e. the Yarkoni et al., 2011 paper announcing Neurosynth), the weak form of selectivity is used multiple times. In the caption for Figure 2, the authors refer to “regions in c were selectively associated with the term“ when as far as we can tell, they are talking only about the comparison of three terms (working memory, emotion, pain). Similarly on p. 667 the authors write “However, the reverse inference map instead implicated the anterior prefrontal cortex and posterior parietal cortex as the regions that were most selectively activated by working memory tasks.“ Here again, the comparison is to emotion and pain, and the authors are not claiming selectivity relative to all other psychological processes in the Neurosynth database. If it is fair for Haxby, Botvinick, and the eminent coauthors of TY’11 to use selectivity in this manner, we think it was fine for us as well.

I reject the implication of equivalence here. I think the scope of the selectivity claim I made in the figure caption in question is abundantly clear from the immediate context, and provides essentially no room for ambiguity. Who would expect, in a figure with 3 different maps, the term ‘selective’ to mean anything other than ‘for this one and not those two’? I mean, if L&E had titled their paper “pain preferentially activates the dACC relative to conflict, salience, or executive control”, and avoided saying that they were proposing the “best general account” of psychological function in dACC, I wouldn’t have taken issue with their use of the term ‘selective’ in their manuscript either, because the scope would have been equally clear. Conversely, if I had titled my 2011 paper “the dACC shows no selectivity for any cognitive process”, and said, in the abstract, something like “we show that there is no best general psychological function of the dACC–not pain, working memory, or emotion”, I would have fully expected to receive scorn from others.

That said, I’m willing to put my money where my mouth is. If a few people (say 5) write in to say (in the comments below, on twitter, or by email) that they took the caption in Figure 2 of my 2011 paper to mean anything other than “of these 3 terms, only this one showed an effect”, I’ll happily send the journal a correction. And perhaps, L&E could respond in kind by commiting to changing the title of their manuscript to something like “the dACC is preferentially active for pain relative to conflict, salience or executive control” if 5 people write in to say that they interpreted L&E’s claims as being much more global than L&E suggest they are. I encourage readers to use the comments below to clarify how they understood both of these selectivity claims.

We would also point readers to the fullest characterization of the implication of our results on p. 15253 of the article:

“The conclusion from the Neurosynth reverse inference maps is unequivocal: The dACC is involved in pain processing. When only forward inference data were available, it was reasonable to make the claim that perhaps dACC was not involved in pain per se, but that pain processing could be reduced to the dACC’s “real“ function, such as executive processes, conflict detection, or salience responses to painful stimuli. The reverse inference maps do not support any of these accounts that attempt to reduce pain to more generic cognitive processes.“

We think this claim is fully defensible and nothing in TY’s blog contradicts this. Indeed, he might even agree with it.

This claim does indeed seem to me largely unobjectionable. However, I’m at a loss to understand how the reader is supposed to know that this one very modest sentence represents “the fullest characterization” of the results in a paper replete with much stronger assertions. Is the reader supposed to, upon reading this sentence, retroactively ignore all of the other claims—e.g., the title itself, and L&E’s repeated claim throughout the paper that “the best psychological interpretation of dACC activity is in terms of pain processes”?

*Misunderstanding #2: We did not focus on fear, emotion, and autonomic accounts*. TY criticizes us several times for not focusing on other accounts of the dACC including fear, emotion, and autonomic processes. We agree with TY that these kind of processes are relevant to dACC function. Indeed, we were writing about the affective functions of dACC (Eisenberger & Lieberman, 2004) when the rest of the field was saying that the dACC was purely for cognitive processes (Bush, Luu, & Posner, 2000). We have long posited that one of the functions of the dACC was to sound an alarm when certain kinds of conflict arise. We think the dACC is evoked by a variety of distress-related processes including pain, fear, and anxiety. As Eisenberger (2015) wrote: “Interestingly, the consistency with which the dACC is linked with fear and anxiety is not at odds with a role for this region in physical and social pain, as threats of physical and social pain are key elicitors of fear and anxiety.“ And the outputs of this alarm process are partially autonomic in nature. Thus, we don’t think of fear and autonomic accounts as in opposition to the pain account, but rather in the same family of explanations. We think this class of dACC explanations stands in contrast to the cognitive explanations that we did compare to (executive, conflict, salience). Most of this, and what is said below, is discussed in Naomi Eisenberger’s (2015) Annual Review chapter.

I addressed this in detail above, in the section on “selectivity”.

We speak to some but not all of this in the paper. On p. 15254, we revisit our neural alarm account and write “Distress-related emotions (“negative affect“ “distress“ “fear“) were each linked to a dACC cluster, albeit much smaller than the one associated with “pain“.“ While we could have said more explicitly that pain is in this distress-related category, we have written about this several times before and assumed this would be understood by readers.

There is absolutely no justification for assuming this. The community of people who might find a paper titled “the dorsal anterior cingulate cortex is selective for pain” interesting is surely at least an order of magnitude larger than the community of people who are familiar with L&E’s previous work on distress-related emotions.

So why did we focus on executive, conflict, and salience? Like most researchers, we are the products of our early (academic) environment. When we were first publishing on social pain, we were confused by the standard account of dACC function. A half century of lesion data and a decade of fMRI studies of pain pointed towards more evidence of the dACC’s involvement in distress-related emotions (pain & anxiety), yet every new paper about the dACC’s function described it in cognitive terms. These cognitive papers either ignored all of the pain and distress findings for dACC or they would redescribe pain findings as reducible to or just an instance of something more cognitive.

When we published our first social pain paper, the first rebuttal paper suggested our effects were really just due to “expectancy violation“ (Somerville et al., 2006), an account that was later invalidated (Kawamoto 2012). Many other cognitive accounts have also taken this approach to physical pain (Price 2000; Vogt, Derbyshire, & Jones, 2006).

Thus for us, the alternative to pain accounts of dACC all these years were conflict detection and cognitive control explanations. This led to the focus on the executive and conflict-related terms. In more recent years, several papers have attempted to explain away pain responses in the dACC as nothing more than salience processes (e.g Iannetti’s group) that have nothing to do with pain, and so salience became a natural comparison as well. We haven’t been besieged with papers saying that pain responses in the dACC are “nothing but“ fear or “nothing but“ autonomic processes, so those weren’t the focus of our analyses.

This is a informative explanation of L&E’s worldview and motivations. But it doesn’t justify ignoring numerous alternative accounts whose proponents very clearly don’t agree with L&E that their views can be explained away as “distress-related”. If L&E had written a paper titled “salience is not a good explanation of dACC function,” I would have happily agreed with their conclusion here. But they didn’t. They wrote a paper explicitly asserting that pain is the best psychological characterization of the dACC. They’re not entitled to conclude this unless they compare pain properly with a comprehensive set of other possible candidates—not just the ones that make pain look favorable.

We want to comment further on fear specifically. We think one of the main reasons that fear shows up in the dACC is because so many studies of fear use pain manipulations (i.e. shock administration) in the process of conditioning fear responses. This is yet another reason that we were not interested in contrasting pain and fear maps. That said, if we do compare the Z-scores in the same eight locations we used in the PNAS paper, the pain effect has more accumulated evidence than fear in all seven locations where there is any evidence for pain at all.

This is a completely speculative account, and no evidence is provided for it. Worse, it’s completely invertible: one could just as easily say that pain shows up in the dACC because it invariably produces fear, or because it invariably elicits autonomic changes (frankly, it seems more plausible to me that pain almost always generates fear than that fear is almost always elicited by pain). There’s no basis for ruling out these other candidate functions a priori as being more causally important. This is simply question-begging.

Its interesting to us that TY does not in principle seem to like us trying to generate some kind of unitary account of dACC writing “There’s no reason why nature should respect our human desire for simple, interpretable models of brain function.“ Yet, TY then goes on to offer a unitary account more to his liking. He highlights Vogt’s “four-region“ model of the cingulate writing “I’m especially partial to the work of Brent Vogt“¦“. In Vogt’s model, the aMCC appears to be largely the same region as what we are calling dACC. Although the figure shown by TY doesn’t provide anatomical precision, in other images, Vogt shows the regions with anatomical boundaries. Rotge et al. (2015) used such an image from Vogt (2009) to estimate the boundaries of aMCC as spanning 4.5 ≤ y ≤ 30 which is very similar to our dACC anterior/posterior boundaries of 0 ≤ y ≤ 30) (see Figure below). Vogt ascribes the function of avoidance behavior to this region – a pretty unitary description of the region that TY thinks we should avoid unitary descriptions of.

There is no charitable way to put it: this is nothing short of a gross misrepresentation of what I said about the Vogt account. As a reminder, here’s what I actually wrote in my post:

I’m especially partial to the work of Brent Vogt and colleagues (e.g., Vogt (2005); Vogt & Sikes, 2009), who have suggested a division within the anterior mid-cingulate cortex (aMCC; a region roughly co-extensive with the dACC in L&E’s nomenclature) between a posterior region involved in bodily orienting, and an anterior region associated with fear and avoidance behavior (though the two functions overlap in space to a considerable degree) … the Vogt characterization of dACC/aMCC … fits almost seamlessly with the Neurosynth results displayed above (e.g., we find MCC activation associated with pain, fear, autonomic, and sensorimotor processes, with pain and fear overlapping closely in aMCC). Perhaps most importantly, Vogt and colleagues freely acknowledge that their model—despite having a very rich neuroanatomical elaboration—is only an approximation. They don’t attempt to ascribe a unitary role to aMCC or dACC, and they explicitly recognize that there are distinct populations of neurons involved in reward processing, response selection, value learning, and other aspects of emotion and cognition all closely interdigitated with populations involved in aspects of pain, touch, and fear. Other systems-level neuroanatomical models of cingulate function share this respect for the complexity of the underlying circuitry—complexity that cannot be adequately approximated by labeling the dACC simply as a pain region (or, for that matter, a “survival-relevance“ region).

I have no idea how L&E read this and concluded that I was arguing that we should simply replace the label “pain” with “fear”. I don’t feel the need to belabor the point further, because I think what I wrote is quite clear.

In the end though, if TY prefers a fear story to our pain story, we think there is some evidence for both of these (a point we make in our PNAS paper). We think they are in a class of processes that overlap both conceptually (i.e. distress-related emotions) and methodologically (i.e. many fear studies use pain manipulations to condition fear).

No, I don’t prefer a fear story. My view (which should be abundantly clear from the above quote) is that both a fear story and a pain story would be gross oversimplifications that shed more heat than light. I will, however, reiterate my earlier point (which L&E never responded to), which is that their PNAS paper provides no reason at all to think that the dACC is involved in distress-related emotion (indeed, they explicitly said that this was the most speculative part of the paper). If anything, the absence of robust dACC activation for terms like ‘disgust’, ’emotion’, and ‘social’ would seem to me like pretty strong evidence against a simplistic model of this kind. I’m not sure why L&E are so resistant to the idea that maybe, just maybe, the dACC is just too big a region to attach a single simple label to. As far as I can tell, they provide no defense of this assumption in either their paper or their reply.

After focusing on potential misunderstandings we want to turn to our first disagreement with TY. Near the end of his blog, TY surprised us by writing that the following conclusions can be reasonably drawn from Neurosynth analyses:

* “There are parts of dACC (particularly the more posterior aspects) that are preferentially activated in studies involving painful stimulation.“
* “It’s likely that parts of dACC play a greater role in some aspect of pain processing than in many other candidate processes that at various times have been attributed to dACC (e.g., monitoring for cognitive conflict)“

Our first response was “˜Wow. After pages and pages of criticizing our paper, TY pretty much agrees with what we take to be the major claims of our paper. Yes, his version is slightly watered down from what we were claiming, but these are definitely in the ballpark of what we believe.’

L&E omitted my third bullet point here, which was that “Many of the same regions of dACC that preferentially activate during pain are also preferentially activated by other processes or tasks—e.g., fear conditioning, autonomic arousal, etc.” I’m not sure why they left it out; they could hardly disagree with it either, if they want to stand by their definition of “weak selectivity”.

I’ll leave it to you to decide whether or not my conclusions are really just “watered down” versions “in the ballpark” of the major claims L&E make in their paper.

But then TY’s next statement surprised us in a different sort of way. He wrote

“I think these are all interesting and potentially important observations. They’re hardly novel“¦“.

We’ve been studying the dACC for more than a decade and wondered what he might have meant by this. We can think of two alternatives for what he might have meant:

* That L&E and a small handful of others have made this claim for over a decade (but clearly not with the kind of evidence that Neurosynth provides).

* That TY already used Neurosynth in 2011 to show this. In the blog, he refers to this paper writing “We explicitly noted that there is preferential activation for pain in dACC“.

I’m not sure what was confusing about what I wrote. Let’s walk through the three bullet points. The first one is clearly not novel. We’ve known for many years that many parts of dACC are preferentially active when people experience painful stimulation. As I noted in my last post, L&E explicitly appealed to this literature over a decade ago in their 2003 social pain paper. The second one is also clearly not novel. For example, Vogt and colleagues (among others) have been arguing for at least two decades now that the posterior aspects of dACC support pain processing in virtue of their involvement in processes (e.g., bodily orientation) that clearly preclude most higher cognitive accounts of dACC. The third claim isn’t novel either, as there has been ample evidence for at least a decade now that virtually every part of dACC that responds to painful stimulation also systematically responds to other non-nociceptive stimuli (e.g., the posterior dACC responds to non-painful touch, the anterior to reward, etc.). I pointed to articles and textbooks comprehensively reviewing this literature in my last post. So I don’t understand L&E’s surprise. Which of these three claims do they think is actually novel to their paper?

In either case, “they’re hardly novel“ implies this is old news and that everyone knows and believes this, as if we’re claiming to have discovered that most people have two eyes, a nose, and a mouth. But this implication could not be further from the truth.

No, that’s not what “hardly novel” implies. I think it’s fair to say that the claim that social pain is represented in the dACC in virtue of representations shared with physical pain is also hardly novel at this point, yet few people appear to know and believe it. I take ‘hardly novel’ to mean “it’s been said before multiple times in the published literature.”

There is a 20+ year history of researchers ignoring or explaining away the role of pain processing in dACC.

I’ll address the “explained away” part of this claim below, but it’s completely absurd to suggest that researchers have ignored the role of pain processing in dACC for 20 years. I don’t think I can do any better than link to Google Scholar, where the reader is invited to browse literally hundreds of articles that all take it as an established finding that the dACC is important for pain processing (and many of which have hundreds of citations from other articles).

When pain effects are mentioned in most papers about the function of dACC, it is usually to say something along the lines of “˜Pain effects in the dACC are just one manifestation of the broader cognitive function of conflict detection (or salience or executive processes)’. This long history is indisputable. Here are just a few examples (and these are all reasonable accounts of dACC function in the absence of reverse inference data):

* Executive account: Price’s 2000 Science paper on the neural mechanisms of pain assigns to the dACC the roles of “directing attention and assigning response priorities“
* Executive account: Vogt et al. (1996) says the dACC “is not a “˜pain centre’“ and “is involved in response selection“ and “response inhibition or visual guidance of responses“
* Conflict account: Botvinick et al. (2004) wrote that “the ACC might serve to detect events or internal states indicating a need to shift the focus of attention or strengthen top-down control ([4], see also [20]), an idea consistent, for example, with the fact that the ACC responds to pain “ (Botvinick et al. 2004)
* Salience account: Iannetti suggests the “˜pain matrix’ is a myth and in Legrain et al. (2011) suggests that the dACC’s responses to pain “could mainly reflect brain processes that are not directly related to the emergence of pain and that can be engaged by sensory inputs that do not originate from the activation of nociceptors.“

I’m not really sure what to make of this argument either. All of these examples clearly show that even proponents of other theories of dACC function are well aware of the association with pain, and don’t dispute it in any way. So L&E’s objection can’t be that other people just don’t believe that the dACC supports pain processing. Instead, L&E seem to dislike the idea that other theorists have tried to “explain away” the role of dACC in pain by appealing to other mechanisms. Frankly, I’m not sure what the alternative to such an approach could possibly be. Unless L&E are arguing that dACC is the neural basis of an integrated, holistic pain experience (whatever such a thing might mean), there presumably must be some specific computational operations going on within dACC that can be ascribed a sensible mechanistic function. I mean, even L&E themselves don’t take the dACC to be just about, well, pain. Their whole “distress-related emotion” story is itself intended to explain what it is that dACC actually does in relation to pain (since pretty much everyone accepts that the sensory aspects of pain aren’t coded in dACC).

The only way I can make sense of this “explained away” concern is if what L&E are actually objecting to is the fact that other researchers have disagreed or ignored their particular story about what the dACC does in pain—i.e., L&E’s view that the dACC role in pain is derived from distress-related emotion. As best I can tell, what bothers them is that other researchers fundamentally disagree with–and hence, don’t cite–their “distress-related emotion” account. Now, maybe this irritation is justified, and there’s actually an enormous amount of evidence out there in favor of the distress account that other researchers are willfully ignoring. I’m not qualified to speak to that (though I’m skeptical). What I do feel qualified to say is that none of the Neurosynth results L&E present in their paper make any kind of case for an affective account of pain processing in dACC. The most straightforward piece of evidence for that claim would be if there were a strong overlap between pain and negative affect activations in dACC. But we just don’t see this in Neurosynth. As L&E themselves acknowledge, the peak sectors of pain-related activation in dACC are in mid-to-posterior dACC, and affect-related terms only seem to reliably activate the most anterior aspects.

To be charitable to L&E, I do want to acknowledge one valuable point that they contribute here, which is that it’s clear that dACC function cannot be comprehensively explained by, say, a salience account or a conflict monitoring account. I think that’s a nice point (though I gather that some people who know much more about anatomy than I do are in the process of writing rebuttals to L&E that argue it’s not as nice as I think it is). The problem is, this argument can be run both ways. Meaning, much as L&E do a nice job showing that conflict monitoring almost certainly can’t explain activations in posterior dACC, the very maps they show make it clear that pain can’t explain all the other activations in anterior dACC (for reward, emotion, etc.). Personally, I think the sensible conclusion one ought to take away from all this is “it’s really complicated, and we’re not going to be able to neatly explain away all of dACC function with a single tidy label like ‘pain’.” L&E draw a different conclusion.

But perhaps this approach to dACC function has changed in light of TY’11 findings (i.e. Yarkoni et al. 2011). There he wrote “For pain, the regions of maximal pain-related activation in the insula and DACC shifted from anterior foci in the forward analysis to posterior ones in the reverse analysis.“ This hardly sounds like a resounding call for a different understanding of dACC that involves an appreciation of its preferential involvement in pain.

Right. It wasn’t a resounding call for a different understanding of dACC, because it wasn’t a paper about the dACC—a brain region I lack any deep interest in or knowledge of—it was a paper about Neurosynth and reverse inference.

Here are quotes from other papers showing how they view the dACC in light of TY’11:

* Poldrack (2012) “The striking insight to come from analyses of this database (Yarkoni et al., in press) is that some regions (e.g., anterior cingulate) can show high degrees of activation in forward inference maps, yet be of almost no use for reverse inference due to their very high base rates of activation across studies“
* Chang, Yarkoni et al. (2012) “the ACC tends to show substantially higher rates of activation than other regions in neuroimaging studies (Duncan and Owen 2000; Nelson et al. 2010; Yarkoni et al. 2011), which has lead some to conclude that the network is processing goal-directed cognition (Yarkoni et al. 2009)“
* Atlas & Wager (2012) “In fact, the regions that are reliably modulated (insula, cingulate, and thalamus) are actually not specific to pain perception, as they are activated by a number of processes such as interoception, conflict, negative affect, and response inhibition“

I won’t speak for papers I’m not an author on, but with respect to the quote from the Chang et al paper, I’m not sure what L&E’s point actually is. In Yarkoni et al. (2009), I argued that “effort” might be a reasonable generic way to characterize the ubiquitous role of the frontoparietal “task-positive” network in cognition. I mistakenly called the region in question ‘dACC’ when I should have said ‘preSMA’. I already gave L&E deserved credit in my last post for correcting my poor knowledge of anatomy. But I would think that, if anything, the fact that I was routinely confusing these terms circa 2011 should lead L&E to conclude that maybe I don’t know or care very much about the dACC, and not that I’m a proud advocate for a strong theory of dACC function that many other researchers also subscribe to. I think L&E give me far too much credit if they think that my understanding of the dACC in 2011 (or, for that matter, now) is somehow representative of the opinions of experts who study that region.

Perhaps the reason why people who cite TY’11 in their discussion of dACC didn’t pay much attention to the above quote from TY’11 (““For pain, the regions of maximal pain-related“¦“) was because they read and endorsed the following more direct conclusion that followed ““¦because the dACC is activated consistently in all of these states [cognitive control, pain, emotion], its activation may not be diagnostic of any one of them“ (bracketed text added). If this last quote is taken as TY’11’s global statement regarding dACC function, then it strikes us still as quite novel to assert that the dACC is more consistently associated with one category of processes (pain) than others (executive, conflict, and salience processes).

I don’t think TY’11 makes any ‘global statement regarding dACC function’, because TY’11 was a methodological paper about the nature of reverse inference, not a paper about grand models of dACC function. As for the quote L&E reproduce, here’s the full context:

These results showed that without the ability to distinguish consistency from selectivity, neuroimaging data can produce misleading inferences. For instance, neglecting the high base rate of DACC activity might lead researchers in the areas of cognitive control, pain and emotion to conclude that the DACC has a key role in each domain. Instead, because the DACC is activated consistently in all of these states, its activation may not be diagnostic of any one of them and conversely, might even predict their absence. The NeuroSynth framework can potentially address this problem by enabling researchers to conduct quantitative reverse inference on a large scale.

I stand by everything I said here, and I’m not sure what L&E object to. It’s demonstrably true if you look at Figure 2 in TY’11 that pain, emotion, and cognitive control all robustly activate the dACC in the forward inference map, but not in the reverse inference maps. The only sense I can make of L&E’s comment is if they’re once again conflating z-scores with probabilities, and assuming that the presence of significant activation for pain means that dACC is in fact diagnostic for pain. But, as I showed much earlier in this post, that would betray very deep misunderstanding of what the reverse inference maps generated by Neurosynth mean. There is absolutely no basis for concluding, in any individual study, that people are likely to be perceiving pain just because the dACC is active.

In the article, we showed forward and reverse inference maps for 21 terms and then another 9 in the supplemental materials. These are already crowded busy figures and so we didn’t have room to show multiple slices for each term. Fortunately, since Neurosynth is easily accessible (go check it out now at neurosynth.org ““ its awesome!) you can look at anything we didn’t show you in the paper. Tal takes us to task for this.

He then shows a bunch of maps from x=-8 to x=+8 on a variety of terms. Many of these terms weren’t the focus of our paper because we think they are in the same class of processes as pain (as noted above). So it’s no surprise to us that terms such as “˜fear,’ “˜empathy,’ and “˜autonomic’ produce dACC reverse inference effects. In the paper, we reported that “˜reward’ does indeed produce reverse inference effects in the anterior portion of the dACC (and show the figure in the supplemental materials), so no surprise there either. Then at the bottom he shows cognitive control, conflict, and inhibition which all show very modest footprints in dACC proper, as we report in the paper.

Once again: L&E are not entitled to exclude a large group of viable candidate functions from their analysis simply because they believe that they’re “in the same class of [distress-related affect] processes” (a claim that many people, including me, would dispute). If proponents of the salience monitoring view wrote a Neurosynth-based paper neglecting to compare salience with pain because “pain is always salient, so it’s in the same class of salience-related processes”, I expect that L&E would not be very happy about it. They should show others the same charity they themselves would expect.

But in any case, if it’s not surprising to L&E that reward, fear, and autonomic control all activate the dACC, then I’m at a loss to understand why they didn’t title the paper something like “the dACC is selectively involved in pain, reward, fear, and autonomic control”. That would have much more accurately represented the results they report, and would be fully consistent with their notion of “weak selectivity”.

There are two things that make the comparison of what he shows and what we reported in the paper not a fair comparison. First, his maps are thresholded at p<.001 and yet all the maps that we report use Neurosynth’s standard, more conservative, FDR criterion of p<.01 (a standard TY literally set). Here, TY is making a biased, apples-to-oranges comparison by juxtaposing the maps at a much more liberal threshold than what we did. Given that each of the terms we were interested in (pain, executive, conflict, salience) had more than 200 studies in the database its not clear why TY moved from FDR to uncorrected maps here.

The reason I used a threshold of p < .001 for this analysis is because it’s what L&E themselves used:

In addition, we used a threshold of Z > 3.1, P < 0.001 as our threshold for indicating significance. This threshold was chosen instead of Neurosynth’s more strict false discovery rate (FDR) correction to maximize the opportunity for multiple psychological terms to “claim“ the dACC.

This is a sensible thing to do here, because L&E are trying to accept the null of no effect (or at least, it’s more sensible than applying a standard, conservative correction). Accepting the null hypothesis because an effect fails to achieve significance is the cardinal sin of null hypothesis significance testing, so there’s no real justification for doing what L&E are trying to do. But if you are going to accept the null, it at least behooves you to use a very liberal threshold for your analysis. I’m not sure why it’s okay for L&E to use a threshold of p < .001 but not for me to do the same (and for what it’s worth, I think p < .001 is still an absurdly conservative cut-off given the context).

Second, the Neurosynth database has been updated since we did our analyses. The number of studies in the database has only increased by about 5% (from 10,903 to 11,406 studies) and yet there are some curious changes. For instance, fear shows more robust dACC now than it did a few months ago even though it only increased from 272 studies to 298 studies.

Although the number of studies has nominally increased by only 5%, this actually reflects the removal of around 1,000 studies as a result of newer quality control heuristics, and the addition of around 1,500 new studies. So it should not be surprising if there are meaningful differences between the two. In any case, it seems odd for L&E to use the discrepancy between old and new versions of the database as a defense of their findings, given that the newer results are bound to be more accurate. If L&E accept that there’s a discrepancy, perhaps what they should be saying is “okay, since we used poorer data for our analyses than what Neurosynth currently contains, we should probably re-run our analyses and revise our conclusions accordingly”.

We were more surprised to discover that the term “˜rejection’ has been removed from the Neurosynth database altogether such that it can no longer be used as a term to generate forward and reverse inference maps (even though it was in the database prior to the latest update).

This claim is both incorrect and mildly insulting. It’s incorrect because the term “rejection” hasn’t been in the online Neurosynth database for nearly two years, and was actually removed three updates ago. And it’s mildly insulting, because all L&E had to do to verify the date at which rejection was removed, as well as understand why, was visit the Neurosynth data repository and inspect the different data releases. Failing that, they could have simply asked me for an explanation, instead of intimating that there are “curious” changes. So let me take this opportunity to remind L&E and other readers that the data displayed on the Neurosynth website are always archived on GitHub. If you don’t like what’s on the website at any given moment, you can always reconstruct the database based on an earlier snapshot. This can be done in just a few lines of Python code, as the IPython notebook I linked to last time illustrates.

As to why the term “rejection” disappeared: in April 2014, I switched from a manually curated set of 525 terms (which I had basically picked entirely subjectively) to the more comprehensive and principled approach of including all terms that passed a minimum frequency threshold (i.e., showing up in at least 60 unique article abstracts). The term “rejection” was not frequent enough to survive. I don’t make decisions about individual terms on a case-by-case basis (well, not since April 2014, anyway), and I certainly hope L&E weren’t implying that I pulled the ‘rejection’ term in response to their paper or any of their other work, because, frankly, they would be giving themselves entirely too much credit.

Anyway, since L&E seem concerned with the removal of ‘rejection’ from Neurosynth, I’m happy to rectify that for them. Here are two maps for the term “rejection” (both thresholded at voxel-wise p < .001, uncorrected):

Meta-analysis of "rejection" in Neurosynth (database version of May 2013 ).
Meta-analysis of “rejection” in Neurosynth (database version of May 2013, 33 studies).
Meta-analysis of "rejection" in Neurosynth (current database version, 58 studies).
Meta-analysis of “rejection” in Neurosynth (current database version, 58 studies).

The first map is from the last public release (March 2013) that included “rejection” as a feature, and is probably what L&E remember seeing on the website (though, again, it hasn’t been online since 2014). It’s based on 33 studies. The second map is the current version of the map, based on 52 studies. The main conclusion I personally would take away from both of these maps is that there’s not enough data here to say anything meaningful, because they’re both quite noisy and based on a small number of studies. This is exactly why I impose a frequency cut-off for all terms I put online.

That said, if L&E would like to treat these “rejection” analyses as admissible evidence, I think it’s pretty clear that these maps actually weigh directly against their argument. In both cases, we see activation in pain-related areas of dACC for the forward inference analysis but not for the reverse. Interestingly, we do see activation in the most anterior part of dACC in both cases. This seems to me entirely consistent with the argument many people have made that subjective representations of emotion (including social pain) are to be found primarily in anterior medial frontal cortex, and that posterior dACC activations for pain have much more to do with motor control, response selection, and fear than with anything affective.

Given that Neurosynth is practically a public utility and federally funded, it would be valuable to know more about the specific procedures that determine which journals and articles are added to the database and on what schedule. Also, what are the conditions that can lead to terms being removed from the database and what are the set of terms that were once included that have now been removed.

I appreciate L&E’s vote of confidence (indeed, I wish that I believed Neurosynth could do half of what they claim it can do). As I’ve repeatedly said in this post and the last one, I’m happy to answer any questions L&E have about Neurosynth methods (preferably on the mailing list, which is publicly archived and searchable). But to date, they haven’t asked me any. I’ll also reiterate that it would behoove L&E to check the data repository on GitHub (which is linked to from the neurosynth.org portal) before they conclude that the information they want isn’t already publicly accessible (because most of it is).

In any event, we did not cherry pick data. We used the data that was available to us as of June 2015 when we wrote the paper. For the four topics of interest, below we provide more representative views of the dACC, thresholded as typical Neurosynth maps are, at FDR p<.01. We’ve made the maps nice and big so you can see the details and have marked in green the dACC region on the different slices (the coronal slice are at y=14 and y=22). When you look at these, we think they tell the same story we told in the paper.

I’m not sure what the point here is. I was not suggesting that L&E were lying; I was arguing that (a) visual inspection of a few slices is no way to make a strong argument about selectivity; (b) the kinds of analyses L&E report are a statistically invalid way to draw the conclusion they are trying to draw, and (c) even if we (inappropriately) use L&E’s criteria, analyses done with more current data clearly demonstrate the presence of plenty of effects for terms other than pain. L&E dispute the first two points (which we’ll come back to), but they don’t seem to contest the last. This seems to me like it should lead L&E to the logical conclusion that they should change their conclusions, since newer and better data are now available that clearly produce different results given the same assumptions.

(I do want to be clear again that I don’t condone L&E’s analyses, which I show above and below in detail simply don’t support their conclusions. I was simply pointing out that even by their own criteria, Neurosynth results don’t support their claims.)

4. Surprising lack of appreciation for what the reverse inference maps show in pretty straightforward manner.

Let’s start with pain and salience. Iannetti and his colleagues have made quite a bit of hay the last few years saying that the dACC is not involved in pain, but rather codes for salience. One of us has critiqued the methods of this work elsewhere (Eisenberger, 2015, Annual Review). The reverse inference maps above show widespread robust reverse inference effects throughout the dACC for pain and not a single voxel for salience. When we ran this initially for the paper, there were 222 studies tagged for the term salience and now that number is up to 269 and the effects are the same.

Should our tentative conclusion be that we should hold off judgment until there is more evidence? TY thinks so: “If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available.“ This would be reasonable if we were talking about topics with 10 or 15 studies in the database. But, there are 269 studies for the term salience and yet there is nothing in the dACC reverse inference maps. I can’t think of anyone who has ever run a meta-analysis of anything with 250 studies, found no accumulated evidence for an effect and then said “we should withhold judgment until more data is available“.

This is another gross misrepresentation of what I said in my commentary. So let me quote what  I actually said. Here’s the context:

While it’s true that terms with fewer associated studies will have more variable (i.e., extreme) posterior probability estimates, this is an unavoidable problem that isn’t in any way remedied by focusing on z-scores instead of posterior probabilities. If some terms have too few studies in Neurosynth to support reliable comparisons with pain, the appropriate thing to do is to withhold judgment until more data is available. One cannot solve the problem of data insufficiency by pretending that p-values or z-scores are measures of effect size.

This is pretty close to the textbook definition of “quoting out of context”. It should be abundantly clear that I was not saying that L&E shouldn’t interpret results from a Neurosynth meta-analysis of 250 studies (which would be absurd). The point of the above quote was that if L&E don’t like the result they get when they conduct meta-analytic comparisons properly with Neurosynth, they’re not entitled to replace the analysis with a statistically invalid procedure that does give results they like.

TY and his collaborators have criticized researchers in major media outlets (e.g. New York Times) for poor reverse inference ““ for drawing invalid reverse inference conclusions from forward inference data. The analyses we presented suggest that claims about salience and the dACC are also based on unfounded reverse inference claims. One would assume that TY and his collaborators are readying a statement to criticize the salience researchers in the same way they have previously.

This is another absurd, and frankly insulting, comparison. My colleagues and I have criticized people for saying that insula activation is evidence that people are in love with their iPhones. I certainly hope that this is in a completely different league from inferring that people must be experiencing pain if the dACC is activated (because if not, some of L&E’s previous work would appear to be absurd on its face). For what it’s worth, I agree with L&E that nobody should interpret dACC activation in a study as strong evidence of “salience”—and, for that matter, also of “pain”. As for why I’m not readying a statement to criticize the salience researchers, the answer is that it’s not my job to police the ACC literature. My interest is in making sure Neurosynth is used appropriately. L&E can rest assured that if someone published an article based entirely on Neurosynth results in which their primary claim was that the dACC is selective for salience, I would have written precisely the same kind of critique. Though it should perhaps concern them that, of the hundreds of published uses of Neurosynth to date, theirs is the first and only one that has moved me to write a critical commentary.

But no. Nowhere in the blog does TY comment on this finding that directly contradicts a major current account of the dACC. Not so much as a “Geez, isn’t it crazy that so many folks these days think the dACC and AI can be best described in terms of salience detection and yet there is no reverse inference evidence at all for this claim.“

Once again: I didn’t comment on this because I’m not interested in the dACC; I’m interested in making sure Neurosynth is used appropriately. If L&E had asked me, “hey, do you think Neurosynth supports saying that dACC activation is a good marker of ‘salience’?”, I would have said “no, of course not.” But L&E didn’t write a paper titled “dACC activity should not be interpreted as a marker of salience”. They wrote a paper titled “the dACC is selective for pain”, in which they argue that pain is the best psychological characterization of dACC—a claim that Neurosynth simply does not support.

For the terms executive and conflict, our Figure 3 in the PNAS paper shows a tiny bit of dACC. We think the more comprehensive figures we’ve included here continue to tell the same story. If someone wants to tell the conflict story of why pain activates the dACC, we think there should be evidence of widespread robust reverse inference mappings from the dACC to conflict. But the evidence for such a claim just isn’t there. Whatever else you think about the rest of our statistics and claims, this should give a lot of folks pause, because this is not what almost any of us would have expected to see in these reverse inference maps (including us).

No objections here.

If you generally buy into Neurosynth as a useful tool (and you should), then when you look at the four maps above, it should be reasonable to conclude, at least among these four processes, that the dACC is much more involved in that first one (i.e. pain). Let’s test this intuition in a new thought experiment.

Imagine you were given the three reverse inference maps below and you were interested in the function of the occipital cortex area marked off with the green outline. You’d probably feel comfortable saying the region seems to have a lot more to do with Term A than Terms B or C. And if you know much about neuroanatomy, you’d probably be surprised, and possibly even angered, when I tell you that Term A is “˜motor’, Term B is “˜engaged’, and Term C is “˜visual’. How is this possible since we all know this region is primarily involved in visual processes? Well it isn’t possible because I lied. Term A is actually “˜visual’ and Term C is “˜motor’. And now the world makes sense again because these maps do indeed tell us that this region is widely and robustly associated with vision and only modestly associated with engagement and motor processes. The surprise you felt, if you believed momentarily that Term A was motor was because you have the same intuition we do that these reverse inference maps tell us that Term A is the likely function of this region, not Term B or Term C ““ and we’d like that reverse inference to be what we always thought this region was associated with ““ vision. It’s important to note that while a few voxels appear in this region for Terms B and C, it still feels totally fine to say this region’s psychological function can best be described as vision-related. It is the widespread robust nature of the effect in Term A, relative to the weak and limited effects of Terms B and C, that makes this a compelling explanation of the region.

I’m happy to grant L&E that it may “feel totally fine” to some people to make a claim like this. But this is purely an appeal to intuition, and has zero bearing on the claim’s actual validity. I hope L&E aren’t seriously arguing that cognitive neuroscientists should base the way we do statistical inference on our intuitions about what “feels totally fine”. I suspect it felt totally fine to L&E to conclude in 2003 that people were experiencing physical pain because the dACC was active, even though there was no evidential basis for such a claim (and there still isn’t). Recall that, in surveys of practicing researchers, a majority of respondents routinely endorse the idea that a p-value of .05 means that that there’s at least a 95% probability that the alternative hypothesis is correct (it most certainly doesn’t mean this). Should we allow people to draw clearly invalid conclusions in their publications on the grounds that it “feels right” to them? Indeed, as I show below, L&E’s arguments for selectivity rest in part on an invalid acceptance of the null hypothesis. Should they be given a free pass on what is probably the cardinal sin of NHST, on the grounds that it probably “felt right” to them to equate non-significance with evidence of absence?

The point of Neurosynth is that it provides a probabilistic framework for understanding the relationship between psychological function and brain activity. The framework has many very serious limitations that, in practice, make it virtually impossible to draw any meaningful reverse inference from observed patterns of brain activity in any individual study. If L&E don’t like this, they’re welcome to build their own framework that overcomes the limitations of Neurosynth (or, they could even help me improve Neurosynth!). But they don’t get to violate basic statistical tenets in favor of what “feels totally fine” to them.

Another point of this thought experiment is that if Term A is what we expect it to be (i.e. vision) then we can keep assuming that Neurosynth reverse inference maps tell us something valuable about the function of this region. But if Term A violates our expectation of what this region does, then we are likely to think about the ways in which Neurosynth’s results are not conclusive on this point.

We suspect if the dACC results had come out differently, say with conflict showing wide and robust reverse inference effects throughout the dACC, and pain showing little to nothing in dACC, that most of our colleagues would have said “Makes sense. The reverse inference map confirms what we thought ““ that dACC serves a general cognitive function of detecting conflicts.“ We think it is because of the content of the results rather than our approach that is likely to draw ire from many.

I can’t speak for L&E’s colleagues, but my own response to their paper was indeed driven entirely by their approach. If someone had published a paper using Neurosynth to argue that the dACC is selective for conflict, using the same kinds of arguments L&E make, I would have written exactly the same kind of critique I wrote in response to L&E’s paper. I don’t know how I can make it any clearer that I have zero attachment to any particular view of the dACC; my primary concern is with L&E’s misuse of Neurosynth, not what they or anyone else thinks about dACC function. I’ve already made it clear several times that I endorse their conclusion that conflict, salience, and cognitive control are not adequate explanations for dACC function. What they don’t seem to accept is that pain isn’t an adequate explanation either, as the data from Neurosynth readily demonstrate.

5. L&E did the wrong analyses

TY suggests that we made a major error by comparing the Z-scores associated with different terms and should have used posterior probabilities instead. If our goal had been to compare effect sizes this might have made sense, but comparing effect sizes was not our goal. Our goal was to see whether there was accumulated evidence across studies in the Neurosynth database to support reverse inference claims from the dACC.

I’ve already addressed the overarching problem with L&E’s statistical analyses in the first part of this post. Below I’ll just walk through each of L&E’s assertions in detail and point out all of the specific issues in detail. I’ll warn you right now that this is not likely to make for very exciting reading.

While we think the maps for each term speak volumes just from visual inspection, we thought it was also critical to run the comparisons across terms directly. We all know the statistical error of showing that A is significant, while B is not and then assuming, but not testing A > B, directly. TY has a section called “A>B does not imply ~B“ (where ~B means “˜not B’). Indeed it does not, but all the reverse inference maps for the executive, conflict, and salience terms already established ~B. We were just doing due diligence by showing that the difference between A and B was indeed significant.

I apologize for implying that L&E weren’t aware that A > B doesn’t entail ~B. I drew that conclusion because the only other way I could see their claim of selectivity making any sense is if they were interpreting a failure to detect a significant effect for B as positive evidence of no effect. I took that to be much more unlikely, because it’s essentially the cardinal sin of NHST. But their statement here explicitly affirms that this is, in fact, exactly what they were arguing—which leads me to conclude that they don’t understand the null hypothesis statistical testing (NHST) framework they’re using. The whole point of this section of my post was that L&E cannot conclude that there’s no activity in dACC for terms like conflict or salience, because accepting the null is an invalid move under NHST. Perhaps I wasn’t sufficiently clear about this in my last post, so let me reiterate: the reverse inference maps do not establish ~B, and cannot establish ~B. The (invalid) comparison tests of A > B do not establish ~B, and cannot cannot establish ~B. In fact, no analysis, figure, or number L&E report anywhere in their paper establishes ~B for any of the terms they compare with pain. Under NHST, the only possible result of any of L&E’s analyses that would allow them to conclude that a term is not positively associated with dACC activation would be a significant result in the negative direction (i.e., if dACC activation implied a decrease in likelihood of a term). But that’s clearly not true of any of the terms they examine.

Note that this isn’t a fundamental limitation of statistical inference in general; it’s specifically an NHST problem. A Bayesian model comparison approach would have allowed L&E to make a claim about the evidence for the null in comparison to the alternative (though specifying the appropriate priors here might not be very straightforward). Absent such an analysis, L&E are not in any position to make claims about conflict or salience not activating the dACC—and hence, per their own criteria for selectivity, they have no basis for arguing that pain is selective.

Now, in my last post, I went well beyond this logical objection and argued that, if you analyze the data using L&E’s own criteria, there’s plenty of evidence for significant effects of other terms in dACC. I now regret including those analyses. Not because they were wrong; I stand by my earlier conclusion (which should be apparent to anyone who spends five minutes browsing maps on Neurosynth.org), and this alone should have prevented L&E from making claims about pain selectivity. But the broader point is that I don’t want to give the impression that this debate is over what the appropriate statistical threshold for analysis is—i.e., that maybe if we use p < 0.05, I’m right, and if we use FDR = 0.1, L&E are right. The entire question of which terms do or don’t show a significant effect is actually completely beside the point given that L&E’s goal is to establish that only pain activates the dACC, and that terms like conflict or salience don’t. To accomplish that, L&E would need to use an entirely different statistical framework that allows them them to accept the null (relative to some alternative).

If it’s reasonable to use the Z-scores from Neurosynth to say “How much evidence is there for process A being a reliable reverse inference target for region X“ then it has to be reasonable to compare Z-scores from two analyses to ask “How much MORE evidence is there for process A than process B being a reliable reverse inference target for region X“. This is all we did when we compared the Z-scores for different terms to each other (using a standard formula from a meta-analysis textbook) and we think this is the question many people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region.

I addressed this in the earlier part of this post, where I explained why one cannot obtain support for a reverse inference using z-scores or p-values. Reverse inference is inherently a Bayesian notion, and makes sense only if you’re willing to talk about prior and posterior probabilities. So L&E’s first premise here—i.e., that it’s reasonable to use z-scores from Neurosynth to quantify “evidence for process A being a reliable reverse inference target for region X” is already false.

For what it’s worth, the second premise is also independently false, because it’s grossly inappropriate to use meta-analytic z-score comparison test in this situation. For one thing, there’s absolutely no reason to compare z-scores given that the distributional information is readily available. Rosenthal (the author of the meta-analysis textbook L&E cite) himself explicitly notes that such a test is inferior to effect size-based tests, and is essentially a last-ditch approach. Moreover, the intended use of the test in meta-analysis is to determine whether or not there’s heterogeneity in p-values as a precursor to combining them in an analysis (which is a concern that makes no sense in the context of Neurosynth data). At best, what L&E would be able to say with this test is something like “it looks like these two z-scores may be coming from different underlying distributions”. I don’t know why L&E think this is at all an interesting question here, because we already know with certainty that there can be no meaningful heterogeneity of this sort in these z-scores given that they’re all generated using exactly the same set of studies.

In fact, the problems with the z-score comparison test L&E are using run so deep that I can’t help point out just one truly stupefying implication of the approach: it’s possible, under a wide range of scenarios, to end up concluding that there’s evidence that one term is “preferentially” activated relative to another term even when the point estimate is (significantly) larger for the latter term. For example, consider a situation in which we have a probability of 0.65 for one term with n = 1000 studies, and a probability of 0.8 for a second term with n = 100 studies. The one-sample proportion test for these two samples, versus a null of 0.5, gives z-scores of 9.5 and 5.9, respectively–so both tests are highly significant, as one would expect. But the Rosenthal z-score test favored by L&E tells us that the z-score for the first sample is significantly larger than the z-score for the second. It isn’t just wrong to interpret this as evidence that the first term has a more selective effect; it’s dangerously wrong. A two-sample test for the difference in proportions correctly reveals a significant effect in the expected direction (i.e., the 0.8 probablity in the smaller sample is in fact significantly greater than the 0.65 probability in the much larger sample). Put simply, L&E’s test is broken. It’s not clear that it tests anything meaningful in this context, let alone allowing us to conclude anything useful about functional selectivity in dACC.

As for what people are asking when they look at the Neurosynth maps for any two competing accounts of a neural region: I really don’t know, and I don’t see how that would have any bearing on whether the methods L&E are using are valid or not. What I do know that I’ve never seen anyone else compare Neurosynth z-scores using a meta-analytic procedure intended to test for heterogeneity of effects—and I certainly wouldn’t recommend it.

TY then raises two quite reasonable issues with the Z-score comparisons, one of which we already directly addressed in our paper. First, TY raises the issue that Z-scores increase with accumulating evidence, so terms with more studies in the database will tend to have larger Z-scores. This suggests that terms with the most studies in the database (e.g. motor with 2081 studies) should have significant Z-scores everywhere in the brain. But terms with the most studies don’t look like this. Indeed, the reverse inference map for “functional magnetic“ with 4990 studies is a blank brain with no significant Z-scores.

Not quite. It’s true that for any fixed effect size, z-scores will rise (in absolute value) as sample size increases. But if the true effect size is very small, one will still obtain a negligible z-score even in a very large sample. So while terms with more studies will indeed tend to have larger absolute z-scores, it’s categorically false that “terms with the most studies in the database should have significant z-scores everywhere in the brain”.

However, TY has a point. If two terms have similar true underlying effects in dACC, then the one with the larger number of studies will have a larger Z-score, all else being equal. We addressed this point in the limitations section of our paper writing “It is possible that terms that occur more frequently, like “pain,“ might naturally produce stronger reverse inference effects than less frequent terms. This concern is addressed in two ways. First, the current analyses included a variety of terms that included both more or fewer studies than the term “pain“ and no frequency-based gradient of dACC effects is observable.“ So while pain (410 studies) is better represented in the Neurosynth database than conflict (246 studies), effort (137 studies), or Stroop (162 studies), several terms are better represented than pain including auditory (1004 studies), cognitive control (2474 studies), control (2781 studies), detection (485 studies), executive (531 studies), inhibition (432 studies), motor (1910 studies), and working memory (815). All of these, regardless of whether they are better or worse represented in the Neurosynth database show minimal presence in the dACC reverse inference maps. It’s also worth noting that painful and noxious, with only 158 and 85 studies respectively, both show broader coverage within the dACC than any of the cognitive or salience terms considered in our paper.

L&E don’t seem to appreciate that the relationship between the point estimate of a parameter and the uncertainty around that estimate is not like the relationship between two predictors in a regression, where one can (perhaps) reason logically about what would or should be true if one covariate was having an influence on another. One cannot “rule out” the possibility that sample size is a problem by pointing to some large-N terms with small effects or some small-N terms with large effects. Sampling error is necessarily larger in smaller samples. The appropriate way to handle between-term variation in sample size is to properly build that differential uncertainty into one’s inferential test. Rosenthal’s z-score comparison doesn’t do this. The direct meta-analytic contrast one can perform with Neurosynth does do this, but of course, being much more conservative than the Rosenthal test (appropriately so!), L&E don’t seem to like the results it produces. (And note that the direct meta-analytic contrast would still require one to make strong assumptions about priors if the goal was to make quantitative reverse inferences, as opposed to detecting a mean difference in probability of activation.)

TY’s second point is also reasonable, but is also not a problem for our findings. TY points out that some effects may be easier to produce in the scanner than others and thus may be biased towards larger effect sizes. We are definitely sympathetic to this point in general, but TY goes on to focus on how this is a problem for comparing pain studies to emotion studies because pain is easy to generate in the scanner and emotion is hard. If we were writing a paper comparing effect sizes of pain and emotion effects this would be a problem but (a) we were not primarily interested in comparing effect sizes and (b) we definitely weren’t comparing pain and emotion because we think the aspect of pain that the dACC is involved in is the affective component of pain as we’ve written in many other papers dating back to 2003 (Eisenberger & Lieberman, 2004; Eisenberger, 2012; Eisenberger, 2015).

It certainly is a problem for L&E’s findings. Z-scores are related one-to-one with effect size for any fixed sample size, so if the effect size is artificially increased in one condition, so too is the z-score that L&E stake their (invalid) analysis on. Any bias in the point estimate will necessarily distort the z-value as well. This is not a matter of philosophical debate or empirical conjecture, it’s a mathematical necessity.

Is TY’s point relevant to our actual terms of comparison: executive, conflict, and salience processes? We think not. Conflict tasks are easy and reliable ways to produce conflict processes. In multiple ways, we think pain is actually at a disadvantage in the comparison to conflict. First, pain effects are so variable from one person to the next that most pain researchers begin by calibrating the objective pain stimuli delivered, to each participant’s subjective responses to pain. As a result, each participant may actually be receiving different objective inputs and this might limit the reliability or interpretability of certain observed effects. Second, unlike conflict, pain can only be studied at the low end of its natural range. Due to ethical considerations, we do not come close to studying the full spectrum of pain phenomena. Both of these issues may limit the observation of robust pain effects relative to our actual comparisons of interest (executive, conflict, and salience processes.

Perhaps I wasn’t sufficiently clear, but I gave the pain-emotion contrast as an example. The point is that meta-analytic comparisons of the kind L&E are trying to make are a very dangerous proposition unless one has reason to think that two classes of manipulations are equally “strong”. It’s entirely possible that L&E are right that executive control manipulations are generally stronger than pain manipulations, but that case needs to be made on the basis of data, and cannot be taken for granted.

6. About those effect size comparison maps

After criticizing us for not comparing effect sizes, rather than Z-scores, TY goes on to produce his own maps comparing the effect sizes of different terms and claiming that these represent evidence that the dACC is not selective for pain. A lot of our objections to these analyses as evidence against our claims repeats what’s already been said so we’ll start with what’s new and then only briefly reiterate the earlier points.

a) We don’t think it makes much sense to compare effect sizes for terms in voxels for which there is no evidence that it is a valid reverse inference target. For instance, the posterior probability at 0 26 26 for pain is .80 and for conflict is .61 (with .50 representing a null effect). Are these significantly different from one another? I don’t think it matters much because the Z-score associated with conflict at this spot is 1.37, which is far from significant (or at least it was when we ran our analyses last summer. Strangely, now, any non-significant Z-scores seem to come back with a value of 0, whereas they used to give the exact non-significant Z-score).

I’m not sure why L&E think that statistical significance makes a term a “valid target” for reverse inference (or conversely, that non-significant terms cannot be valid targets). If they care to justify this assertion, I’ll be happy to respond to it. It is, in any case, a moot point, since many of the examples I gave were statistically significant, and L&E don’t provide any explanation as to why those terms aren’t worth worrying about either.

As for the disappearance of non-significant z-scores, that’s a known bug introduced by the last major update to Neurosynth, and it’ll be fixed in the next major update (when the entire database is re-generated).

If I flip a coin twice I might end up with a probability estimate of 100% heads, but this estimate is completely unreliable. Comparing this estimate to those from a coin flipped 10,000 times which comes up 51% heads makes little sense. Would the first coin having a higher probability estimate than the second tell us anything useful? No, because we wouldn’t trust the probability estimate to be meaningful. Similarly, if a high posterior probability is associated with a non-significant Z-score, we shouldn’t take this posterior probability as a particularly reliable estimate.

L&E are correct that it wouldn’t make much sense to compare an estimate from 2 coin flips to an estimate from 10,000 coin flips. But the error is in thinking that comparing p-values somehow addresses this problem. As noted above, the p-value comparison they use is a meta-analytic test that only tells one if a set of z-scores are heterogenous, and is not helpful for comparing proportions when one has actual distributional information available. It would be impossible to answer the question of whether one coin is biased relative to another using this test—and it’s equally impossible to use it to determine whether one term is more important than another for dACC function.

b) TY’s approach for these analyses is to compare the effect sizes for any two processes A & B by finding studies in the database tagged for A but not B and others tagged for B but not A and to compare these two sets. In some cases this might be fine, but in others it leaves us with a clean but totally unrealistic comparison. To give the most extreme example, imagine we did this for the terms pain and painful. It’s possible there are some studies tagged for painful but not pain, but how representative would these studies be of “painful“ as a general term or construct? It’s much like the clinical problem of comparing depression to anxiety by comparing those with depression (but not anxiety) to those with anxiety (but not depression). These folks are actually pretty rare because depression and anxiety are so highly comorbid, so the comparison is hardly a valid test of depression vs. anxiety. Given that we think pain, fear, emotion, and autonomic are actually all in the same class of explanations, we think comparisons within this family are likely to suffer from this issue.

There’s nothing “unrealistic” about this comparison. It’s not the inferential test’s job to make sure that the analyst is doing something sensible, it’s the analyst’s job. Nothing compels L&E to run a comparison between ‘pain’ and ‘painful’, and I fully agree that this would be a dumb thing to do (and it would be an equally dumb thing to do using any other statistical test). One the other hand, comparing the terms ‘pain’ and ’emotion’ is presumably not a dumb thing to do, so it behooves us to make sure that we use an inferential test that doesn’t grossly violate common sense and basic statistical assumptions.

Now, if L&E would like to suggest an alternative statistical test that doesn’t exclude the intersection of the two terms and still (i) produces interpretable results, (ii) weights all studies equally, (iii) appropriately accounts for the partial dependency structure of the data, and (iv) is sufficiently computationally efficient to apply to thousands of terms in a reasonable amount of time (which rules out most permutation-based tests), then I’d be delighted to consider their suggestions. The relevant code can be found here, and L&E are welcome to open a GitHub issue to discuss this further. But unless they have concrete suggestions, it’s not clear what I’m supposed to do with their assertion that doing meta-analytic comparison properly sometimes “leaves us with a clean but totally unrealistic comparison”. If they don’t like the reality, they’re welcome to help me improve the reality. Otherwise they’re simply engaging in wishful thinking. Nobody owes L&E a statistical test that’s both valid and gives them results they like.

c) TY compared topics (i.e., a cluster of related terms), not terms. This is fine, but it is one more way that what TY did is not comparable to what we did (i.e. one more way his maps can’t be compared to those we presented).

I almost always use topics rather than terms in my own analyses, for a variety of reasons (they have better construct validity, are in theory more reliable, reduce the number of comparisons, etc.). I didn’t try out the analyses I ran with any of the term-based features, but I encourage L&E to do so if they like, and I’d be surprised if the results differ appreciably (they should, in general, simply be slightly less robust all around). In any case, I deliberately made my code available so that L&E (or anyone else) could easily reproduce and modify my analyses. (And of course, nothing at all hangs on the results in any case, because the whole premise that this is a suitable way to demonstrate selectivity is unfounded.)

d) Finally and most importantly, our question would not have led us to comparing effect sizes. We were interested in whether there was greater accumulated evidence for one term (i.e. pain) being a reverse inference target for dACC activations than for another term (e.g. conflict). Using the Z-scores as we did is a perfectly reasonable way to do this.

See above. Using the z-scores the way L&E did is not reasonable and doesn’t tell us anything anyone would want to know about functional selectivity.

7. Biases all around

Towards the end of his blog, TY says what we think many cognitive folks believe:

“I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else.“

We think this is very telling because it suggests that the findings such as those in our PNAS paper are likely to be unacceptable regardless of what the data shows.

Another misrepresentation of what I actually said, which was:

One way to see this is to note that when we meta-analytically compare pain with almost any other term in Neurosynth (see the figure above), there are typically a lot of brain regions (extending well outside of dACC and other putative pain regions) that show greater activation for pain than for the comparison condition, and very few brain regions that show the converse pattern. I don’t think it’s plausible to think that much of the brain really prizes pain representation above all else. A more sensible interpretation is that the Neurosynth posterior probability estimates for pain are inflated to some degree by the relative ease of inducing pain experimentally.

The context makes it abundantly clear that I was not making a general statement about the importance of pain in some grand evolutionary sense, but simply pointing out the implausibility of supposing that Neurosynth reverse inference maps provide unbiased windows into the neural substrates of cognition. In the case of pain, there’s tentative evidence to believe that effect sizes are overestimated.

In contrast, we can’t think of too many things that the brain would prize above pain (and distress) representations. People who don’t feel pain (i.e. congenital insensitivity to pain) invariably die an early death ““ it is literally a death sentence to not feel pain. What could be more important for survival? Blind and deaf people survive and thrive, but those without the ability to feel pain are pretty much doomed.

I’m not sure what this observation is supposed to tell us. One could make the same kind of argument about plenty of other functions. People who suffer from a variety of autonomic or motor problems are also likely to suffer horrible early deaths; it’s unclear to me how this would justify a claim like “the brain prizes little above autonomic control”, or what possibly implications such a claim would have for understanding dACC function.

Similar (but not identical) to TY’s conclusions that we opened this blog with, we think the following conclusions are supported by the Neurosynth evidence in our PNAS paper:

I’ll take these one at a time.

* There is more widespread and robust reverse inference evidence for the role of pain throughout the dACC than for executive, conflict, and salience-related processes.

I’m not sure what is meant here by “robust reverse inference evidence”. Neurosynth certainly provides essentially no basis for drawing reverse inferences about the presence of pain in individual studies. (Let me remind L&E once again: at best, the posterior probability for ‘pain’ in dACC is around 80%–but that’s given an assumed based rate of 50%, not the more realistic real-world rate of around 3%). If what they mean is something like “on average, taking the average of all voxels in dACC, there’s more evidence of a statistical association between pain and dACC than pain and conflict monitoring”, then I’m fine with that.

* There is little to no evidence from the Neurosynth database that executive, conflict, and salience-related processes are reasonable reverse inference targets for dACC activity.

Again, this depends on what L&E mean. If they mean that one shouldn’t, upon observing activation in dACC, proclaim that conflict must be present, then they’re absolutely right. But again, the same is true for pain. On the other hand, if they mean that there’s no evidence in Neurosynth for a reverse inference association between these terms and dACC activity, where the criterion is surviving FDR-correction, then that’s clearly not true: for example, the conflict map clearly includes voxels within the dACC. Alternatively, if L&E’s point is that the dACC/preSMA region centrally associated with conflict monitoring or executive control is more dorsal than many (though not all) people have assumed, then I agree with them without qualification.

* Pain processes, particularly the affective or distressing part of pain, are in the same family with other distress-related processes including terms like distress, fear, and negative affect.

I have absolutely no idea what evidence this conclusion is based on. Nothing I can see in Neurosynth seems to support this—let alone anything in the PNAS paper. As I’ve noted several times now, most distress-related terms do not seem to overlap meaningfully with pain-related activations in dACC. To the extent that one thinks spatial overlap is a good criterion for determining family membership (and for what it’s worth, I don’t think it is), the evidence does not seem particularly suggestive of any such relationship (and L&E don’t test it formally in any way).

Postscript. *L&E should have used reverse inference, not forward inference, when examining the anatomical boundaries of dACC.*

We saved this one for the postscript because this has little bearing on the major claims of our paper. In our paper, we observed that when one does a forward inference analysis of the term “˜dACC’ the strongest effect occurs outside the dACC in what is actually SMA. This suggested to us that people might be getting activations outside the dACC and calling them dACC (much as many activations clearly not in the amygdala have been called amygdala because it fits a particular narrative). TY admits having been guilty of this in TY’11 and points out that we made this mistake in our 2003 Science paper on social pain. A couple of thoughts on this.

a) In 2003, we did indeed call an activation outside of dACC (-6 8 45) by the term “dACC“. TY notes that if this is entered into a Neurosynth analysis the first anatomical term that appears is SMA. Fair enough. It was our first fMRI paper ever and we identified that activation incorrectly. What TY doesn’t mention is that there are two other activations from the same paper (-8 20 40; -6 21 41) where the top named anatomical term in Neurosynth is anterior cingulate. And if you read this in TY’s blog and thought “I guess social pain effects aren’t even in the dACC“, we would point you to the recent meta-analysis of social pain by Rotge et al. (2015) where they observed the strongest effect for social pain in the dACC (8 24 24; Z=22.2 PFDR<.001). So while we made a mistake, no real harm was done.

I mentioned the preSMA activation because it was the critical data point L&E leaned on to argue that the dACC was specifically associated with the affective component of pain. Here’s the relevant excerpt from the 2003 social pain paper:

As predicted, group analysis of the fMRI data indicated that dorsal ACC (Fig. 1A) (x ““ 8, y 20, z 40) was more active during ESE than during inclusion (t 3.36, r 0.71, P < 0.005) (23, 24). Self-reported distress was positively correlated with ACC activity in this contrast (Fig. 2A) (x ““ 6, y 8, z 45, r 0.88, P < 0.005; x ““ 4, y 31, z 41, r 0.75, P < 0.005), suggesting that dorsal ACC activation during ESE was associated with emotional distress paralleling previous studies of physical pain (7, 8). The anterior insula (x 42, y 16, z 1) was also active in this comparison (t 4.07, r 0.78, P < 0.005); however, it was not associated with self-reported distress.

Note that both the dACC and anterior insula were activated by the exclusion vs. inclusion contrast, but L&E concluded that it was specifically the dACC that supports the “neural alarm” system, by virtue of being correlated with participants’ subjective reports (whereas the insula was not). Setting aside the fact that these results were observed in a sample size of 13 using very liberal statistical thresholds (so that the estimates are highly variable, spatial error is going to be very high, there’s a high risk of false positives, and accepting the null in the insula because of the absence of a significant effect is probably a bad idea), in focusing on the the preSMA activation in my critique, I was only doing what L&E themselves did in their paper:

Dorsal ACC activation during ESE could reflect enhanced attentional processing, previously associated with ACC activity (4, 5), rather than an underlying distress due to exclusion. Two pieces of evidence make this possibility unlikely. First, ACC activity was strongly correlated with perceived distress after exclusion, indicating that the ACC activity was associated with changes in participants’ self-reported feeling states.

By L&E’s own admission, without the subjective correlation, there would have been little basis for concluding that the effect they observed was attributable to distress rather than other confounds (attentional increases, expectancy violation, etc.). That’s why I focused on the preSMA activation: because they did too.

That said, since L&E bring up the other two activations, let’s consider those too, since they also have their problems. While it’s true that both of them are in the anterior cingulate, according to Neurosynth, neither of them is a “pain” voxel. The top functional associates for both locations are ‘inteference’, ‘task’, ‘verbal’, ‘verbal fluency’, ‘word’, ‘demands’, ‘words’, ‘reading’ … you get the idea. Pain is not significantly associated with these points in Neurosynth. So while L&E might be technically right that these other activations were in the anterior cingulate, if we take Neurosynth to be as reliable a guide to reverse inference as they think, then L&E never had any basis for attributing the social exclusion effect to pain to begin with—because, according to Neurosynth, literally none of the medial frontal cortex activations reported in the 2003 paper are associated with pain. I’ll leave it to others to decide whether “no harm was done” by their claim that the dACC is involved in social pain.

In contrast, TY’11’s mistake is probably of greater significance. Many have taken Figure 3 of TY’11 as strong evidence that the dACC activity can’t be reliably associated with working memory, emotion, or pain. If TY had tested instead (2 8 40), a point directly below his that is actually in dACC (rather than 2 8 50 which TY now acknowledges is in SMA), he would have found that pain produces robust reverse inference effects, while neither working memory or emotion do. This would have led to a very different conclusion than the one most have taken from TY’11 about the dACC.

Nowhere in TY’11 is it claimed that dACC activity isn’t reliably associated with working memory, emotion or pain (and, as I already noted in my last post, I explicitly said that the posterior aspects of dACC are preferentially associated with pain). What I did say is that dACC activation may not be diagnostic of any of these processes. That’s entirely accurate. As I’ve explained at great length above, there is simply no basis for drawing any strong reverse inference on the basis of dACC activation.

That said, if it’s true that many people have misinterpreted what I said in my paper, that would indeed be potentially damaging to the field. I would appreciate feedback from other people on this issue, because if there’s a consensus that my paper has in fact led people to think that dACC plays no specific role in cognition, then I’m happy to submit an erratum to the journal. But absent such feedback, I’m not convinced that my paper has had nearly as much influence on people’s views as L&E seem to think.

b) TY suggested that we should have looked for “dACC“ in the reverse inference map rather than the forward inference map writing “All the forward inference map tells you is where studies that use the term “dACC“ tend to report activation most often“. Yet this is exactly what we were interested in. If someone is talking about dACC in their paper, is that the region most likely to appear in their tables? The answer appears to be no.

No, it isn’t what L&E are interested in. Let’s push this argument to its logical extreme to illustrate the problem: imagine that every single fMRI paper in the literature reported activation in preSMA (plus other varying activations)—perhaps because it became standard practice to do a “task-positive localizer” of some kind. This is far-fetched, but certainly conceptually possible. In such a case, searching for every single region by name (“amygdala”, “V1”, you name it) would identify preSMA as the peak voxel in the forward inference map. But what would this tell us, other than that preSMA is activated with alarming frequency? Nothing. What L&E want to know is what brain regions have the biggest impact on the likelihood that an author says “hey, that’s dACC!”. That’s a matter of reverse inference.

c) But again, this is not one of the central claims of the paper. We just thought it was noteworthy so we noted it. Nothing else in the paper depends on these results.

I agree with this. I guess it’s nice to end on a positive note.

fMRI: coming soon to a courtroom near you?

Science magazine has a series of three (1, 2, 3) articles by Greg Miller over the past few days covering an interesting trial in Tennessee. The case itself seems like garden variety fraud, but the novel twist is that the defense is trying to introduce fMRI scans into the courtroom in order to establish the defendant’s innocent. As far as I can tell from Miller’s articles, the only scientists defending the use of fMRI as a lie detector are those employed by Cephos (the company that provides the scanning service); the other expert witnesses (including Marc Raichle!) seem pretty adamant that admitting fMRI scans as evidence would be a colossal mistake. Personally, I think there are several good reasons why it’d be a terrible, terrible, idea to let fMRI scans into the courtroom. In one way or another, they all boil down to the fact that just  isn’t any shred of evidence to support the use of fMRI as a lie detector in real-world (i.e, non-contrived) situations. Greg Miller has a quote from Martha Farah (who’s a spectator at the trial) that sums it up eloquently:

Farah sounds like she would have liked to chime in at this point about some things that weren’t getting enough attention. “No one asked me, but the thing we have not a drop of data on is [the situation] where people have their liberty at stake and have been living with a lie for a long time,” she says. She notes that the only published studies on fMRI lie detection involve people telling trivial lies with no threat of consequences. No peer-reviewed studies exist on real world situations like the case before the Tennessee court. Moreover, subjects in the published studies typically had their brains scanned within a few days of lying about a fake crime, whereas Semrau’s alleged crimes began nearly 10 years before he was scanned.

I’d go even further than this, and point out that even if there were studies that looked at ecologically valid lying, it’s unlikely that we’d be able to make any reasonable determination as to whether or not a particular individual was lying about a particular event. For one thing, most studies deal with group averages and not single-subject prediction; you might think that a highly statistically significant difference between two conditions (e.g., lying and not lying) necessarily implies a reasonable ability to make predictions at the single-subject level, but you’d be surprised. Prediction intervals for individual observations are typically extremely wide even when there’s a clear pattern at the group level. It’s just easier to make general statements about differences between conditions or groups than it is about what state a particular person is likely to be in given a certain set of conditions.

There is, admittedly, an emerging body of literature that uses pattern classification to make predictions about mental states at the level of individual subjects, and accuracy in these types of application can sometimes be quite high. But these studies invariably operate on relatively restrictive sets of stimuli within well-characterized domains (e.g., predicting which word out of a set of 60 subjects are looking at). This really isn’t “mind reading” in the sense that most people (including most judges and jurors) tend to think of it. And of course, even if you could make individual-level predictions reasonably accurately, it’s not clear that that’s good enough for the courtroom. As a scientist, I might be thrilled if I could predict which of 10 words you’re looking at with 80% accuracy (which, to be clear, is currently a pipe dream in the context of studies of ecologically valid lying). But as a lawyer, I’d probably be very skeptical of another lawyer who claimed my predictions vindicated their client. The fact that increased anterior cingulate activation tends to accompany lying on average isn’t a good reason to convict someone unless you can be reasonably certain that increased ACC activation accompanies lying for that person in that context when presented with that bit of information. At the moment, that’s a pretty hard sell.

As an aside, the thing I find perhaps most curious about the whole movement to use fMRI scanners as lie detectors is that there are very few studies that directly pit fMRI against more conventional lie detection techniques–namely, the polygraph. You can say what you like about the polygraph–and many people don’t think polygraph evidence should be admissible in court either–but at least it’s been around for a long time, and people know more or less what to expect from it. It’s easy to forget that it only makes sense to introduce fMRI scans (which are decidedly costly) as evidence if they do substantially better than polygraphs. Otherwise you’re just wasting a lot of money for a fancy brain image, and you could have gotten just as much information by simply measuring someone’s arousal level as you yell at them about that bloodstained Cadillac that was found parked in their driveway on the night of January 7th. But then, maybe that’s the whole point of trying to introduce fMRI to the courtroom; maybe lawyers know that the polygraph has a tainted reputation, and are hoping that fancy new brain scanning techniques that come with pretty pictures don’t carry the same baggage. I hope that’s not true, but I’ve learned to be cynical about these things.

At any rate, the Science articles are well worth a read, and since the judge hasn’t yet decided whether or not to allow fMRI or not, the next couple of weeks should be interesting…

[hat-tip: Thomas Nadelhoffer]

everything we know about the neural bases of cognitive control, in 20 review articles or less

Okay, not everything. But a lot of what we know. The current issue of Current Opinion in Neurobiology, which features a special focus on cognitive neuroscience, contains are almost 20 short review papers, most of which focus on the neural mechanisms of cognitive control in one guise or another. As the Editors of the special issue (Earl Miller and Liz Phelps) explain in their introduction:

Our goal with this special issue was to highlight integrative approaches to brain function. To this end, we focused on the most integrative of brain functions, cognitive control. Cognitive, or executive, control is the ability to coordinate thought and action by directing them toward goals, often far removed goals.

I’ve only skimmed a couple of articles so far, but it’s a pretty impressive table of contents, and I’m looking forward to reading a lot of the reviews. The nice thing about the Current Opinion series, like the Trends series, is that the reviews are short and focused, so they’re well-suited to people who are very busy and don’t have enough hours in their day (like you), or people who just have a short attention span (like me).

Admittedly, I also have an ulterior motive for mentioning this issue: Todd Braver, Mike Cole and I contributed one of the articles, in which we review the neural bases of individual differences in executive control. I think it’s a really nice paper, the credit for which really goes to Todd and Mike–I mostly just contributed the section on methodological considerations (which is basically a precis of a much longer chapter I wrote with Todd a couple of years ago). Todd and Mike somehow managed to review work on everything from reward and motivation to emotion regulation to working memory capacity to dopamine genes, all in the space of eight pages. It’s a nice review highlighting the importance of modeling not only the central tendency of people’s behavior and brain activation in cognitive neuroscience studies, but also the variation between individuals. Aside from the fact that many people (including me!) find individual differences in cognitive abilities intrinsically interesting, an individual differences approach can provide insights that naturally complement those identified by more common within-subject analyses.

For instance, there’s a giant literature on the critical role the neurotransmitter dopamine plays in maintaining and updating goal representations. Most process models of dopamine function make either explicit or tacit predictions about how individual differences in dopamine function should manifest behaviorally, and recent studies have sought to test some of these predictions using both neuroimaging and molecular genetic techniques. A lot of work has focused on a common polymorphism in the COMT gene, variants of which dramatically alter the efficiency of dopamine degradation in the prefrontal cortex. An (admittedly simplistic) prediction that follows from one standard view of prefrontal dopamine function (that tonic dopamine serves to stabilize active representations) is that people who possess the low-activity met allele (and consequently have higher dopamine levels in PFC) should have a greater capacity to maintain goal representations and sustain attention, which may manifest as improved performance on many working memory tasks. Conversely, people with the val allele, which is associated with lower tonic dopamine levels in PFC, should do worse at tasks requiring sustained attention, but may have greater cognitive flexibility (due to the capacity to switch between goal representations more easily).

This prediction, which is borne out by a number of studies we review, is fundamentally about individual differences, since we typically can’t manipulate people’s COMT genes in the lab (though I know some people who probably really wish we could!). But the point is, even if you’re not intrinsically interested in what makes people different from one another, studying individual variation at a genetic, neural, or behavioral level can often tell you something useful about the models you’re developing. Particularly when it comes to the domain of executive control, where differences between individuals can be quite striking. Almost any mechanistic model of executive control is going to have ‘joints’ that could theoretically vary systematically across individuals, so it makes sense to capitalize on natural variability between people to test some of the predictions that fall out of the model, instead of just treating between-subject variability as the error term in your one-sample t-test.

Anyway, our article is here, and the full issue is here (though it’s behind a paywall, unfortunately).

functional MRI and the many varieties of reliability

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself to adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations”” paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actual the ideal situations for each of those two types of studies.

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a “˜reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered “˜excellent’, results between 0.59 and 0.75 be considered “˜good’, results between .40 and .58 be considered “˜fair’, and results lower than 0.40 be considered “˜poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

ResearchBlogging.org
Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences

on the limitations of psychiatry, or why bad drugs can be good too

The Neuroskeptic offers a scathing indictment of the notion, editoralized in Nature this week, that the next decade is going to revolutionize the understanding and treatment of psychiatric disorders:

The 2010s is not the decade for psychiatric disorders. Clinically, that decade was the 1950s. The 50s was when the first generation of psychiatric drugs were discovered – neuroleptics for psychosis (1952), MAOis (1952) and tricyclics (1957) for depression, and lithium for mania (1949, although it took a while to catch on).

Since then, there have been plenty of new drugs invented, but not a single one has proven more effective than those available in 1959. New antidepressants like Prozac are safer in overdose, and have milder side effects, than older ones. New “atypical” antipsychotics have different side effects to older ones. But they work no better. Compared to lithium, newer “mood stabilizers” probably aren’t even as good. (The only exception is clozapine, a powerful antipsychotic, but dangerous side-effects limit its use.)

Those are pretty strong claims–especially the assertion that not a single psychiatric drug has proven more effective than those available in 1959. Are they true? I’m not in a position to know for certain, having had only fleeting contacts here and there with psychiatric research. But I guess I’d be surprised if many basic researchers in psychiatry concurred with that assessment. (I’m sure many clinicians wouldn’t, but that wouldn’t be very surprising.) Still, even if you suppose that present-day drugs are no more effective than those available in 1959 on the average (which may or may not be true), it doesn’t follow that there haven’t been major advances in psychiatric treatment. For one thing, the side effects of many modern drugs do tend to be less severe. The Neuroskeptic is right that atypical antipsychotics aren’t as side effect-free as was once hoped; but consider, in contrast, drugs like lamotrigine or valproate–anticonvulsants nowadays widely prescribed for bipolar disorder–which are undeniably less toxic than lithium (though also no more, and possibly less, effective). If you’re diagnosed with bipolar disorder in 2010, there’s still a good chance that you’ll eventually end up being prescribed with lithium;, but (in most cases) it’s unlikely that that’ll be the first line of treatment. And on the bright side, you could end up with a well-managed case of bipolar disorder that never requires you to take drugs with frequent and severe side effects–something that frankly wouldn’t have been an option for almost anyone in 1959.

That last point gets to what I think is the bigger reason for optimism: choice. Even if new drugs aren’t any better than old drugs on average, they’re probably going to work for different groups of people. One of the things that’s problematic about the way the results of clinical trials are typically interpreted is that if a new drug doesn’t outperform an old one, it’s often dismissed as unhelpful. The trouble with this worldview is that even if drug A helps 60% of people on average and drug B helps 54% of people on average (and the difference is statistically and clinically significant), it may well be that drug B helps people who don’t benefit from drug A. The unfortunate reality is that even relatively stable psychiatric patients usually take a while to find an effective treatment regime; most patients try several treatments before settling on one that works. Simply in virtue of there being dozens more drugs available in 2009 than in 1959, it follows that psychiatric patients are much better off living today than fifty years ago. If an atypical antipsychotic controls your schizophrenia without causing motor symptoms or metabolic syndrome, you never have to try a typical antipsychotic; if valproate works well for your bipolar disorder, there’s no reason for you to ever go on lithium. These aren’t small advances; when you’re talking about millions of people who suffer from each of these disorders worldwide, the introduction of any drug that might help even just a fraction of patients who weren’t helped by older medication is a big deal, translating into huge improvements in quality of life and many tens of thousands of lives saved. That’s not to say we shouldn’t strive to develop drugs that aren’t also better on average than the older treatments; it’s just that it shouldn’t be the only (and perhaps not even the main) criterion we use to gauge efficacy.

Having said that, I do agree with the Neuroskeptic’s assessment as to why psychiatric research and treatment seems to proceed more slowly than research in other areas of neuroscience or medicine:

Why? That’s an excellent question. But if you ask me, and judging by the academic literature I’m not alone, the answer is: diagnosis. The weak link in psychiatry research is the diagnoses we are forced to use: “major depressive disorder”, “schizophrenia”, etc.

There are all sorts of methodological reasons why it’s not a great idea to use discrete diagnostic categories when studying (or developing treatments for) mental health disorders. But perhaps the biggest one is that, in cases where a disorder has multiple contributing factors (which is to say, virtually always), drawing a distinction between people with the disorder and those without it severely restricts the range of expression of various related phenotypes, and may even assign people with positive symptomatology to the wrong half of the divide simply because they don’t have some other (relatively) arbitrary symptoms.

For example, take bipolar disorder. If you classify the population into people with bipolar disorder and people without it, you’re doing two rather unfortunate things. One is that you’re lumping together a group of people who have only a partial overlap of symptomatology, and treating them as though they have identical status. One person’s disorder might be characterized by persistent severe depression punctuated by short-lived bouts of mania every few months; another person might cycle rapidly between a variety of moods multiple times per month, week, or even day. Assigning both people the same diagnosis in a clinical study is potentially problematic in that there may be very different underlying organic disorders, which means you’re basically averaging over multiple discrete mechanisms in your analysis, resulting in a loss of both sensitivity and specificity.

The other problem, which I think is less widely appreciated, is that you’ll invariably have many “control” subjects who don’t receive the diagnosis but share many features with people who do. This problem is analogous to the injunction against using median splits: you almost never want to turn an interval-level variable into an ordinal one if you don’t have to, because you lose a tremendous amount of information. When you contrast a sample of people with a bipolar diagnosis with a group of “healthy” controls, you’re inadvertently weakening your comparison by including in the control group people who would be best characterizing as falling somewhere in between the extremes of pathological and healthy. For example, most of us probably know people who we would characterize as “functionally manic” (sometimes also known as “extraverts”)–that is, people who seem to reap the benefits of the stereotypical bipolar syndrome in the manic phase (high energy, confidence, and activity level) but have none of the downside of the depressive phase. And we certainly know people who seem to have trouble regulating their moods, and oscillate between periods of highs and lows–but perhaps just not to quite the extent necessary to obtain a DSM-IV diagnosis. We do ourselves a tremendous disservice if we call these people “controls”. Sure, they might be controls for some aspects of bipolar symptomatology (e.g., people who are consistently energetic serve as a good contrast to the dysphoria of the depressive phase); but in other respects, they may actually closer to the prototypical patient than to most other people.

From a methodological standpoint, there’s no question we’d be much better off focusing on symptoms rather than classifications. If you want to understand the many different factors that contribute to bipolar disorder or schizophrenia, you shouldn’t start from the diagnosis and work backwards; you should start by asking what symptom constellations are associated with specific mechanisms. And those symptoms may well be present (to varying extents) both in people with and without the disorder in question. That’s precisely the motivation behind the current “endophenotype” movement, where the rationale is that you’re better off trying to figure out what biological and (eventually) behavioral changes a given genetic polymorphism is associated with, and then using that information to reshape taxonomies of mental health disorders, than trying to go directly from diagnosis to genetic mechanisms.

Of course, it’s easy to talk about the problems associated with the way psychiatric diagnoses are applied, and not so easy to fix them. Part of the problem is that, while researchers in the lab have the luxury of using large samples that are defined on the basis of symptomatology rather than classification (a luxury that, as the Neuroskeptic and others have astutely observed, many researchers fail to take advantage of), clinicians generally don’t. When you see a patient come in complain of dsyphoria and mood swings, it’s not particularly useful to say “you seem to be in the 96th percentile for negative affect, and have unusual trouble controlling your mood; let’s study this some more, mmmkay?” What you need is some systematic way of going from symptoms to treatment, and the DSM-IV offers a relatively straightforward (though wildly imperfect) way to do that. And then too, the reality is that most clinicians (at least, the ones I’ve talked to) don’t just rely on some algorithmic scheme for picking out drugs; they instead rely on a mix of professional guidelines, implicit theories, and (occasionally) scientific literature when making decisions about what types of symptom constellations have, in their experience, benefited more or less from specific drugs. The problem is that those decisions often fail to achieve their intended goal, and so you end up with a process of trial-and-error, where most patients might try half a dozen medications before they find one that works (if they’re lucky). But that only takes us back to why it’s actually a good thing that we have so many more medications in 2009 than 1959, even if they’re not necessary individually more effective. So, yes, psychiatric research has some major failings compared to other areas of biomedical research–though I do think that’s partly (though certainly not entirely) because the problems are harder. But I don’t think it’s fair to suggest we haven’t made any solid advances in the treatment or understanding of psychiatric disorders in the last half-century. We have; it’s just that we could do much better.