specificity statistics for ROI analyses: a simple proposal

The brain is a big place. In the context of fMRI analysis, what that bigness means is that a typical 3D image of the brain might contain anywhere from 50,000 – 200,000 distinct voxels (3D pixels). Any of those voxels could theoretically show meaningful activation in relation to some contrast of interest, so the only way to be sure that you haven’t overlooked potentially interesting activations is to literally test every voxel (or, given some parcellation algorithm, every region).

Unfortunately, the problem that approach raises–which I’ve discussed in more detail here–is the familiar one of multiple comparisons: If you’re going to test 100,000 locations, it’s not really fair to test each one at the conventional level of p < .05, because on average, you’ll get about 5,000 statistically significant results just by chance that way. So you need to do something to correct for the fact that you’re running thousands of tests. The most common approach is to simply make the threshold for significance more conservative–for example, by testing at p < .0001 instead of p < .05, or by using some combination of intensity and cluster extent thresholds (e.g., you look for 20 contiguous voxels that are all significant at, say, p < .001) that’s supposed to guarantee a cluster-wise error rate of .05.

There is, however, a natural tension between false positives and false negatives: When you make your analysis more conservative, you let fewer false positives through the filter, but you also keep more of the true positives out. A lot of fMRI analysis really just boils down to walking a very thin line between running overconservative analyses that can’t detect anything but the most monstrous effects, and running overly liberal analyses that lack any real ability to distinguish meaningful signals from noise. One very common approach that fMRI researchers have adopted in an effort to optimize this balance is to use complementary hypothesis-driven and whole-brain analyses. The idea is that you’re basically carving the brain up into two separate search spaces: One small space for which you have a priori hypotheses that can be tested using a small number of statistical comparisons, and one much larger space (containing everything but the a priori space) where you continue to use a much more conservative threshold.

For example, if I believe that there’s a very specific chunk of right inferotemporal cortex that’s specialized for detecting clown faces, I can focus my hypothesis-testing on that particular region, without having to pretend that all voxels are created equal. So I delineate the boundaries of a CRC (Clown Representation Cortex) region-of-interest (ROI) based on some prior criteria (e.g., anatomy, or CRC activation in previous studies), and then I can run a single test at p < .05 to test my hypothesis, no correction needed. But to ensure that I don’t miss out on potentially important clown-related activation elsewhere in the brain, I also go ahead and run an additional whole-brain analysis that’s fully corrected for multiple comparisons. By coupling these two analyses, I hopefully get the best of both worlds. That is, I combine one approach (the ROI analysis) that maximizes power to test a priori hypotheses at the cost of an inability to detect effects in unexpected places with another approach (the whole-brain analysis) that has a much more limited capacity to detect effects in both expected and unexpected locations.

This two-pronged strategy is generally a pretty successful one, and I’d go so far as to say that a very large minority, if not an outright majority, of fMRI studies currently use it. Used wisely, I think it’s really an invaluable strategy. There is, however, one fairly serious and largely unappreciated problem associated with the incautious application of this approach. It has to do with claims about the specificity of activation that often tend to accompany studies that use a complementary ROI/whole-brain strategy. Specifically, a pretty common pattern is for researchers to (a) confirm their theoretical predictions by successfully detecting activation in one or more a priori ROIs; (b) identify few if any whole-brain activations; and consequently, (c) conclude that not only were the theoretical predictions confirmed, but that the hypothesized effects in the a priori ROIs were spatially selective, because a complementary whole-brain analysis didn’t turn up much (if anything). Or, to put it in less formal terms, not only were we right, we were really right! There isn’t any other part of the brain that shows the effect we hypothesized we’d see in our a priori ROI!

The problem with this type of inference is that there’s usually a massive discrepancy in the level of power available to detect effects in a priori ROIs versus the rest of the brain. If you search at p < .05 within some predetermined space, but at only p < .0001 everywhere else, you’re naturally going to detect results at a much lower rate everywhere else. But that’s not necessarily because there wasn’t just as much to look at everywhere else; it could just be because you didn’t look very carefully. By way of analogy, if you’re out picking berries in the forest, and you decide to spend half your time on just one bush that (from a distance) seemed particularly berry-full, and the other half of your time divided between the other 40 bushes in the area, you’re not really entitled to conclude that you picked the best bush all along simply because you came away with a relatively full basket. Had you done a better job checking out the other bushes, you might well have found some that were even better, and then you’d have come away carrying two baskets full of delicious, sweet, sweet berries.

Now, in an ideal world, we’d solve this problem by simply going around and carefully inspecting all the berry bushes, until we were berry, berry sure really convinced that we’d found all of the best bushes. Unfortunately, we can’t do that, because we’re out here collecting berries on our lunch break, and the boss isn’t paying us to dick around in the woods. Or, to return to fMRI World, we simply can’t carefully inspect every single voxel (say, by testing it at p < .05), because then we’re right back in mega-false-positive-land, which we’ve already established as a totally boring place we want to avoid at all costs.

Since an optimal solution isn’t likely, the next best thing is to figure out what we can do to guard against careless overinterpretation. Here I think there’s actually a very simple, and relatively elegant, solution. What I’ve suggested when I’ve given recent talks on this topic is that we mandate (or at least, encourage) the use of what you could call a specificity statistic (SS). The SS is a very simple measure of how specific a given ROI-level finding is; it’s just the proportion of voxels that are statistically significant when tested at the same level as the ROI-level effects. In most cases, that’s going to be p < .05, so the SS will usually just be the proportion of all voxels anywhere in the brain that are activated at p < .05.

To see why this is useful, consider what could no longer happen: Researchers would no longer be able to (inadvertently) capitalize on the fact that the one or two regions they happened to define as a priori ROIs turned up significant effects when no other regions did in a whole-brain analysis. Suppose that someone reports a finding that negative emotion activates the amygdala in an ROI analysis, but doesn’t activate any other region in a whole-brain analysis. (While I’m pulling this particular example out of a hat here, I feel pretty confident that if you went and did a thorough literature review, you’d find at least three or four studies that have made this exact claim.) This is a case where the SS would come in really handy. Because if the SS is, say, 26% (i.e., about a quarter of all voxels in the brain are active at p < .05, even if none survive full correction for multiple comparisons), you would want to draw a very different conclusion than if it was just 4%. If fully a quarter of the brain were to show greater activation for a negative-minus-neutral emotion contrast, you wouldn’t want to conclude that the amygdala was critically involved in negative emotion; a better interpretation would be that the researchers in question just happened to define an a priori region that fell within the right quarter of the brain. Perhaps all that’s happening is that negative emotion elicits a general increase in attention, and much of the brain (including, but by no means limited to, the amygdala) tends to increase activation correspondingly. So as a reviewer and reader, you’d want to know how specific the reported amygdala activation really is*. But in the vast majority of papers, you currently have no way of telling (and the researchers probably don’t even know the answer themselves!).

The principal beauty of this statistic lies in its simplicity: It’s easy to understand, easy to calculate, and easy to report. Ideally, researchers would report the SS any time ROI analyses are involved, and would do it for every reported contrast. But at minimum, I think we should all encourage each other (and ourselves) to report such a statistic any time we’re making a specificity claim about ROI-based results. In other words,if you want to argue that a particular cognitive function is relatively localized to the ROI(s) you happened to select, you should be required to show that there aren’t that many other voxels (or regions) that show the same effect when tested at the liberal threshold you used for the ROI analysis. There shouldn’t be an excuse for not doing this; it’s a very easy procedure for researchers to implement, and an even easier one for reviewers to demand.

* An alternative measure of specificity would be to report the percentile ranking of all of the voxels within the ROI mask relative to all other individual voxels. In the above example, you’d assign very different interpretations depending on whether the amygdala was in the 32nd or 87th percentile of all voxels, when ordered according to the strength of the effect for the negative – neutral contrast.