A new paper published online this week in the Journal of Cerebral Blood Flow & Metabolism discusses the infamous problem of circular analysis in fMRI research. The paper is aptly titled “Everything you never wanted to know about circular analysis, but were afraid to ask,” and is authored by several well-known biostatisticians and cognitive neuroscientists–to wit, Niko Kriegeskorte, Martin Lindquist, Tom Nichols, Russ Poldrack, and Ed Vul. The paper has an interesting format, and one that I really like: it’s set up as a series of fourteen questions related to circular analysis, and each author answers each question in 100 words or less.
I won’t bother going over the gist of the paper, because the Neuroskeptic already beat me to the punch in an excellent post a couple of days ago (actually, that’s how I found out about the paper); instead, I’ll just give my own answers to the same set of questions raised in the paper. And since blog posts don’t have the same length constraints as NPG journals, I’m going to be characteristically long-winded and ignore the 100 word limit…
(1) Is circular analysis a problem in systems and cognitive neuroscience?
Yes, it’s a huge problem. That said, I think the term ‘circular’ is somewhat misleading here, because it carries the connotation that an analysis is completely vacuous. Truly circular analyses–i.e., those where an initial analysis is performed, and the researchers then conduct a “follow-up” analysis that literally adds no new information–are relatively rare in fMRI research. Much more common are cases where there’s some dependency between two different analyses, but the second one still adds some novel information.
(2) How widespread are slight distortions and serious errors caused by circularity in the neuroscience literature?
I think Nichols sums it up nicely here:
TN: False positives due to circularity are minimal; biased estimates of effect size are common. False positives due to brushing off the multiple testing problem (e.g., ‘P<0.001 uncorrected’ and crossing your fingers) remain pervasive.
The only thing I’d add to this is that the bias in effect size estimates is not only common, but, in most cases, is probably very large.
(3) Are circular estimates useful measures of effect size?
Yes and no. They’re less useful than unbiased measures of effect size. But given that the vast majority of effects reported in whole-brain fMRI analyses (and, more generally, analyses in most fields) are likely to be inflated to some extent, the only way to ensure we don’t rely on circular estimates of effect size would be to disregard effect size estimates entirely, which doesn’t seem prudent.
(4) Should circular estimates of effect size be presented in papers and, if so, how?
Yes, because the only principled alternatives are to either (a) never report effect sizes (which seems much too drastic), or (b) report the results of every single test performed, irrespective of the result (i.e., to never give selection bias an opportunity to rear its head). Neither of these is reasonable. We should generally report effect sizes for all key effects, but they should be accompanied by appropriate confidence intervals. As Lindquist notes:
In general, it may be useful to present any effect size estimate as confidence intervals, so that readers can see for themselves how much uncertainty is related to the point estimate.
A key point I’d add is that the width of the reported CIs should match the threshold used to identify results in the first place. In other words, if you conduct a whole-brain analysis at p < .001, you should report all resulting effects with 99.9% CIs, not 95% CIs. I think this simple step would go a considerable way toward conveying the true uncertainty surrounding most point estimates in fMRI studies.
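To make that concrete, here’s a minimal sketch of the arithmetic for a single hypothetical voxel: a correlation of r = .70 in 20 subjects that just clears a p < .001 threshold (both numbers are invented for illustration), with the intervals computed via the standard Fisher z approximation.

```python
import numpy as np
from scipy import stats

r, n = 0.70, 20                        # hypothetical correlation and sample size

# Fisher r-to-z transform; z is approximately normal with SE = 1/sqrt(n - 3).
z = np.arctanh(r)
se = 1.0 / np.sqrt(n - 3)

for level in (0.95, 0.999):            # conventional 95% vs. threshold-matched 99.9%
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    lo, hi = np.tanh(z - crit * se), np.tanh(z + crit * se)
    print(f"{level:.1%} CI for r: [{lo:.2f}, {hi:.2f}]")
```

With these made-up numbers, the 95% interval bottoms out well above r = .3, while the threshold-matched 99.9% interval reaches nearly all the way down to zero, which is a much more honest picture of what the data actually license.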
(5) Are effect size estimates important/useful for neuroscience research, and why?
I think my view here is closest to Ed Vul’s:
Yes, very much so. Null-hypothesis testing is insufficient for most goals of neuroscience because it can only indicate that a brain region is involved to some nonzero degree in some task contrast. This is likely to be true of most combinations of task contrasts and brain regions when measured with sufficient power.
I’d go further than Ed does though, and say that in a sense, effect size estimates are the only things that matter. As Ed notes, there are few if any cases where it’s plausible to suppose that the effect of some manipulation on brain activation is really zero. The brain is a very dense causal system–almost any change in one variable is going to have downstream effects on many, and perhaps most, others. So the real question we care about is almost never “is there or isn’t there an effect,” it’s whether there’s an effect that’s big enough to actually care about. (This problem isn’t specific to fMRI research, of course; it’s been a persistent source of criticism of null hypothesis significance testing for many decades.)
People sometimes try to deflect this concern by saying that they’re not trying to make any claims about how big an effect is, but only about whether or not one can reject the null–i.e., whether any kind of effect is present or not. I’ve never found this argument convincing, because whether or not you own up to it, you’re always making an effect size claim whenever you conduct a hypothesis test. Testing against a null of zero is equivalent to saying that you care about any effect that isn’t exactly zero, which is simply false. No one in fMRI research cares about r or d values of 0.0001, yet we routinely conduct tests whose results could be consistent with those types of effect sizes.
Since we’re always making implicit claims about effect sizes when we conduct hypothesis tests, we may as well make them explicit so that they can be evaluated properly. If you only care about correlations greater than 0.1, there’s no sense in hiding that fact; why not explicitly test against a null range of -0.1 to 0.1, instead of a meaningless null of zero?
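For what it’s worth, this kind of test is easy to carry out. Here’s a rough sketch using a hypothetical correlation of r = .35 in 100 subjects (the numbers are invented) and the Fisher z approximation; the null here is the whole band of trivially small correlations between -.1 and .1, rather than the single point zero.

```python
import numpy as np
from scipy import stats

r_obs, n = 0.35, 100                  # hypothetical observed correlation and sample size
r_trivial = 0.10                      # largest correlation we're willing to call trivial

z_obs, z_triv = np.arctanh(r_obs), np.arctanh(r_trivial)
se = 1.0 / np.sqrt(n - 3)

# One-sided tests against each edge of the null range [-0.1, 0.1]; doubling the
# smaller p-value gives a (conservative) two-sided "minimum-effect" test.
p_above = stats.norm.sf((z_obs - z_triv) / se)    # evidence that rho > 0.1
p_below = stats.norm.cdf((z_obs + z_triv) / se)   # evidence that rho < -0.1
p_range_null = 2 * min(p_above, p_below)

# The conventional test against a point null of exactly zero, for comparison.
p_point_null = 2 * stats.norm.sf(abs(z_obs) / se)

print(f"p against rho = 0:       {p_point_null:.4f}")
print(f"p against |rho| <= 0.1:  {p_range_null:.4f}")
```

With these particular made-up numbers, the point-null p-value comes out around .0003 and looks very impressive, while the p-value against the range null is more like .009, which is a far more honest reflection of the claim we actually care about.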
(6) What is the best way to accurately estimate effect sizes from imaging data?
Use large samples, conduct multivariate analyses, report results comprehensively, use meta-analysis… I don’t think there’s any single way to ensure accurate effect size estimates, but plenty of things help. Maybe the most general recommendation is to ensure adequate power (see below), which will naturally minimize effect size inflation.
(7) What makes data sets independent? Are different sets of subjects required?
Most of the authors think (as I do too) that different sets of subjects are indeed required in order to ensure independence. Here’s Nichols:
Only data sets collected on distinct individuals can be assured to be independent. Splitting an individual’s data (e.g., using run 1 and run 2 to create two data sets) does not yield independence at the group level, as each subject’s true random effect will correlate the data sets.
Put differently, splitting data within subjects only eliminates measurement error, and not sampling error. You could in theory measure activation perfectly reliably (in which case the two halves of subjects’ data would be perfectly correlated) and still have grossly inflated effects, simply because the multivariate distribution of scores in your sample doesn’t accurately reflect the distribution in the population. So, as Nichols points out, you always need new subjects if you want to be absolutely certain your analyses are independent. But since this generally isn’t feasible, I’d argue we should worry less about whether or not our data sets are completely independent, and more about reporting results in a way that makes the presence of any bias as clear as possible.
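If that seems abstract, here’s a toy simulation (every number here is invented) that makes Nichols’ point: the true population effect is zero at every “voxel”, but each subject’s idiosyncratic effect is shared across runs, so voxels selected on run 1 still show a sizable effect in run 2 for the very same subjects, while a genuinely fresh sample of subjects shows essentially nothing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sub, n_vox = 20, 20000
tau, sigma = 1.0, 1.0                 # between-subject SD and within-run noise SD

# True population effect is zero at every voxel; each subject just has an
# idiosyncratic (random) effect that is shared across the two runs.
subject_effects = rng.normal(0, tau, (n_sub, n_vox))
run1 = subject_effects + rng.normal(0, sigma, (n_sub, n_vox))
run2 = subject_effects + rng.normal(0, sigma, (n_sub, n_vox))
fresh = rng.normal(0, tau, (n_sub, n_vox)) + rng.normal(0, sigma, (n_sub, n_vox))

# Select voxels showing a positive group effect in run 1 at p < .01.
t, p = stats.ttest_1samp(run1, 0, axis=0)
sel = (t > 0) & (p < .01)

print("voxels selected:", sel.sum())
print("mean effect in run 1 (selection run):  ", round(run1[:, sel].mean(), 2))
print("mean effect in run 2 (same subjects):  ", round(run2[:, sel].mean(), 2))
print("mean effect in a fresh set of subjects:", round(fresh[:, sel].mean(), 2))
```

Splitting the runs removes the measurement noise, but the selection operated on subjects as much as on voxels, and those same subjects come along for both halves.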
(8) What information can one glean from data selected for a certain effect?
I think this is kind of a moot question, since virtually all data are susceptible to some form of selection bias (scientists generally don’t write papers detailing all the analyses they conducted that didn’t pan out!). As I note above, I think it’s a bad idea to disregard effect sizes entirely; they’re actually what we should be focusing most of our attention on. Better to report confidence intervals that accurately reflect the selection procedure and make the uncertainty around the point estimate clear.
(9) Are visualizations of nonindependent data helpful to illustrate the claims of a paper?
Not in cases where there’s an extremely strong dependency between the selection criteria and the effect size estimate. In cases of weak to moderate dependency, visualization is fine so long as confidence bands are plotted alongside the best fit. Again, the key is to always be explicit about the limitations of the analysis and provide some indication of the uncertainty involved.
(10) Should data exploration be discouraged in favor of valid confirmatory analyses?
No. I agree with Poldrack’s sentiment here:
Our understanding of brain function remains incredibly crude, and limiting research to the current set of models and methods would virtually guarantee scientific failure. Exploration of new approaches is thus critical, but the findings must be confirmed using new samples and convergent methods.
(11) Is a confirmatory analysis safer than an exploratory analysis in terms of drawing neuroscientific conclusions?
In principle, sure, but in practice, it’s virtually impossible to determine which reported analyses really started out as confirmatory analyses and which started out as exploratory analyses and then mysteriously evolved into “a priori” predictions once the paper was written. I’m not saying there’s anything wrong with this–everyone reports results strategically to some extent–just that I don’t know that the distinction between confirmatory and exploratory analyses is all that meaningful in practice. Also, as the previous point makes clear, safety isn’t the only criterion we care about; we also want to discover new and unexpected findings, which requires exploration.
(12) What makes a whole-brain mapping analysis valid? What constitutes sufficient adjustment for multiple testing?
From a hypothesis testing standpoint, you need to ensure adequate control of the family-wise error (FWE) rate or false discovery rate (FDR). But as I suggested above, I think this only ensures validity in a limited sense; it doesn’t ensure that the results are actually going to be worth caring about. If you want to feel confident that any effects that survive are meaningfully large, you need to do the extra work up front and define what constitutes a meaningful effect size (and then test against that).
(13) How much power should a brain-mapping analysis have to be useful?
As much as possible! Concretely, the conventional target of 80% seems like a good place to start. But as I’ve argued before (e.g., here), that would require more than doubling conventional sample sizes in most cases. The reality is that fMRI studies are expensive, so we’re probably stuck with underpowered analyses for the foreseeable future. So we need to find other ways to compensate for that (e.g., relying more heavily on meta-analytic effect size estimates).
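To give a sense of the numbers, here’s a back-of-the-envelope calculation for a simple one-sample t-test at alpha = .05, assuming a hypothetical medium-sized effect of d = 0.5 (none of these numbers come from the paper, and the helper function is just for illustration):

```python
import numpy as np
from scipy import stats

def one_sample_power(d, n, alpha=0.05):
    """Power of a two-tailed one-sample t-test for a standardized effect d."""
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    nc = d * np.sqrt(n)                           # noncentrality parameter
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

d = 0.5                                           # hypothetical "medium" effect
print("power with n = 16:", round(one_sample_power(d, 16), 2))
print("power with n = 20:", round(one_sample_power(d, 20), 2))

# Smallest sample size that reaches the conventional 80% target for this effect.
n = 2
while one_sample_power(d, n) < 0.80:
    n += 1
print("n needed for 80% power:", n)
```

Under these assumptions, a typical sample of 16 to 20 subjects sits at roughly 45-55% power, and hitting 80% requires somewhere around 34 subjects; and that’s before any correction for multiple comparisons, which only makes things worse.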
(14) In which circumstances are nonindependent selective analyses acceptable for scientific publication?
It depends on exactly what’s problematic about the analysis. Analyses that are truly circular and provide no new information should never be reported, but those constitute only a small fraction of all analyses. More commonly, the nonindependence simply amounts to selection bias: researchers tend to report only those results that achieve statistical significance, thereby inflating apparent effect sizes. I think the solution to this is to still report all key effect sizes, but to ensure they’re accompanied by confidence intervals and appropriate qualifiers.
Kriegeskorte N, Lindquist MA, Nichols TE, Poldrack RA, & Vul E (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of Cerebral Blood Flow & Metabolism. PMID: 20571517
I think question (9) is the most important in practical terms, because almost all fMRI papers include graphs or scatterplots that are based on a “circular” analysis (although I agree that “circular” is a bit strong…maybe “curvy”?)
In my experience people do this for entirely innocent reasons, because they want more information than just where the effect happens; they want to see “what’s driving it,” i.e., is it activation in condition A or deactivation in condition B? Or, in the case of correlation analyses, they want to see if it’s driven by outliers.
People often just don’t realize that these graphs present inflated measures of the effect size. What people should do in such cases is extract the data averaged across an anatomical ROI, and in fact this is often what people actually do, because sometimes it’s easier; in FSL, for example, it’s very easy because you can use featquery (I think it’s harder in SPM). The problem is when people don’t tell you what they did: it is often unclear, from the paper, whether a given graph is based on an anatomical ROI, the activation blob, or, worst of all, the peak voxel.
I really think that what we need (and not just to avoid this problem) is a consensus statement on fMRI data presentation. But that’s a long story.
I think using anatomical ROIs is generally a good idea, but the obvious problem there is that it presumes you know where to look ahead of time, which we generally don’t. If you do a whole-brain analysis and identify activation in the hippocampus that shows some effect, it’s no good to select a hippocampus ROI post-hoc; you’re still going to have a substantial bias (on average). That said, I agree it’s often hard to tell what people did, and better reporting is key. The closest thing I know of to a consensus statement on data presentation is Poldrack et al (2008)–“Guidelines for reporting an fMRI study”–which is a really nice paper, but I’m not sure how much influence it’s actually had in changing the way people report findings.
I don’t think using an anatomical ROI post-hoc does no good – it will tend to reduce the bias. Indeed it might be overly conservative (say, the true activation is very strong but only in the CA1 subfield of the hippocampus, and your ROI is the whole hippocampus). Or it might only reduce it a little.
But what annoys me is when people say “activation in the hippocampus predicted X” when what they mean is “our blob, which takes up about 10% of the hippocampus, predicted X”. In which case, unless they want to use the latter phrasing, they should use an anatomical ROI.
I don’t really see the point in using an anatomical ROI post-hoc… seems like the worst of both worlds. To the degree that the anatomical mask includes the voxels identified by the functional analysis, you’re still going to get bias. And to the degree that it includes other voxels, there isn’t really any justification for selecting those voxels in the first place, so it’s unclear why you’d want to average them together. And note that the bigger problem is that you’re now likely to end up making claims you’re not really entitled to. To take your example, if I detect activation in CA1 functionally, it would be a mistake for me to say “ah-hah! I think this is really a subset of the hippocampus, so let me test the whole hippocampus!” That would be precisely the kind of circular analysis that does give rise to false positives, and not just effect size inflation.
To see that, consider that you could have picked any other voxels to include in your “anatomical” mask. For instance, suppose I said “well, I know CA1 is interconnected with the precuneus, so I’m going to use a mask that includes CA1 and anatomically-defined sections of the precuneus”. Well, it wouldn’t be surprising if you got a statistically significant result for that entire mask; after all, you’ve just averaged a whole bunch of voxels known to show the effect you’re looking for with a pile of others you know nothing about! Even if the actual effect in every non-CA1 voxel was exactly 0, you’d still have a high probability of rejecting the null. So I think it’s actually a very bad idea to use post-hoc anatomical ROIs.
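Here’s a quick simulation of that point, with all of the numbers made up: give 10% of the voxels in a hypothetical “anatomical” mask a genuine effect, set the true effect in the other 90% to exactly zero, and the group test on the mask average rejects the null essentially every time. (For simplicity I’ve given the selected voxels a real effect and treated voxel noise as independent; circular selection and spatially correlated noise would change the numbers, but not the logic.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sub, n_sims = 20, 2000
n_effect_vox, n_null_vox = 20, 180    # only 10% of the mask carries any effect

rejections = 0
for _ in range(n_sims):
    # Voxels "known to show the effect"...
    effect_vox = rng.normal(0.8, 1.0, (n_sub, n_effect_vox))
    # ...averaged together with voxels whose true effect is exactly zero.
    null_vox = rng.normal(0.0, 1.0, (n_sub, n_null_vox))
    roi_mean = np.hstack([effect_vox, null_vox]).mean(axis=1)   # per-subject mask average
    rejections += stats.ttest_1samp(roi_mean, 0).pvalue < .05

print("rejection rate for the combined mask:", rejections / n_sims)
```

A significant result for the whole mask tells you exactly nothing about the 90% of voxels that were added post-hoc.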
Tal, I agree with your point that the use of an anatomical ROI post-hoc (i.e., hippocampus) is as circular as anything else. Are whole-brain analyses themselves the problem? I think some methods of multiple comparison correction are extremely problematic. For the life of me, I don’t understand how FDR works at the cluster level, except for “we used this function in SPM”. In a lot of papers I see ridiculously small clusters that survive correction, which makes me wonder.
Hi Yigal,
I think whole-brain analyses are definitely problematic in many ways, but I don’t know that there’s any better alternative. As I discussed here, it can be very misleading to report just ROI analyses. Not to mention that we’re just not that smart much of the time, and often, activation arises in areas we wouldn’t have expected it to. I think maybe a better solution though is to rely more heavily on multivariate analyses (e.g., ICA) that help reduce the number of tested variables.
On the FDR thing, I really have no idea how the topological FDR function in SPM works, since I don’t use SPM. There was a NeuroImage paper by Chumbley et al. explaining it a few months ago, but I haven’t read it. One thing I do think is an underappreciated problem is that giving people freedom to choose the spatial extent threshold (and I don’t know if SPM does this, but I suspect it does) is generally going to inflate the false positive rate, because people will try various combinations (e.g., p < .001 with 80 voxels, p < .0001 with 40 voxels, etc.) until they get one they like, which is just another way of capitalizing on chance...
Maybe this is naive but I don’t understand why all data can’t be provided?
“b) report the results of every single test performed, irrespective of the result (i.e., to never give selection bias an opportunity to rear its head). Neither of these is reasonable. ”
I assume this stuff is all recorded, so what is the big problem providing it? Is there some kind of space issue? I do computer science… not experimental science … the idea you can’t provide another few hundred or thousand test results seems strange to me. Is it because everyone is focussed on creating nice articles that will fit in a magazine..?
Hi confused,
This is an excellent question with a complex answer. In theory, there isn’t any reason why people couldn’t share their data–either the processed results, or the raw data itself. But there are some technical and sociological challenges. The technical challenge is that fMRI datasets tend to be enormous (we’re not talking thousands, but billions of data points per study, with tens of gigabytes per subject), and processing streams are not uniform across the field (i.e., everyone has a slightly different way of processing their data, because there are many choices to be made, and people hold different opinions). So it’s actually not so straightforward to develop repositories that can handle that volume of data in a way that’s easily accessible and easily comprehensible to investigators across the entire field. That said, this is something that people have been working towards for a long time, and there are some major efforts underway.
The sociological barriers are probably greater. The reality is that people don’t always like to share their data. Part of this is a concern about being “scooped” and part of it is a fear of losing control of one’s data–other people can use it for unintended purposes, find flaws, etc. Most of these are not really good reasons to avoid data sharing, but they are human foibles and we have to work around them rather than through them. The key here is to incentivize people in a way that makes it clear that it’s in their best interest to share their data rather than hoard it. Again, there are efforts being made, but so far there’s no large-scale success story.
The final issue is that providing the data wouldn’t actually be all that useful in and of itself; because there’s so much data, and its format and specifics vary drastically across studies and laboratories, another challenge is to develop tools that can synthesize all the results in a meaningful way. I’ve actually been working on this problem a lot myself lately, and hope to publish some work on that soon, but it’s some ways off.
So, to sum up, it’s just one of those things that everyone agrees would be great in principle, and is working towards slowly, but in practice turns out to be very difficult to do.
This is a great summary, thanks. I actually think some of your comments point, at least implicitly, to a conceptual problem with much fMRI research that may be even more pervasive and more serious than the statistical issues currently being emphasized, and that is confirmation bias. And this is related to #10 and #11: I would actually argue there is too much confirmatory research in fMRI. By that I mean that the predictions being “tested” in many fMRI studies are so vague, underspecified, and superficial that it is impossible to tell whether any outcome could in principle have been found that would have disconfirmed the prediction. There is tremendous pressure in cognitive neuroscience to use fMRI less for exploratory brain-mapping and more for theoretically driven confirmatory studies, but doing this effectively requires well-specified, falsifiable predictions about fMRI data that we just don’t have. I think in many cases the interpretive problems arise less from statistical misapplications and more from people doing work that is actually exploratory in nature (because the predictions are underspecified) but attempting to draw confirmatory conclusions from the results, e.g., assuming, without argument, that any other subset of regions (or voxels within regions) wouldn’t provide just as strong confirmation as what they actually got. FDR and ROI analysis done with an eye on avoiding circularity can guard against this somewhat, but only if the conceptual work has been done beforehand to specify clearly how the hypotheses do and do not map onto possible outcomes. Otherwise, the study is exploratory, and we should be able to admit that without shame.