fourteen questions about selection bias, circularity, nonindependence, etc.

A new paper published online this week in the Journal of Cerebral Blood Flow & Metabolism discusses the infamous problem of circular analysis in fMRI research. The paper is aptly titled “Everything you never wanted to know about circular analysis, but were afraid to ask,” and is authored by several well-known biostatisticians and cognitive neuroscientists–to wit, Niko Kriegeskorte, Martin Lindquist, Tom Nichols, Russ Poldrack, and Ed Vul. The paper has an interesting format, and one that I really like: it’s set up as a series of fourteen questions related to circular analysis, and each author answers each question in 100 words or less.

I won’t bother going over the gist of the paper, because the Neuroskeptic already beat me to the punch in an excellent post a couple of days ago (actually, that’s how I found out about the paper); instead, I’ll just give my own answers to the same set of questions raised in the paper. And since blog posts don’t have the same length constraints as NPG journals, I’m going to be characteristically long-winded and ignore the 100 word limit…

(1) Is circular analysis a problem in systems and cognitive neuroscience?

Yes, it’s a huge problem. That said, I think the term ‘circular’ is somewhat misleading here, because it has the connotation that an analysis is completely vacuous. Truly circular analyses–i.e., those where an initial analysis is performed, and the researchers then conduct a “follow-up” analysis that literally adds no new information–are relatively rare in fMRI research. Much more common are cases where there’s some dependency between two different analyses, but the second one still adds some novel information.

(2) How widespread are slight distortions and serious errors caused by circularity in the neuroscience literature?

I think Nichols sums it up nicely here:

TN: False positives due to circularity are minimal; biased estimates of effect size are common. False positives due to brushing off the multiple testing problem (e.g., ‘P<0.001 uncorrected’ and crossing your fingers) remain pervasive.

The only thing I’d add to this is that the bias in effect size estimates is not only common, but, in most cases, is probably very large.
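
To get a sense of just how large, here’s a quick simulation sketch (the numbers are invented purely for illustration, and have nothing to do with the paper): suppose every test in a study reflects a modest true effect of d = 0.3 measured in 20 subjects, and we only pay attention to tests that survive a p < .001 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sub, n_tests = 20, 10000           # hypothetical sample size and number of tests/voxels
true_d = 0.3                         # a modest but real effect, identical everywhere

data = rng.normal(true_d, 1.0, size=(n_tests, n_sub))
t, p = stats.ttest_1samp(data, 0.0, axis=1)
sig = p < 0.001                      # keep only tests that survive the threshold

observed_d = data.mean(axis=1) / data.std(axis=1, ddof=1)
print("true d:                            ", true_d)
print("mean observed d, all tests:        ", round(float(observed_d.mean()), 2))
print("mean observed d, significant only: ", round(float(observed_d[sig].mean()), 2))
```

In this toy setup, the tests that survive selection overestimate the true effect by roughly a factor of three, and that’s in a world where the effect is real everywhere.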

(3) Are circular estimates useful measures of effect size?

Yes and no. They’re less useful than unbiased measures of effect size. But given that the vast majority of effects reported in whole-brain fMRI analyses (and, more generally, analyses in most fields) are likely to be inflated to some extent, the only way to ensure we don’t rely on circular estimates of effect size would be to disregard effect size estimates entirely, which doesn’t seem prudent.

(4) Should circular estimates of effect size be presented in papers and, if so, how?

Yes, because the only principled alternatives are to either (a) never report effect sizes (which seems much too drastic), or (b) report the results of every single test performed, irrespective of the result (i.e., to never give selection bias an opportunity to rear its head). Neither of these is reasonable. We should generally report effect sizes for all key effects, but they should be accompanied by appropriate confidence intervals. As Lindquist notes:

In general, it may be useful to present any effect size estimate as confidence intervals, so that readers can see for themselves how much uncertainty is related to the point estimate.

A key point I’d add is that the width of the reported CIs should match the threshold used to identify results in the first place. In other words, if you conduct a whole-brain analysis at p < .001, you should report all resulting effects with 99.9% CIs, and not 95% CIs. I think this simple step would go a considerable way towards conveying the true uncertainty surrounding most point estimates in fMRI studies.
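
As a concrete sketch (with made-up numbers), here’s the difference between a conventional 95% interval and a 99.9% interval matched to a p < .001 selection threshold, for a hypothetical set of 20 subject-level contrast estimates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=20)     # hypothetical per-subject contrast estimates

mean, sem = x.mean(), stats.sem(x)
for conf in (0.95, 0.999):            # CI coverage matched to the selection threshold
    half = stats.t.ppf(1 - (1 - conf) / 2, df=x.size - 1) * sem
    print(f"{conf:.1%} CI: [{mean - half:.2f}, {mean + half:.2f}]")
```

The matched interval comes out nearly twice as wide, which is exactly the point: it conveys how weakly the data constrain estimates that were selected at that threshold.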

(5) Are effect size estimates important/useful for neuroscience research, and why?

I think my view here is closest to Ed Vul’s:

Yes, very much so. Null-hypothesis testing is insufficient for most goals of neuroscience because it can only indicate that a brain region is involved to some nonzero degree in some task contrast. This is likely to be true of most combinations of task contrasts and brain regions when measured with sufficient power.

I’d go further than Ed does though, and say that in a sense, effect size estimates are the only things that matter. As Ed notes, there are few if any cases where it’s plausible to suppose that the effect of some manipulation on brain activation is really zero. The brain is a very dense causal system–almost any change in one variable is going to have downstream effects on many, and perhaps most, others. So the real question we care about is almost never “is there or isn’t there an effect,” it’s whether there’s an effect that’s big enough to actually care about. (This problem isn’t specific to fMRI research, of course; it’s been a persistent source of criticism of null hypothesis significance testing for many decades.)

People sometimes try to deflect this concern by saying that they’re not trying to make any claims about how big an effect is, but only about whether or not one can reject the null–i.e., whether any kind of effect is present or not. I’ve never found this argument convincing, because whether or not you own up to it, you’re always making an effect size claim whenever you conduct a hypothesis test. Testing against a null of zero is equivalent to saying that you care about any effect that isn’t exactly zero, which is simply false. No one in fMRI research cares about r or d values of 0.0001, yet we routinely conduct tests whose results could be consistent with those types of effect sizes.

Since we’re always making implicit claims about effect sizes when we conduct hypothesis tests, we may as well make them explicit so that they can be evaluated properly. If you only care about correlations greater than 0.1, there’s no sense in hiding that fact; why not explicitly test against a null range of -0.1 to 0.1, instead of a meaningless null of zero?
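
For example, here’s a minimal sketch (my own illustration, not anything from the paper) of testing a correlation against a null interval of ±0.1 rather than a point null of zero, using the Fisher z approximation and evaluating the test at the nearest boundary of the interval:

```python
import numpy as np
from scipy.stats import norm

def min_effect_test(r, n, r_min=0.1):
    """One-sided p-value for H0: |rho| <= r_min vs H1: |rho| > r_min,
    evaluated at the nearest boundary of the null interval (Fisher z approximation)."""
    z, z0 = np.arctanh(abs(r)), np.arctanh(r_min)
    se = 1 / np.sqrt(n - 3)
    return norm.sf((z - z0) / se)

print(min_effect_test(r=0.25, n=30))    # small sample: can't rule out a trivially small effect
print(min_effect_test(r=0.25, n=500))   # large sample: the effect credibly exceeds 0.1
```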

(6) What is the best way to accurately estimate effect sizes from imaging data?

Use large samples, conduct multivariate analyses, report results comprehensively, use meta-analysis… I don’t think there’s any single way to ensure accurate effect size estimates, but plenty of things help. Maybe the most general recommendation is to ensure adequate power (see below), which will naturally minimize effect size inflation.
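
On the meta-analysis point, the core machinery is simple enough to sketch in a few lines: pool study-level estimates using inverse-variance weights. The effect sizes and variances below are made up purely for illustration.

```python
import numpy as np

# Hypothetical study-level effect size estimates and their sampling variances
effects = np.array([0.42, 0.18, 0.30, 0.51, 0.22])
variances = np.array([0.040, 0.022, 0.031, 0.055, 0.018])

w = 1 / variances                           # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"pooled estimate: {pooled:.2f} +/- {1.96 * pooled_se:.2f}")
```

A fixed-effect model like this is the simplest possible case; a random-effects model that lets true effects vary across studies is usually more appropriate for fMRI.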

(7) What makes data sets independent? Are different sets of subjects required?

Most of the authors think (as I do too) that different sets of subjects are indeed required in order to ensure independence. Here’s Nichols:

Only data sets collected on distinct individuals can be assured to be independent. Splitting an individual’s data (e.g., using run 1 and run 2 to create two data sets) does not yield independence at the group level, as each subject’s true random effect will correlate the data sets.

Put differently, splitting data within subjects only eliminates measurement error, and not sampling error. You could in theory measure activation perfectly reliably (in which case the two halves of subjects’ data would be perfectly correlated) and still have grossly inflated effects, simply because the multivariate distribution of scores in your sample doesn’t accurately reflect the distribution in the population. So, as Nichols points out, you always need new subjects if you want to be absolutely certain your analyses are independent. But since this generally isn’t feasible, I’d argue we should worry less about whether or not our data sets are completely independent, and more about reporting results in a way that makes the presence of any bias as clear as possible.
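
A quick simulation makes this point more concrete than the prose does (all numbers invented for illustration): even when voxels are selected using only half of each subject’s data, estimates from the other half remain inflated, because the subject-level random effects are shared across the two halves; only a fresh sample of subjects removes the bias.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sub, n_vox = 20, 5000
true_mu = np.zeros(n_vox)
true_mu[:500] = 0.3                       # a minority of voxels carry a modest real effect

tau, sigma = 1.0, 1.0                     # between-subject and within-run noise SDs
b = rng.normal(0, tau, (n_sub, n_vox))    # subject random effects, shared across both runs
run1 = true_mu + b + rng.normal(0, sigma, (n_sub, n_vox))
run2 = true_mu + b + rng.normal(0, sigma, (n_sub, n_vox))

# Select voxels with a group t-test on run 1 only
t1 = run1.mean(0) / (run1.std(0, ddof=1) / np.sqrt(n_sub))
sel = t1 > 3.5                            # arbitrary selection threshold

# Compare estimates for the selected voxels
fresh = true_mu + rng.normal(0, tau, (n_sub, n_vox)) + rng.normal(0, sigma, (n_sub, n_vox))
print("true effect in selected voxels:      ", round(float(true_mu[sel].mean()), 2))
print("run-2 estimate (same subjects):      ", round(float(run2[:, sel].mean()), 2))
print("estimate from a new subject sample:  ", round(float(fresh[:, sel].mean()), 2))
```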

(8) What information can one glean from data selected for a certain effect?

I think this is kind of a moot question, since virtually all data are susceptible to some form of selection bias (scientists generally don’t write papers detailing all the analyses they conducted that didn’t pan out!). As I note above, I think it’s a bad idea to disregard effect sizes entirely; they’re actually what we should be focusing most of our attention on. Better to report confidence intervals that accurately reflect the selection procedure and make the uncertainty around the point estimate clear.

(9) Are visualizations of nonindependent data helpful to illustrate the claims of a paper?

Not in cases where there’s an extremely strong dependency between the selection criteria and the effect size estimate. In cases of weak to moderate dependency, visualization is fine so long as confidence bands are plotted alongside the best fit. Again, the key is to always be explicit about the limitations of the analysis and provide some indication of the uncertainty involved.

(10) Should data exploration be discouraged in favor of valid confirmatory analyses?

No. I agree with Poldrack’s sentiment here:

Our understanding of brain function remains incredibly crude, and limiting research to the current set of models and methods would virtually guarantee scientific failure. Exploration of new approaches is thus critical, but the findings must be confirmed using new samples and convergent methods.

(11) Is a confirmatory analysis safer than an exploratory analysis in terms of drawing neuroscientific conclusions?

In principle, sure, but in practice, it’s virtually impossible to determine which reported analyses really started out as confirmatory analyses and which started out as exploratory analyses and then mysteriously evolved into “a priori” predictions once the paper was written. I’m not saying there’s anything wrong with this–everyone reports results strategically to some extent–just that I don’t know that the distinction between confirmatory and exploratory analyses is all that meaningful in practice. Also, as the previous point makes clear, safety isn’t the only criterion we care about; we also want to discover new and unexpected findings, which requires exploration.

(12) What makes a whole-brain mapping analysis valid? What constitutes sufficient adjustment for multiple testing?

From a hypothesis testing standpoint, you need to ensure adequate control of the family-wise error (FWE) rate or false discovery rate (FDR). But as I suggested above, I think this only ensures validity in a limited sense; it doesn’t ensure that the results are actually going to be worth caring about. If you want to feel confident that any effects that survive are meaningfully large, you need to do the extra work up front and define what constitutes a meaningful effect size (and then test against that).
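
Here’s a rough sketch of what that might look like in practice (illustrative data, with a hand-rolled Benjamini-Hochberg step standing in for whatever your analysis package provides): instead of testing each voxel against zero, test it against a minimum interesting effect, and then apply FDR control to the resulting p-values.

```python
import numpy as np
from scipy import stats

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, p.size + 1) / p.size
    reject = np.zeros(p.size, dtype=bool)
    if passed.any():
        reject[order[: passed.nonzero()[0].max() + 1]] = True
    return reject

rng = np.random.default_rng(3)
# Subjects x voxels contrast values, on a scale where 0.2 is a minimally interesting effect
data = rng.normal(0.4, 1.0, size=(24, 2000))

_, p_zero = stats.ttest_1samp(data, popmean=0.0, axis=0, alternative='greater')
_, p_min = stats.ttest_1samp(data, popmean=0.2, axis=0, alternative='greater')   # H0: effect <= 0.2

print("voxels surviving FDR against a null of 0:  ", bh_reject(p_zero).sum())
print("voxels surviving FDR against a null of 0.2:", bh_reject(p_min).sum())
```

With these invented numbers, plenty of voxels clear the conventional zero-null bar, but few if any clear the meaningful-effect bar.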

(13) How much power should a brain-mapping analysis have to be useful?

As much as possible! Concretely, the conventional target of 80% seems like a good place to start. But as I’ve argued before (e.g., here), that would require more than doubling conventional sample sizes in most cases. The reality is that fMRI studies are expensive, so we’re probably stuck with underpowered analyses for the foreseeable future. So we need to find other ways to compensate for that (e.g., relying more heavily on meta-analytic effect size estimates).
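
For a rough sense of the numbers (a back-of-the-envelope normal approximation, not a replacement for a proper power analysis), here’s the sample size you’d need to detect a given standardized effect at a typical voxelwise threshold of p < .001 with 80% power:

```python
from scipy.stats import norm

def required_n(d, alpha=0.001, power=0.80):
    """Normal-approximation sample size for a one-sample (group-level) test of effect size d."""
    z_alpha = norm.ppf(1 - alpha)    # one-tailed voxelwise threshold
    z_power = norm.ppf(power)
    return ((z_alpha + z_power) / d) ** 2

for d in (0.5, 0.8):                 # assumed medium and large standardized effects
    print(f"d = {d}: about {required_n(d):.0f} subjects")
```

Even for a large effect you’d want a couple dozen subjects, and for a medium-sized one you’d need several times the sample sizes most fMRI studies actually run.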

(14) In which circumstances are nonindependent selective analyses acceptable for scientific publication?

It depends on exactly what’s problematic about the analysis. Analyses that are truly circular and provide no new information should never be reported, but those constitute only a small fraction of all analyses. More commonly, the nonindependence simply amounts to selection bias: researchers tend to report only those results that achieve statistical significance, thereby inflating apparent effect sizes. I think the solution to this is to still report all key effect sizes, but to ensure they’re accompanied by confidence intervals and appropriate qualifiers.

Kriegeskorte N, Lindquist MA, Nichols TE, Poldrack RA, & Vul E (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. Journal of Cerebral Blood Flow & Metabolism. PMID: 20571517

estimating bias in text with Ruby

Over the past couple of months, I’ve been working on and off on a collaboration with my good friend Nick Holtzman and some other folks that focuses on ways to automatically extract bias from text using a vector space model. The paper is still in progress, so I won’t give much away here, except to say that Nick’s figured out what I think is a pretty clever way to show that, yes, Fox likes Republicans more than Democrats, and MSNBC likes Democrats more than Republicans. It’s not meant to be a surprising result, but simply a nice validation of the underlying method, which can be flexibly applied to all sorts of interesting questions.

The model we’re using is a simplified variant of Jones and Mewhort’s (2007) BEAGLE model. Essentially, similarity between words is quantified by looking at the degree to which words have similar co-occurrence patterns with other words. This basic idea is actually common to pretty much all vector space models, so in that sense, there’s not much new here (there’s plenty that’s new in Jones and Mewhort (2007), but we’re mostly leaving those features out for the sake of simplicity and computational speed). The novel aspect is the contrast coding of similarity terms in order to produce bias estimates. But you’ll have to wait for the paper to read more about that.
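
To give a flavor of the general approach (a toy sketch of a generic co-occurrence model, not the actual implementation in our tools, and the three “documents” are obviously invented), similarity falls out of comparing rows of a word-by-word co-occurrence matrix:

```python
import numpy as np
from itertools import combinations

docs = [
    "fox praised the republican candidate",
    "msnbc praised the democratic candidate",
    "fox criticized the democratic candidate",
]

vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Word-by-word co-occurrence counts within each document
C = np.zeros((len(vocab), len(vocab)))
for d in docs:
    for w1, w2 in combinations(d.split(), 2):
        C[idx[w1], idx[w2]] += 1
        C[idx[w2], idx[w1]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# Similarity of a source's co-occurrence profile to those of two target terms
print(cosine(C[idx["fox"]], C[idx["republican"]]))
print(cosine(C[idx["fox"]], C[idx["democratic"]]))
```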

In the meantime, one thing we’ve tried to do is develop software that can be used to easily implement the kind of analyses we describe in the paper. With plenty of input from Nick and Mike Jones, I’ve written a set of tools in Ruby that’s now freely available for download here. The tools are actually bundled as a Ruby gem, so installation should be a snap on most platforms. We’re still working on documentation, so there’s no full-blown manual yet, but the quick-start guide should be sufficient to get many users up and running. And for people who share my love of Ruby and are interested in using the tools programmatically, there’s a fairly well-commented RDoc.

The code should really be considered an alpha release at the moment; I’m sure there are plenty of bugs (if you find any, email me!), and the feature set is currently pretty limited. Hopefully it’ll grow over time. I also plan to throw the code up on GitHub at some point in the near future so that anyone who’s interested can help out with the development. In the meantime, if you’re interested in semantic space models and want to play around with a crude (but relatively fast) implementation of one, there’s a (very) small chance you might find these tools useful.

not really a pyramid scheme; maybe a giant cesspool of little white lies?

There’s a long tradition in the academic blogosphere (and the offlinesphere too, I presume) of complaining that academia is a pyramid scheme. In a strict sense, I guess you could liken academia to a pyramid scheme, inasmuch as there are fewer open positions at each ascending level, and supply generally exceeds demand. But as The Prodigal Academic points out in a post today, this phenomenon is hardly exclusive to academia:

I guess I don’t really see much difference between academic job hunting, and job hunting in general. Starting out with undergrad admissions, there are many more qualified people for desirable positions than available slots. Who gets those slots is a matter of hard work (to get qualified) and luck (to be one of the qualified people who is “chosen”). So how is the TT any different from grad school admissions (in ANY prestige program), law firm partnership, company CEO, professional artist/athlete/performer, attending physician, investment banking, etc? The pool of qualified applicants is many times larger than the number of slots, and there are desirable perks to success (money/prestige/fame/security/intellectual freedom) making the supply of those willing to try for the goal pretty much infinite.

Maybe I have rose colored glasses on because I have always been lucky enough to find a position in research, but there are no guarantees in life. When I was interviewing in industry, I saw many really interesting jobs available to science PhD holders that were not in research. If I hadn’t gone to National Lab, I would have been happy to take on one of those instead. Sure, my life would be different, but it wouldn’t make my PhD a waste of time or a failed opportunity.

For the most part, I agree with this sentiment. I love doing research, and can’t imagine ever voluntarily leaving academia. But if I do end up having to leave–meaning, if I can’t find a faculty position when I go on the job market in the next year or two–I don’t think it’ll be the end of the world. I see job ads in industry all the time that look really interesting, and on some level, I think I’d find almost any job that involves creative analysis of very large datasets (which there are plenty of these days!) pretty gratifying. And no matter what happens, I don’t think I’d ever view the time I’ve spent on my PhD and postdoc training as a waste of time, for the simple reason that I’ve really enjoyed most of it (there are, of course, the nasty bits, like writing the Nth chapter of a dissertation–but those are transient, fortunately). So in that sense, I think all the talk about academia being a pyramid scheme is kind of silly.

That said, there is one sticking point to the standard pyramid scheme argument I do agree with, which is that, when you’re starting out as a graduate student, no one really goes out of their way to tell you what the odds of getting a tenure-track faculty position actually are (and they’re not good). The problem being that most of the professors that prospective graduate students have interacted with, either as undergraduates, or in the context of applying to grad school, are precisely those lucky souls who’ve managed to secure faculty positions. So the difficulty of obtaining the same type of position isn’t always very salient to them.

I’m not saying faculty members lie outright to prospective graduate students, of course; I don’t doubt that if you asked most faculty point blank “what proportion of students in your department have managed to find tenure-track positions,” they’d give you an honest answer. But when you’re 22 or 23 years old (and yes, I recognize some graduate students are much older, but this is the mode) and you’re thinking of a career in research, it doesn’t always occur to you to ask that question. And naturally, departments that are trying to recruit your services are unlikely to begin their pitch by saying, “in the past 10 years, only about 12% of our graduates have gone on to tenure-track faculty positions”. So in that sense, I don’t think new graduate students are always aware of just how difficult it is to obtain an independent research position, statistically speaking. That’s not a problem for the (many) graduate students who don’t really have any intention of going into academia anyway, but I do think a large part of the disillusionment graduate students often experience is about the realization that you can bust your ass for five or six years working sixty hours a week, and still have no guarantee of finding a research job when you’re done. And that could be avoided to some extent by making a concerted effort to inform students up front of the odds they face if they’re planning on going down that path. So long as that information is made readily available, I don’t really see a problem.

Having said that, I’m now going to blatantly contradict myself (so what if I do? I am large! I contain multitudes!). You could, I think, reasonably argue that this type of deception isn’t really a problem, and that it’s actually necessary. For one thing, the white lies cut both ways. It isn’t just faculty who conveniently forget to mention that relatively few students will successfully obtain tenure-track positions; many graduate students nod and smile when asked if they’re planning a career in research, despite having no intention of continuing down that path past the PhD. I’ve occasionally heard faculty members complain that they need to do a better job filtering out those applicants who really truly are interested in a career in research, because they’re losing a lot of students to industry at the tail end. But I think this kind of magical mind-reading filter is a pipe dream, for precisely the reasons outlined above: if faculty aren’t willing to begin their recruitment speeches by saying “most of you probably won’t get research positions even if you want them,” they shouldn’t really complain when most students don’t come right out and say “actually, I just want a PhD because I think it’ll be something interesting to do for a few years and then I’ll be able to find a decent job with better hours later”.

The reality is that the whole enterprise may actually require subtle misdirection about people’s intentions. If every student applying to grad school knew exactly what the odds of getting a research position were, I imagine many fewer people who were serious about research would bother applying; you’d then get predominantly people who don’t really want to do research anyway. And if you could magically weed out the students who don’t want to do research, then (a) there probably wouldn’t be enough highly qualified students left to keep research programs afloat, and/or (b) there would be even more candidates applying for research positions, making things even harder for those students who do want careers in research. There’s probably no magical allocation of resources that optimizes everyone’s needs simultaneously; it could be that we’re more or less at a stable equilibrium point built on little white lies.

tl;dr : I don’t think academia is really a pyramid scheme; more like a giant cesspool of little white lies and subtle misinformation that indirectly serves most people’s interests. So, basically, it’s kind of like most other domains of life that involve interactions between many groups of people.

and the runner up is…

This one’s a bit of a head-scratcher. Thomson-Reuters just released its 2009 Journal Citation Report–essentially a comprehensive ranking of scientific journals by their impact factor (IF). The odd part, as reported by Bob Grant in The Scientist, is that the journal with the second-highest IF is Acta Crystallographica – Section A–ahead of heavyweights like the New England Journal of Medicine. For perspective, the same journal had an IF of 2.051 in 2008. The reason for the jump?

A single article published in a 2008 issue of the journal seems to be responsible for the meteoric rise in the Acta Crystallographica – Section A‘s impact factor. “A short history of SHELX,” by University of Göttingen crystallographer George Sheldrick, which reviewed the development of the computer system SHELX, has been cited more than 6,600 times, according to ISI. This paper includes a sentence that essentially instructs readers to cite the paper they’re reading — “This paper could serve as a general literature citation when one or more of the open-source SHELX programs (and the Bruker AXS version SHELXTL) are employed in the course of a crystal-structure determination.” (Note: This may be a good way to boost your citations.)

Setting aside the good career advice (and yes, I’ve made a mental note to include the phrase “this paper could serve as a general literature citation…” in my next paper), it’s perplexing that Thomson-Reuters didn’t downweight Acta Crystallographica‘s IF considerably given the obvious outlier. There’s no question they would have noticed that the second-ranked journal was only there in virtue of one article, so I’m curious what the thought process was. Perhaps the deliberation went something like this:

Thomson-Reuters statistician A: We need to take it out! We can’t have a journal with an impact factor of 2 last year beat out the NEJM!

Thomson-Reuters statistician B: But if we take it out, it’ll look like we tampered with the IF!

TRS-A: But we already tamper with the IF! No one knows how we come up with these numbers! Sometimes we can’t even replicate our own results ourselves! And anyway, it’s really not a big deal if we just leave the article in; scientists know better than to think Acta Crystallographica is the second most influential science journal on the planet. They’ll figure it out.

TRS-B: But that’s like asking them to just disregard our numbers! If you’re supposed to ignore the impact factor in cases where it contradicts your perception of journal quality, what’s the point of having an impact factor at all?

TRS-A: Beats me.

So okay, I’m sure it didn’t go down quite like that. But it’s still pretty weird.

And now, having bitched about how arbitrary the IF is, I’m going to go off and spend the next 15 minutes perusing the psychology and neuroscience journal rankings…

time-on-task effects in fMRI research: why you should care

There’s a ubiquitous problem in experimental psychology studies that use behavioral measures that require participants to make speeded responses. The problem is that, in general, the longer people take to do something, the more likely they are to do it correctly. If I have you do a visual search task and ask you to tell me whether or not a display full of letters contains a red ‘X’, I’m not going to be very impressed that you can give me the right answer if I let you stare at the screen for five minutes before responding. In most experimental situations, the only way we can learn something meaningful about people’s capacity to perform a task is by imposing some restriction on how long people can take to respond. And the problem that then presents is that any changes we observe in the resulting variable we care about (say, the proportion of times you successfully detect the red ‘X’) are going to be confounded with the time people took to respond. Raise the response deadline and performance goes up; shorten it and performance goes down.

This fundamental fact about human performance is commonly referred to as the speed-accuracy tradeoff. The speed-accuracy tradeoff isn’t a law in any sense; it allows for violations, and there certainly are situations in which responding quickly can actually promote accuracy. But as a general rule, when researchers run psychology experiments involving response deadlines, they usually work hard to rule out the speed-accuracy tradeoff as an explanation for any observed results. For instance, if I have a group of adolescents with ADHD do a task requiring inhibitory control, and compare their performance to a group of adolescents without ADHD, I may very well find that the ADHD group performs more poorly, as reflected by lower accuracy rates. But the interpretation of that result depends heavily on whether or not there are also any differences in reaction times (RT). If the ADHD group took about as long on average to respond as the non-ADHD group, it might be reasonable to conclude that the ADHD group suffers a deficit in inhibitory control: they take as long as the control group to do the task, but they still do worse. On the other hand, if the ADHD group responded much faster than the control group on average, the interpretation would become more complicated. For instance, one possibility would be that the accuracy difference reflects differences in motivation rather than capacity per se. That is, maybe the ADHD group just doesn’t care as much about being accurate as about responding quickly. Maybe if you motivated the ADHD group appropriately (e.g., by giving them a task that was intrinsically interesting), you’d find that performance was actually equivalent across groups. Without explicitly considering the role of reaction time–and ideally, controlling for it statistically–the types of inferences you can draw about underlying cognitive processes are somewhat limited.

An important point to note about the speed-accuracy tradeoff is that it isn’t just a tradeoff between speed and accuracy; in principle, any variable that bears some systematic relation to how long people take to respond is going to be confounded with reaction time. In the world of behavioral studies, there aren’t that many other variables we need to worry about. But when we move to the realm of brain imaging, the game changes considerably. Nearly all fMRI studies measure something known as the blood-oxygen-level-dependent (BOLD) signal. I’m not going to bother explaining exactly what the BOLD signal is (there are plenty of other excellent explanations at varying levels of technical detail, e.g., here, here, or here); for present purposes, we can just pretend that the BOLD signal is basically a proxy for the amount of neural activity going on in different parts of the brain (that’s actually a pretty reasonable assumption, as emerging studies continue to demonstrate). In other words, a simplistic but not terribly inaccurate model is that when neurons in region X increase their firing rate, blood flow in region X also increases, and so in turn does the BOLD signal that fMRI scanners detect.

A critical question that naturally arises is just how strong the temporal relation is between the BOLD signal and underlying neuronal processes. From a modeling perspective, what we’d really like is a system that’s completely linear and time-invariant–meaning that if you double the duration of a stimulus presented to the brain, the BOLD response elicited by that stimulus also doubles, and it doesn’t matter when the stimulus is presented (i.e., there aren’t any funny interactions between different phases of the response, or with the responses to other stimuli). As it turns out, the BOLD response isn’t perfectly linear, but it’s pretty close. In a seminal series of studies in the mid-90s, Randy Buckner, Anders Dale and others showed that, at least for stimuli that aren’t presented extremely rapidly (i.e., a minimum of 1 – 2 seconds apart), we can reasonably pretend that the BOLD response sums linearly over time without suffering any serious ill effects. And that’s extremely fortunate, because it makes modeling brain activation with fMRI much easier to do. In fact, the vast majority of fMRI studies, which employ what are known as rapid event-related designs, implicitly assume linearity. If the hemodynamic response wasn’t approximately linear, we would have to throw out a very large chunk of the existing literature–or at least seriously question its conclusions.

Aside from the fact that it lets us model things nicely, the assumption of linearity has another critical, but underappreciated, ramification for the way we do fMRI research. Which is this: if the BOLD response sums approximately linearly over time, it follows that two neural responses that have the same amplitude but differ in duration will produce BOLD responses with different amplitudes. To characterize that visually, here’s a figure from a paper I published with Deanna Barch, Jeremy Gray, Tom Conturo, and Todd Braver last year:

[Figure 1 from Yarkoni et al. (2009): hypothetical neuronal responses differing in firing rate and/or duration (left panels), and the BOLD responses they would produce under linear summation (right panels).]

Each of these panels shows you the firing rates and durations of two hypothetical populations of neurons (on the left), along with the (observable) BOLD response that would result (on the right). Focus your attention on panel C first. What this panel shows you is what, I would argue, most people intuitively think of when they come across a difference in activation between two conditions. When you see time courses that clearly differ in their amplitude, it’s very natural to attribute a similar difference to the underlying neuronal mechanisms, and suppose that there must just be more firing going on in one condition than the other–where ‘more’ is taken to mean something like “firing at a higher rate”.

The problem, though, is that this inference isn’t justified. If you look at panel B, you can see that you get exactly the same pattern of observed differences in the BOLD response even when the amplitude of neuronal activation is identical, simply because there’s a difference in duration. In other words, if someone shows you a plot of two BOLD time courses for different experimental conditions, and one has a higher amplitude than the other, you don’t know whether that’s because there’s more neuronal activation in one condition than the other, or if processing is identical in both conditions but simply lasts longer in one than in the other. (As a technical aside, this equivalence only holds for short trials, when the BOLD response doesn’t have time to saturate. If you’re using longer trials–say 4 seconds or more–then it becomes fairly easy to tell apart changes in duration from changes in amplitude. But the vast majority of fMRI studies use much shorter trials, in which case the problem I describe holds.)
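
If you want to convince yourself of this, a few lines of simulation will do it (using a generic double-gamma HRF and made-up event parameters, not the exact model from the paper): for short events, lengthening a neural response and boosting its amplitude produce nearly indistinguishable increases in the peak of the predicted BOLD response.

```python
import numpy as np
from scipy.stats import gamma

dt = 0.1                                         # seconds
t = np.arange(0, 30, dt)
hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6     # canonical double-gamma HRF (SPM-like)
hrf /= hrf.sum()

def bold(duration, amplitude, total=40.0):
    """Predicted BOLD response to a single event, assuming linear summation."""
    neural = np.zeros(int(total / dt))
    neural[: int(duration / dt)] = amplitude
    return np.convolve(neural, hrf)[: neural.size]

baseline = bold(duration=1.0, amplitude=1.0)
longer = bold(duration=1.5, amplitude=1.0)       # same firing rate, 50% longer duration
stronger = bold(duration=1.0, amplitude=1.5)     # 50% higher firing rate, same duration

print("peak increase from duration alone: ", round(float(longer.max() / baseline.max()), 2))
print("peak increase from amplitude alone:", round(float(stronger.max() / baseline.max()), 2))
```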

Now, functionally, this has some potentially very serious implications for the inferences we can draw about psychological processes based on observed differences in the BOLD response. What we would usually like to conclude when we report “more” activation for condition X than condition Y is that there’s some fundamental difference in the nature of the processes involved in the two conditions that’s reflected at the neuronal level. If it turns out that the reason we see more activation in one condition than the other is simply that people took longer to respond in one condition than in the other, and so were sustaining attention for longer, that can potentially undermine that conclusion.

For instance, if you’re contrasting a feature search condition with a conjunction search condition, you’re quite likely to observe greater activation in regions known to support visual attention. But since a central feature of conjunction search is that it takes longer than a feature search, it could theoretically be that the same general regions support both types of search, and what we’re seeing is purely a time-on-task effect: visual attention regions are activated for longer because it takes longer to complete the conjunction search, but these regions aren’t doing anything fundamentally different in the two conditions (at least at the level we can see with fMRI). So this raises an issue similar to the speed-accuracy tradeoff we started with. Other things being equal, the longer it takes you to respond, the more activation you’ll tend to see in a given region. Unless you explicitly control for differences in reaction time, your ability to draw conclusions about underlying neuronal processes on the basis of observed BOLD differences may be severely hampered.

It turns out that very few fMRI studies actually control for differences in RT. In an elegant 2008 study discussing different ways of modeling time-varying signals, Jack Grinband and colleagues reviewed a random sample of 170 studies and found that, “Although response times were recorded in 82% of event-related studies with a decision component, only 9% actually used this information to construct a regression model for detecting brain activity”. Here’s what that looks like (Panel C), along with some other interesting information about the procedures used in fMRI studies:

[Figure from Grinband et al. (2008): survey of design and modeling procedures in a random sample of 170 fMRI studies; panel C shows how many studies used recorded response times when constructing their regression models.]

So only one in ten studies made any effort to control for RT differences; and Grinband et al. argue in their paper that most of those papers didn’t model RT the right way anyway (personally I’m not sure I agree; I think there are tradeoffs associated with every approach to modeling RT–but that’s a topic for another post).

The relative lack of attention to RT differences is particularly striking when you consider what cognitive neuroscientists do care a lot about: differences in response accuracy. The majority of researchers nowadays make a habit of discarding all trials on which participants made errors. The justification we give for this approach–which is an entirely reasonable one–is that if we analyzed correct and incorrect trials together, we’d be confounding the processes we care about (e.g., differences between conditions) with activation that simply reflects error-related processes. So we drop trials with errors, and that gives us cleaner results.

I suspect that the reasons for our concern with accuracy effects but not RT effects in fMRI research are largely historical. In the mid-90s, when a lot of formative cognitive neuroscience was being done, people (most of them then located in Pittsburgh, working in Jonathan Cohen‘s group) discovered that the brain doesn’t like to make errors. When people make mistakes during task performance, they tend to recognize that fact; on a neural level, frontoparietal regions implicated in goal-directed processing–and particularly the anterior cingulate cortex–ramp up activation substantially. The interpretation of this basic finding has been a source of much contention among cognitive neuroscientists for the past 15 years, and remains a hot area of investigation. For present purposes though, we don’t really care why error-related activation arises; the point is simply that it does arise, and so we do the obvious thing and try to eliminate it as a source of error from our analyses. I suspect we don’t do the same for RT not because we lack principled reasons to, but because there haven’t historically been clear-cut demonstrations of the effects of RT differences on brain activation.

The goal of the 2009 study I mentioned earlier was precisely to try to quantify those effects. The hypothesis my co-authors and I tested was straightforward: if brain activity scales approximately linearly with RT (as standard assumptions would seem to entail), we should see a strong “time-on-task” effect in brain areas that are associated with the general capacity to engage in goal-directed processing. In other words, on trials when people take longer to respond, activation in frontal and parietal regions implicated in goal-directed processing and cognitive control should increase. These regions are often collectively referred to as the “task-positive” network (Fox et al., 2005), in reference to the fact that they tend to show activation increases any time people are engaging in goal-directed processing, irrespective of the precise demands of the task. We figured that identifying a time-on-task effect in the task-positive network would provide a nice demonstration of the relation between RT differences and the BOLD response, since it would underscore the generality of the problem.

Concretely, what we did was take five datasets that were lying around from previous studies, and do a multi-study analysis focusing specifically on RT-related activation. We deliberately selected studies that employed very different tasks, designs, and even scanners, with the aim of ensuring the generalizability of the results. Then, we identified regions in each study in which activation covaried with RT on a trial-by-trial basis. When we put all of the resulting maps together and picked out only those regions that showed an association with RT in all five studies, here’s the map we got:

[Figure 2 from Yarkoni et al. (2009): map of regions showing RT-related activation in all five studies.]

There’s a lot of stuff going on here, but in the interest of keeping this post slightly less excruciatingly long, I’ll stick to the frontal areas. What we found, when we looked at the timecourse of activation in those regions, was the predicted time-on-task effect. Here’s a plot of the timecourses from all five studies for selected regions:

[Figure 4 from Yarkoni et al. (2009): timecourses of RT-related activation in selected regions, shown separately for each of the five studies.]

If you focus on the left time course plot for the medial frontal cortex (labeled R1, in row B), you can see that increases in RT are associated with increased activation in medial frontal cortex in all five studies (the way RT effects are plotted here is not completely intuitive, so you may want to read the paper for a clearer explanation). It’s worth pointing out that while these regions were all defined based on the presence of an RT effect in all five studies, the precise shape of that RT effect wasn’t constrained; in principle, RT could have exerted very different effects across the five studies (e.g., positive in some, negative in others; early in some, later in others; etc.). So the fact that the timecourses look very similar in all five studies isn’t entailed by the analysis, and it’s an independent indicator that there’s something important going on here.

The clear-cut implication of these findings is that a good deal of BOLD activation in most studies can be explained simply as a time-on-task effect. The longer you spend sustaining goal-directed attention to an on-screen stimulus, the more activation you’ll show in frontal regions. It doesn’t much matter what it is that you’re doing; these are ubiquitous effects (since this study, I’ve analyzed many other datasets in the same way, and never fail to find the same basic relationship). And it’s worth keeping in mind that these are just the regions that show common RT-related activation across multiple studies; what you’re not seeing are regions that covary with RT only within one (or for that matter, four) studies. I’d argue that most regions that show involvement in a task are probably going to show variations with RT. After all, that’s just what falls out of the assumption of linearity–an assumption we all depend on in order to do our analyses in the first place.

Exactly what proportion of results can be explained away as time-on-task effects? That’s impossible to determine, unfortunately. I suspect that if you could go back through the entire fMRI literature and magically control for trial-by-trial RT differences in every study, a very large number of published differences between experimental conditions would disappear. That doesn’t mean those findings were wrong or unimportant, I hasten to note; there are many cases in which it’s perfectly appropriate to argue that differences between conditions should reflect a difference in quantity rather than quality. Still, it’s clear that in many cases that isn’t the preferred interpretation, and controlling for RT differences probably would have changed the conclusions. As just one example, much of what we think of as a “conflict” effect in the medial frontal cortex/anterior cingulate could simply reflect prolonged attention on high-conflict trials. When you’re experiencing cognitive difficulty or conflict, you tend to slow down and take longer to respond, which is naturally going to produce BOLD increases that scale with reaction time. The question as to what remains of the putative conflict signal after you control for RT differences is one that hasn’t really been adequately addressed yet.

The practical question, of course, is what we should do about this. How can we minimize the impact of the time-on-task effect on our results, and, in turn, on the conclusions we draw? I think the most general suggestion is to always control for reaction time differences. That’s really the only way to rule out the possibility that any observed differences between conditions simply reflect differences in how long it took people to respond. This leaves aside the question of exactly how one should model out the effect of RT, which is a topic for another time (though I discuss it at length in the paper, and the Grinband paper goes into even more detail). Unfortunately, there isn’t any perfect solution; as with most things, there are tradeoffs inherent in pretty much any choice you make. But my personal feeling is that almost any approach one could take to modeling RT explicitly is a big step in the right direction.
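
As a concrete illustration of the simplest approach (one of several possible parameterizations, and not necessarily the one Grinband and colleagues would favor; every number here is invented), you can add a trial-by-trial, mean-centered RT regressor to the design matrix alongside the ordinary task regressor:

```python
import numpy as np
from scipy.stats import gamma

tr, n_scans, dt = 2.0, 200, 0.1
frame_times = np.arange(n_scans) * tr
hires = np.arange(0, n_scans * tr, dt)
hrf = gamma.pdf(np.arange(0, 30, dt), 6) - gamma.pdf(np.arange(0, 30, dt), 16) / 6

onsets = np.arange(10, 380, 12.0)                               # hypothetical trial onsets (s)
rts = np.random.default_rng(1).uniform(0.6, 1.8, onsets.size)   # hypothetical trial-by-trial RTs (s)

def regressor(onsets, weights):
    """Convolve a weighted stick function with the HRF and resample to the TR grid."""
    stick = np.zeros(hires.size)
    stick[np.round(onsets / dt).astype(int)] = weights
    return np.interp(frame_times, hires, np.convolve(stick, hrf)[: hires.size])

task = regressor(onsets, np.ones(onsets.size))      # constant-amplitude task regressor
rt_mod = regressor(onsets, rts - rts.mean())        # mean-centered trial-by-trial RT modulator

X = np.column_stack([task, rt_mod, np.ones(n_scans)])            # design matrix with intercept
# betas = np.linalg.lstsq(X, voxel_timeseries, rcond=None)[0]    # fit voxelwise (timeseries not defined here)
```

The coefficient on the task regressor then reflects activation at the mean RT, while the RT regressor soaks up trial-by-trial time-on-task variance; comparing this model to one without the RT column is exactly the with-and-without comparison I suggest in the next paragraph.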

A second, and nearly as important, suggestion is to not only control for RT differences, but to do it both ways. Meaning, you should run your model both with and without an RT covariate, and carefully inspect both sets of results. Comparing the results across the two models is what really lets you draw the strongest conclusions about whether activation differences between two conditions reflect a difference of quality or quantity. This point applies regardless of which hypothesis you favor: if you think two conditions draw on very similar neural processes that differ only in degree, your prediction is that controlling for RT should make effects disappear. Conversely, if you think that a difference in activation reflects the recruitment of qualitatively different processes, you’re making the prediction that the difference will remain largely unchanged after controlling for RT. Either way, you gain important information by comparing the two models.

The last suggestion I have to offer is probably obvious, and not very helpful, but for what it’s worth: be cautious about how you interpret differences in activation any time there are sizable differences in task difficulty and/or mean response time. It’s tempting to think that if you always analyze only trials with correct responses and follow the suggestions above to explicitly model RT, you’ve done all you need in order to perfectly control for the various tradeoffs and relationships between speed, accuracy, and cognitive effort. It really would be nice if we could all sleep well knowing that our data have unambiguous interpretations. But the truth is that all of these techniques for “controlling” for confounds like difficulty and reaction time are imperfect, and in some cases have known deficiencies (for instance, it’s not really true that throwing out error trials eliminates all error-related activation from analysis–sometimes when people don’t know the answer, they guess right!). That’s not to say we should stop using the tools we have–which offer an incredibly powerful way to peer inside our gourds–just that we should use them carefully.

Yarkoni T, Barch DM, Gray JR, Conturo TE, & Braver TS (2009). BOLD correlates of trial-by-trial reaction time variability in gray and white matter: a multi-study fMRI analysis. PLoS ONE, 4(1). PMID: 19165335

Grinband J, Wager TD, Lindquist M, Ferrera VP, & Hirsch J (2008). Detection of time-varying signals in event-related fMRI designs. NeuroImage, 43(3), 509-520. PMID: 18775784

elsewhere on the net

Some neat links from the past few weeks:

  • You Are Not So Smart: A celebration of self-delusion. An excellent blog by journalist David McRaney that deconstructs common myths about the way the mind works.
  • NPR has a great story by Jon Hamilton about the famous saga of Einstein’s brain and what it’s helped teach us about brain function. [via Carl Zimmer]
  • The Neuroskeptic has a characteristically excellent 1,000 word explanation of how fMRI works.
  • David Rock has an interesting post on some recent work from Baumeister’s group purportedly showing that it’s good to believe in free will (whether or not it exists). My own feeling about this is that Baumeister’s not really studying people’s philosophical views about free will, but rather a construct closely related to self-efficacy and locus of control. But it’s certainly an interesting line of research.
  • The Prodigal Academic is a great new blog about all things academic. I’ve found it particularly interesting since several of the posts so far have been about job searches and job-seeking–something I’ll be experiencing my fill of over the next few months.
  • Prof-like Substance has a great 5-part series (1, 2, 3, 4, 5) on how blogging helps him as an academic. My own (much less eloquent) thoughts on that are here.
  • Cameron Neylon makes a nice case for the development of social webs for data mining.
  • Speaking of data mining, Michael Driscoll of Dataspora has an interesting pair of posts extolling the virtues of Big Data.
  • And just to balance things out, there’s this article in the New York Times by John Allen Paulos that offers some cautionary words about the challenges of using empirical data to support policy decisions.
  • On a totally science-less note, some nifty drawings (or is that photos?) by Ben Heine (via Crooked Brains):

fMRI, not coming to a courtroom near you so soon after all

That’s a terribly constructed title, I know, but bear with me. A couple of weeks ago I blogged about a courtroom case in Tennessee where the defense was trying to introduce fMRI to the courtroom as a way of proving the defendant’s innocence (his brain, apparently, showed no signs of guilt). The judge’s verdict is now in, and…. fMRI is out. In United States v. Lorne Semrau, Judge Pham recommended that the government’s motion to exclude fMRI scans from consideration be granted. That’s the outcome I think most respectable cognitive neuroscientists were hoping for; as many people associated with the case or interviewed about it have noted (and as the judge recognized), there just isn’t a shred of evidence to suggest that fMRI has any utility as a lie detector in real-world situations.

The judge’s decision, which you can download in PDF form here (hat-tip: Thomas Nadelhoffer), is really quite elegant, and worth reading (or at least skimming through). He even manages some subtle snark in places. For instance (my italics):

Regarding the existence and maintenance of standards, Dr. Laken testified as to the protocols and controlling standards that he uses for his own exams. Because the use of fMRI-based lie detection is still in its early stages of development, standards controlling the real-life application have not yet been established. Without such standards, a court cannot adequately evaluate the reliability of a particular lie detection examination. Cordoba, 194 F.3d at 1061. Assuming, arguendo, that the standards testified to by Dr. Laken could satisfy Daubert, it appears that Dr. Laken violated his own protocols when he re-scanned Dr. Semrau on the AIMS tests SIQs, after Dr. Semrau was found “deceptive” on the first AIMS tests scan. None of the studies cited by Dr. Laken involved the subject taking a second exam after being found to have been deceptive on the first exam. His decision to conduct a third test begs the question whether a fourth scan would have revealed Dr. Semrau to be deceptive again.

The absence of real-life error rates, lack of controlling standards in the industry for real-life exams, and Dr. Laken’s apparent deviation from his own protocols are negative factors in the analysis of whether fMRI-based lie detection is scientifically valid. See Bonds, 12 F.3d at 560.

The reference here is to the fact that Laken and his company scanned Semrau (the defendant) on three separate occasions. The first two scans were planned ahead of time, but the third apparently wasn’t:

From the first scan, which included SIQs relating to defrauding the government, the results showed that Dr. Semrau was “not deceptive.” However, from the second scan, which included SIQs relating to AIMS tests, the results showed that Dr. Semrau was “being deceptive.” According to Dr. Laken, “testing indicates that a positive test result in a person purporting to tell the truth is accurate only 6% of the time.” Dr. Laken also believed that the second scan may have been affected by Dr. Semrau’s fatigue. Based on his findings on the second test, Dr. Laken suggested that Dr. Semrau be administered another fMRI test on the AIMS tests topic, but this time with shorter questions and conducted later in the day to reduce the effects of fatigue. … The third scan was conducted on January 12, 2010 at around 7:00 p.m., and according to Dr. Laken, Dr. Semrau tolerated it well and did not express any fatigue. Dr. Laken reviewed this data on January 18, 2010, and concluded that Dr. Semrau was not deceptive. He further stated that based on his prior studies, “a finding such as this is 100% accurate in determining truthfulness from a truthful person.”

I may very well be misunderstanding something here (and so might the judge), but if the positive predictive value of the test is only 6%, I’m guessing that the probability that the test is seriously miscalibrated is somewhat higher than 6%. Especially since the base rate for lying among people who are accused of committing serious fraud is probably reasonably high (this matters, because when base rates are very low, low positive predictive values are not unexpected). But then, no one really knows how to calibrate these tests properly, because the data you’d need to do that simply don’t exist. Serious validation of fMRI as a tool for lie detection would require assembling a large set of brain scans from defendants accused of various crimes (real crimes, not simulated ones) and using that data to predict whether those defendants were ultimately found guilty or not. There really isn’t any substitute for doing a serious study of that sort, but as far as I know, no one’s done it yet. Fortunately, the few judges who’ve had to rule on the courtroom use of fMRI seem to recognize that.
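
To make the base-rate point explicit, here’s the standard Bayes’ rule arithmetic, with purely illustrative sensitivity and specificity values (nothing here is based on Dr. Laken’s actual figures):

```python
def ppv(sensitivity, specificity, base_rate):
    """Positive predictive value: P(actually lying | test says 'deceptive')."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# Even a reasonably accurate test yields a low PPV when few examinees are lying,
# and a high PPV when most of them are (all values are purely illustrative).
for base_rate in (0.05, 0.5, 0.9):
    print(f"base rate {base_rate:.0%}: PPV = {ppv(0.9, 0.9, base_rate):.2f}")
```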

the perils of digging too deep

Another in a series of posts supposedly at the intersection of fiction and research methods, but mostly just an excuse to write ridiculous stories and pretend they have some sort of moral.


Dr. Rickles the postdoc looked a bit startled when I walked into his office. He was eating a cheese sandwich and watching a chimp on a motorbike on his laptop screen.

“YouTube again?” I asked.

“Yes,” he said. “It’s lunch.”

“It’s 2:30 pm,” I said, pointing to my watch.

“Still my lunch hours.”

Lunch hours for Rickles were anywhere from 11 am to 4 pm. It depended on exactly when you walked in on him doing something he wasn’t supposed to; that was the event that marked the onset of Lunch.

“Fair enough,” I said. “I just stopped by to see how things were going.”

“Oh, quite well,” said Rickles. “Things are going well. I just found a video of a chimp and a squirrel riding a motorbike together. They aren’t even wearing helmets! I’ll send you the link.”

“Please don’t. I don’t like squirrels. But I meant with work. How’s the data looking.”

He shot me a pained look, like I’d just caught him stealing video game money from his grandmother.

“The data are TERRIBLE,” he said in all capital letters.

I wasn’t terribly surprised at the revelation; I’d handed Rickles the dataset only three days prior, taking care not to tell him it was the dataset from hell. Rickles was the fourth or fifth person in the line of succession; the data had been handed down from postdoc to graduate student to postdoc for several years now. Everyone in the lab wanted to take a crack at it when they first heard about it, and no one in the lab wanted anything to do with it once they’d taken a peek. I’d given it to Rickles in part to teach him a lesson; he’d been in the lab for several weeks now and somehow still seemed happy and self-assured.

“Haven’t found anything interesting yet?” I asked. “I thought maybe if you ran the Flimflan test on the A-trax, you might get an effect. Or maybe if you jimmied the cryptos on the Borgatron…”

“No, no,” Rickles interrupted, waving me off. “The problem isn’t that there’s nothing interesting in the data; it’s that there’s too MUCH stuff. There are too MANY results. The story is too COMPLEX.”

That didn’t compute for me, so I just stared at him blankly. No one ever found COMPLEX effects in my lab. We usually stopped once we found SIMPLE effects.

Rickles was unimpressed.

“You follow what I’m saying, Guy? There are TOO-MANY-EFFECTS. There’s too much going on in the data.”

“I don’t see how that’s possible,” I said. “Keith, Maria, and Lakshmi each spent weeks on this data and found nothing.”

“That,” said Rickles, “is because Keith, Maria, and Lakshmi never thought to apply the Epistocene Zulu transform to the data.”

The Epistocene Zulu transform! It made perfect sense when you thought about it; so why hadn’t I ever thought about it? Who was Rickles cribbing analysis notes from?

“Pull up the data,” I said excitedly. “I want to see what you’re talking about.”

“Alright, alright. Lunch hours are over now anyway.”

He grudgingly clicked on the little X on his browser. Then he pulled up a spreadsheet that must have had a million columns in it. I don’t know where they’d all come from; it had only had sixteen thousand or so when I’d had the hard drives delivered to his office.

“Here,” said Rickles, showing me the output of the Pear-sampled Tea test. “There’s the A-trax, and there’s its Nuffton index, and there’s the Zimming Range. Look at that effect. It’s bigger than the zifflon correlation Yehudah’s group reported in Nature last year.”

“Impressive,” I said, trying to look calm and collected. But in my head, I was already trying to figure out how I’d ask the department chair for a raise once this finding was published. Each point on that Zimming Range is worth at least $500, I thought.

“Are there any secondary analyses we could publish alongside that,” I asked.

“Oh, I don’t think you want to publish that,” Rickles laughed.

“Why the hell not? It could be big! You just said yourself it was a giant effect!”

“Oh sure. It’s a big effect. But I don’t believe it for one second.”

“Why not? What’s not to like? This finding makes Yehudah’s paper look like a corn dog!”

I recognized, in the course of uttering those words, that they did not constitute the finest simile ever produced.

“Well, there are two massive outliers, for one. If you eliminate them, the effect is much smaller. And if you take into consideration the Gupta skew because the data were collected with the old reverberator, there’s nothing left at all.”

“Okay, fine,” I muttered. “Is there anything else in the data?”

“Sure, tons of things. Like, for example, there’s a statistically significant gamma reduction.”

“A gamma reduction? Are you sure? Or do you mean beta,” I asked.

“Definitely gamma,” said Rickles. “There’s nothing in the betas, deltas, or thetas. I checked.”

“Okay. That sounds potentially interesting and publishable. But I bet you’re going to tell me why we shouldn’t believe that result, either, right?”

“Well,” said Rickles, looking a bit self-conscious, “it’s just that it’s a pretty fine-grained analysis; you’re not left with very many observations when you slice things up that thin. And the weird thing about the gamma reduction is that it’s tantamount to accepting a null effect; that was Jayaraman’s point in that article in Statistica Splenda last month.”

“Sure, the Gerryman article, right. I read that. Forget the gamma reduction. What else?”

“There are quite a few schweizels,” Rickles offered, twisting the cap off a beer that had appeared out of the minibar under his desk.

I looked at him suspiciously. I suspected it was a trap; Rickles knew how much I loved Schweizel units. But I still couldn’t resist. I had to know.

“How many schweizels are there,” I asked, my hand clutching at the back of a nearby chair to help keep me steady.

“Fourteen,” Rickles said matter-of-factly.

“Fourteen!” I gasped. “That’s a lot of schweizels!”

“It’s not bad,” said Rickles. “But the problem is, if you look at the B-trax, they also have a lot of schweizels. Seventeen of them, actually.”

“Seventeen schweizels!” I exclaimed. “That’s impossible! How can there be so many Schweizel units in one dataset!”

“I’m not sure. But… I can tell you that if you normalize the variables based on the Smith-Gill ratio, the effect goes away completely.”

There it was; the sound of the other shoe dropping. My heart gave a little cough–not unlike the sound your car engine makes in the morning when it’s cold and it wants you to stop provoking it and go back to bed. It was aggravating, but I understood what Rickles was saying. You couldn’t really say much about the Zimming Range unless your schweizel count was properly weighted. Still, I didn’t want to just give up on the schweizels entirely. I’d spent too much of my career delicately massaging schweizels to give up without one last tug.

“Maybe we can just say that the A-trax/Nuffton relationship is non-linear?” I suggested.

“Non-linear?” Rickles snorted. “Only if by non-linear you mean non-real! If it doesn’t survive Smith-Gill, it’s not worth reporting!”

I grudgingly conceded the point.

“What about the zifflons? Have you looked at them at all? It wouldn’t be so novel given Yehudah’s work, but we might still be able to get it into some place like Acta Ziffletica if there was an effect…”

“Tried it. There isn’t really any A-trax influence on zifflons. Or a B-trax effect, for that matter. There is a modest effect if you generate the Mish component for all the trax combined and look only at that. But that’s a lot of trax, and we’re not correcting for multiple Mishing, so I don’t really trust it…”

I saw that point too, and was now nearing despondency. Rickles had shot down all my best ideas one after the other. I wondered how I’d convince the department chair to let me keep my job.

Then it came to me in a near-blinding flash of insight. Near blinding, because I smashed my forehead on the overhead chandelier jumping out of my chair. An inch lower, and I’d have lost both eyes.

“We need to get that chandelier replaced,” I said, clutching my head in my hands. “It has no business hanging around in an office like this.”

“We need to get it replaced,” Rickles agreed. “I’ll do it tomorrow during my lunch hours.”

I knew that meant the chandelier would be there forever–or at least as long as Rickles inhabited the office.

“Have you tried counting the Dunams,” I suggested, rubbing my forehead delicately and getting back to my brilliant idea.

“No,” he said, leaning forward in his chair slightly. “I didn’t count Dunams.”

Ah-hah! I thought to myself. Not so smart are we now! The old boy’s still got some tricks up his sleeve.

“I think you should count the Dunams,” I offered sagely. “That always works for me. I do believe it might shed some light on this problem.”

“Well…” said Rickles, shaking his head slightly, “maaaaaybe. But Li published a paper in Psychometrika last year showing that Dunam counting is just a special case of Klein’s occidental protrusion method. And Klein’s method is more robust to violations of normality. So I used that. But I don’t really know how to interpret the results, because the residual is negative.”

I really had no idea either. I’d never come across a negative Dunam residual, and I’d never even heard of occidental protrusion. As far as I was concerned, it sounded like a made-up method.

“Okay,” I said, sinking back into my chair, ready to give up. “You’re right. This data… I don’t know. I don’t know what it means.”

I should have expected it, really; it was, after all, the dataset from hell. I was pretty sure my old RA had taken a quick jaunt through purgatory every morning before settling into the bench to run some experiments.

“I told you so,” said Rickles, putting his feet up on the desk and handing me a beer I didn’t ask for. “But don’t worry about it too much. I’m sure we’ll figure it out eventually. We probably just haven’t picked the right transformation yet. There’s Nordstrom, El-Kabir, inverse Zulu…”

He turned to his laptop and double-clicked an icon on the desktop that said “YouTube”.

“…or maybe you can just give the data to your new graduate student when she starts in a couple of weeks,” he said as an afterthought.

In the background, a video of a chimp and a puppy driving a Jeep started playing on a discolored laptop screen.

I mulled it over. Should I give the data to Josephine? Well, why not? She couldn’t really do any worse with it, and it would be a good way to break her will quickly.

“That’s not a bad idea, Rickles,” I said. “In fact, I think it might be the best idea you’ve had all week. Boy, that chimp is a really aggressive driver. Don’t drive angry, chimp! You’ll have an accid–ouch, that can’t be good.”
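(A brief aside for anyone who wants the moral spelled out in code: if you try enough tests, transforms, and subgroups, pure noise will eventually cough up a whopping effect, and that effect will promptly deflate once you account for how many places you looked. Below is a minimal, purely illustrative Python sketch of that idea, using only numpy and scipy; the setup, the variable names, and the 200 fake predictors are all my own invention, not anything Rickles actually ran.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure noise: 30 "subjects", 200 candidate predictors, and an outcome
# that has no true relationship with any of them.
n_subjects, n_predictors = 30, 200
X = rng.normal(size=(n_subjects, n_predictors))
y = rng.normal(size=n_subjects)

# Dig through every predictor and keep whichever correlation looks best.
r_values, p_values = [], []
for j in range(n_predictors):
    r, p = stats.pearsonr(X[:, j], y)
    r_values.append(r)
    p_values.append(p)
r_values, p_values = np.array(r_values), np.array(p_values)
best = int(p_values.argmin())

print(f"best |r| found by digging: {abs(r_values[best]):.2f}")
print(f"its uncorrected p-value:   {p_values[best]:.4f}")   # almost always < .05
print(f"Bonferroni-corrected p:    {min(1.0, p_values[best] * n_predictors):.4f}")  # usually > .05
```

On most runs, the best uncorrected p-value clears the usual .05 bar comfortably; after correcting for the 200 looks it took to find it, it rarely does.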
