time-on-task effects in fMRI research: why you should care

There’s a ubiquitous problem in experimental psychology studies that use behavioral measures that require participants to make speeded responses. The problem is that, in general, the longer people take to do something, the more likely they are to do it correctly. If I have you do a visual search task and ask you to tell me whether or not a display full of letters contains a red ‘X’, I’m not going to be very impressed that you can give me the right answer if I let you stare at the screen for five minutes before responding. In most experimental situations, the only way we can learn something meaningful about people’s capacity to perform a task is by imposing some restriction on how long people can take to respond. And the problem that then presents is that any changes we observe in the resulting variable we care about (say, the proportion of times you successfully detect the red ‘X’) are going to be confounded with the time people took to respond. Raise the response deadline and performance goes up; shorten it and performance goes down.

This fundamental fact about human performance is commonly referred to as the speed-accuracy tradeoff. The speed-accuracy tradeoff isn’t a law in any sense; it allows for violations, and there certainly are situations in which responding quickly can actually promote accuracy. But as a general rule, when researchers run psychology experiments involving response deadlines, they usually work hard to rule out the speed-accuracy tradeoff as an explanation for any observed results. For instance, if I have a group of adolescents with ADHD do a task requiring inhibitory control, and compare their performance to a group of adolescents without ADHD, I may very well find that the ADHD group performs more poorly, as reflected by lower accuracy rates. But the interpretation of that result depends heavily on whether or not there are also any differences in reaction times (RT). If the ADHD group took about as long on average to respond as the non-ADHD group, it might be reasonable to conclude that the ADHD group suffers a deficit in inhibitory control: they take as long as the control group to do the task, but they still do worse. On the other hand, if the ADHD group responded much faster than the control group on average, the interpretation would become more complicated. For instance, one possibility would be that the accuracy difference reflects differences in motivation rather than capacity per se. That is, maybe the ADHD group just doesn’t care as much about being accurate as about responding quickly. Maybe if you motivated the ADHD group appropriately (e.g., by giving them a task that was intrinsically interesting), you’d find that performance was actually equivalent across groups. Without explicitly considering the role of reaction time–and ideally, controlling for it statistically–the types of inferences you can draw about underlying cognitive processes are somewhat limited.

An important point to note about the speed-accuracy tradeoff is that it isn’t just a tradeoff between speed and accuracy; in principle, any variable that bears some systematic relation to how long people take to respond is going to be confounded with reaction time. In the world of behavioral studies, there aren’t that many other variables we need to worry about. But when we move to the realm of brain imaging, the game changes considerably. Nearly all fMRI studies measure something known as the blood-oxygen-level-dependent (BOLD) signal. I’m not going to bother explaining exactly what the BOLD signal is (plenty of excellent explanations exist elsewhere, at varying levels of technical detail); for present purposes, we can just pretend that the BOLD signal is basically a proxy for the amount of neural activity going on in different parts of the brain (that’s actually a pretty reasonable assumption, as emerging studies continue to demonstrate). In other words, a simplistic but not terribly inaccurate model is that when neurons in region X increase their firing rate, blood flow in region X also increases, and so in turn does the BOLD signal that fMRI scanners detect.

A critical question that naturally arises is just how strong the temporal relation is between the BOLD signal and underlying neuronal processes. From a modeling perspective, what we’d really like is a system that’s completely linear and time-invariant–meaning that if you double the duration of a stimulus presented to the brain, the BOLD response elicited by that stimulus also doubles, and it doesn’t matter when the stimulus is presented (i.e., there aren’t any funny interactions between different phases of the response, or with the responses to other stimuli). As it turns out, the BOLD response isn’t perfectly linear, but it’s pretty close. In a seminal series of studies in the mid-90s, Randy Buckner, Anders Dale and others showed that, at least for stimuli that aren’t presented extremely rapidly (i.e., a minimum of 1–2 seconds apart), we can reasonably pretend that the BOLD response sums linearly over time without suffering any serious ill effects. And that’s extremely fortunate, because it makes modeling brain activation with fMRI much easier to do. In fact, the vast majority of fMRI studies, which employ what are known as rapid event-related designs, implicitly assume linearity. If the hemodynamic response weren’t approximately linear, we would have to throw out a very large chunk of the existing literature–or at least seriously question its conclusions.
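To make the linearity assumption concrete, here is a minimal sketch in Python of what "linear and time-invariant" means for a simulated BOLD signal. Everything in it is an assumption made for illustration: the double-gamma shape of `hrf`, the 0.1 s time step, and the helper name `bold` are not taken from any of the studies mentioned above.

```python
import math
import numpy as np

DT = 0.1  # simulation time step in seconds (assumed)

def hrf(t):
    """Illustrative double-gamma HRF: peak near 5 s, undershoot near 15 s."""
    t = np.asarray(t, dtype=float)
    peak = t**5 * np.exp(-t) / math.gamma(6)
    undershoot = t**15 * np.exp(-t) / math.gamma(16)
    return peak - undershoot / 6.0

t = np.arange(0.0, 40.0, DT)
h = hrf(t)

def bold(neural):
    """Predicted BOLD signal: the neural time course convolved with the HRF."""
    return np.convolve(neural, h)[: len(neural)] * DT

# Two brief (1 s) bursts of neural activity, starting at 2 s and at 6 s.
a = np.zeros_like(t)
a[20:30] = 1.0
b = np.zeros_like(t)
b[60:70] = 1.0

# Linearity: responses to separate events add, and scale with the input.
additive = np.allclose(bold(a + b), bold(a) + bold(b))
scales = np.allclose(bold(2 * a), 2 * bold(a))

# Time-invariance: delaying the input by 4 s just delays the output by 4 s.
shift = 40
time_invariant = np.allclose(bold(np.roll(a, shift))[shift:], bold(a)[:-shift])

print(additive, scales, time_invariant)
```

All three checks hold by construction here, because convolution is a linear, time-invariant operation. The empirical question is how closely the real hemodynamic response behaves like this simulated one.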

Aside from the fact that it lets us model things nicely, the assumption of linearity has another critical, but underappreciated, ramification for the way we do fMRI research. Which is this: if the BOLD response sums approximately linearly over time, it follows that two neural responses that have the same amplitude but differ in duration will produce BOLD responses with different amplitudes. To characterize that visually, here’s a figure from a paper I published with Deanna Barch, Jeremy Gray, Tom Conturo, and Todd Braver last year:


Each of these panels shows you the firing rates and durations of two hypothetical populations of neurons (on the left), along with the (observable) BOLD response that would result (on the right). Focus your attention on panel C first. What this panel shows you is what, I would argue, most people intuitively think of when they come across a difference in activation between two conditions. When you see time courses that clearly differ in their amplitude, it’s very natural to attribute a similar difference to the underlying neuronal mechanisms, and suppose that there must just be more firing going on in one condition than the other–where ‘more’ is taken to mean something like “firing at a higher rate”.

The problem, though, is that this inference isn’t justified. If you look at panel B, you can see that you get exactly the same pattern of observed differences in the BOLD response even when the amplitude of neuronal activation is identical, simply because there’s a difference in duration. In other words, if someone shows you a plot of two BOLD time courses for different experimental conditions, and one has a higher amplitude than the other, you don’t know whether that’s because there’s more neuronal activation in one condition than the other, or if processing is identical in both conditions but simply lasts longer in one than in the other. (As a technical aside, this equivalence only holds for short trials, when the BOLD response doesn’t have time to saturate. If you’re using longer trials–say 4 seconds or more–then it becomes fairly easy to tell apart changes in duration from changes in amplitude. But the vast majority of fMRI studies use much shorter trials, in which case the problem I describe holds.)
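The duration/amplitude ambiguity is easy to reproduce in a toy simulation. The sketch below assumes a purely linear model and an illustrative double-gamma HRF; the function names (`hrf`, `bold_response`, `shape_correlation`) and the specific durations are hypothetical choices, not values from the paper. It compares "same firing rate, twice as long" against "twice the firing rate, same duration", once for sub-second trials and once for multi-second trials.

```python
import math
import numpy as np

DT = 0.05  # simulation time step in seconds (assumed)

def hrf(t):
    """Illustrative double-gamma HRF: peak near 5 s, undershoot near 15 s."""
    t = np.asarray(t, dtype=float)
    peak = t**5 * np.exp(-t) / math.gamma(6)
    undershoot = t**15 * np.exp(-t) / math.gamma(16)
    return peak - undershoot / 6.0

def bold_response(amplitude, duration, total=40.0):
    """Predicted BOLD time course for a boxcar of neural activity."""
    t = np.arange(0.0, total, DT)
    neural = np.zeros_like(t)
    neural[: int(round(duration / DT))] = amplitude
    return np.convolve(neural, hrf(t))[: len(t)] * DT

def shape_correlation(x, y):
    """Pearson correlation between two time courses."""
    return float(np.corrcoef(x, y)[0, 1])

# Short trials: the two accounts predict nearly the same BOLD response.
baseline = bold_response(1.0, 0.5)
longer = bold_response(1.0, 1.0)    # same firing rate, twice as long
stronger = bold_response(2.0, 0.5)  # twice the firing rate, same duration
short_corr = shape_correlation(longer, stronger)
peak_ratio = longer.max() / stronger.max()

# Long trials: the predicted time courses start to come apart.
long_longer = bold_response(1.0, 8.0)
long_stronger = bold_response(2.0, 4.0)
long_corr = shape_correlation(long_longer, long_stronger)

print("short trials: shape correlation", short_corr, "peak ratio", peak_ratio)
print("long trials:  shape correlation", long_corr)
```

In this idealized, noise-free setting the short-trial responses are almost indistinguishable in both shape and peak height, while the long-trial responses differ in timing and width. With real, noisy data sampled every couple of seconds, the short-trial case is harder still.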

Now, functionally, this has some potentially very serious implications for the inferences we can draw about psychological processes based on observed differences in the BOLD response. What we would usually like to conclude when we report “more” activation for condition X than condition Y is that there’s some fundamental difference in the nature of the processes involved in the two conditions that’s reflected at the neuronal level. If it turns out that the reason we see more activation in one condition than the other is simply that people took longer to respond in one condition than in the other, and so were sustaining attention for longer, that can potentially undermine that conclusion.

For instance, if you’re contrasting a feature search condition with a conjunction search condition, you’re quite likely to observe greater activation in regions known to support visual attention. But since a central feature of conjunction search is that it takes longer than a feature search, it could theoretically be that the same general regions support both types of search, and what we’re seeing is purely a time-on-task effect: visual attention regions are activated for longer because it takes longer to complete the conjunction search, but these regions aren’t doing anything fundamentally different in the two conditions (at least at the level we can see with fMRI). So this raises an issue similar to the speed-accuracy tradeoff we started with. Other things being equal, the longer it takes you to respond, the more activation you’ll tend to see in a given region. Unless you explicitly control for differences in reaction time, your ability to draw conclusions about underlying neuronal processes on the basis of observed BOLD differences may be severely hampered.

It turns out that very few fMRI studies actually control for differences in RT. In an elegant 2008 study discussing different ways of modeling time-varying signals, Jack Grinband and colleagues reviewed a random sample of 170 studies and found that, “Although response times were recorded in 82% of event-related studies with a decision component, only 9% actually used this information to construct a regression model for detecting brain activity”. Here’s what that looks like (Panel C), along with some other interesting information about the procedures used in fMRI studies:

So only one in ten studies made any effort to control for RT differences; and Grinband et al. argue in their paper that most of those papers didn’t model RT the right way anyway (personally I’m not sure I agree; I think there are tradeoffs associated with every approach to modeling RT–but that’s a topic for another post).

The relative lack of attention to RT differences is particularly striking when you consider what cognitive neuroscientists do care a lot about: differences in response accuracy. The majority of researchers nowadays make a habit of discarding all trials on which participants made errors. The justification we give for this approach–which is an entirely reasonable one–is that if we analyzed correct and incorrect trials together, we’d be confounding the processes we care about (e.g., differences between conditions) with activation that simply reflects error-related processes. So we drop trials with errors, and that gives us cleaner results.

I suspect that the reasons for our concern with accuracy effects but not RT effects in fMRI research are largely historical. In the mid-90s, when a lot of formative cognitive neuroscience was being done, people (most of them then located in Pittsburgh, working in Jonathan Cohen’s group) discovered that the brain doesn’t like to make errors. When people make mistakes during task performance, they tend to recognize that fact; on a neural level, frontoparietal regions implicated in goal-directed processing–and particularly the anterior cingulate cortex–ramp up activation substantially. The interpretation of this basic finding has been a source of much contention among cognitive neuroscientists for the past 15 years, and remains a hot area of investigation. For present purposes though, we don’t really care why error-related activation arises; the point is simply that it does arise, and so we do the obvious thing and try to eliminate it as a source of error from our analyses. I suspect we don’t do the same for RT not because we lack principled reasons to, but because there haven’t historically been clear-cut demonstrations of the effects of RT differences on brain activation.

The goal of the 2009 study I mentioned earlier was precisely to try to quantify those effects. The hypothesis my co-authors and I tested was straightforward: if brain activity scales approximately linearly with RT (as standard assumptions would seem to entail), we should see a strong “time-on-task” effect in brain areas that are associated with the general capacity to engage in goal-directed processing. In other words, on trials when people take longer to respond, activation in frontal and parietal regions implicated in goal-directed processing and cognitive control should increase. These regions are often collectively referred to as the “task-positive” network (Fox et al., 2005), in reference to the fact that they tend to show activation increases any time people are engaging in goal-directed processing, irrespective of the precise demands of the task. We figured that identifying a time-on-task effect in the task-positive network would provide a nice demonstration of the relation between RT differences and the BOLD response, since it would underscore the generality of the problem.

Concretely, what we did was take five datasets that were lying around from previous studies, and do a multi-study analysis focusing specifically on RT-related activation. We deliberately selected studies that employed very different tasks, designs, and even scanners, with the aim of ensuring the generalizability of the results. Then, we identified regions in each study in which activation covaried with RT on a trial-by-trial basis. When we put all of the resulting maps together and picked out only those regions that showed an association with RT in all five studies, here’s the map we got:


There’s a lot of stuff going on here, but in the interest of keeping this post slightly less excruciatingly long, I’ll stick to the frontal areas. What we found, when we looked at the timecourse of activation in those regions, was the predicted time-on-task effect. Here’s a plot of the timecourses from all five studies for selected regions:


If you focus on the left time course plot for the medial frontal cortex (labeled R1, in row B), you can see that increases in RT are associated with increased activation in medial frontal cortex in all five studies (the way RT effects are plotted here is not completely intuitive, so you may want to read the paper for a clearer explanation). It’s worth pointing out that while these regions were all defined based on the presence of an RT effect in all five studies, the precise shape of that RT effect wasn’t constrained; in principle, RT could have exerted very different effects across the five studies (e.g., positive in some, negative in others; early in some, later in others; etc.). So the fact that the timecourses look very similar in all five studies isn’t entailed by the analysis, and it’s an independent indicator that there’s something important going on here.

The clear-cut implication of these findings is that a good deal of BOLD activation in most studies can be explained simply as a time-on-task effect. The longer you spend sustaining goal-directed attention to an on-screen stimulus, the more activation you’ll show in frontal regions. It doesn’t much matter what it is that you’re doing; these are ubiquitous effects (since this study, I’ve analyzed many other datasets in the same way, and never fail to find the same basic relationship). And it’s worth keeping in mind that these are just the regions that show common RT-related activation across multiple studies; what you’re not seeing are regions that covary with RT only within one (or for that matter, four) studies. I’d argue that most regions that show involvement in a task are probably going to show variations with RT. After all, that’s just what falls out of the assumption of linearity–an assumption we all depend on in order to do our analyses in the first place.

Exactly what proportion of results can be explained away as time-on-task effects? That’s impossible to determine, unfortunately. I suspect that if you could go back through the entire fMRI literature and magically control for trial-by-trial RT differences in every study, a very large number of published differences between experimental conditions would disappear. That doesn’t mean those findings were wrong or unimportant, I hasten to note; there are many cases in which it’s perfectly appropriate to argue that differences between conditions should reflect a difference in quantity rather than quality. Still, it’s clear that in many cases that isn’t the preferred interpretation, and controlling for RT differences probably would have changed the conclusions. As just one example, much of what we think of as a “conflict” effect in the medial frontal cortex/anterior cingulate could simply reflect prolonged attention on high-conflict trials. When you’re experiencing cognitive difficulty or conflict, you tend to slow down and take longer to respond, which is naturally going to produce BOLD increases that scale with reaction time. The question as to what remains of the putative conflict signal after you control for RT differences is one that hasn’t really been adequately addressed yet.

The practical question, of course, is what we should do about this. How can we minimize the impact of the time-on-task effect on our results, and, in turn, on the conclusions we draw? I think the most general suggestion is to always control for reaction time differences. That’s really the only way to rule out the possibility that any observed differences between conditions simply reflect differences in how long it took people to respond. This leaves aside the question of exactly how one should model out the effect of RT, which is a topic for another time (though I discuss it at length in the paper, and the Grinband paper goes into even more detail). Unfortunately, there isn’t any perfect solution; as with most things, there are tradeoffs inherent in pretty much any choice you make. But my personal feeling is that almost any approach one could take to modeling RT explicitly is a big step in the right direction.

A second, and nearly as important, suggestion is to not only control for RT differences, but to do it both ways. Meaning, you should run your model both with and without an RT covariate, and carefully inspect both sets of results. Comparing the results across the two models is what really lets you draw the strongest conclusions about whether activation differences between two conditions reflect a difference of quality or quantity. This point applies regardless of which hypothesis you favor: if you think two conditions draw on very similar neural processes that differ only in degree, your prediction is that controlling for RT should make effects disappear. Conversely, if you think that a difference in activation reflects the recruitment of qualitatively different processes, you’re making the prediction that the difference will remain largely unchanged after controlling for RT. Either way, you gain important information by comparing the two models.
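Here is a minimal sketch of the "do it both ways" comparison, in Python. It is deliberately simplified: instead of a full voxelwise time-series model, it regresses simulated single-trial activation estimates on condition, with and without a mean-centered RT covariate. All of the numbers (trial counts, RT distributions, noise level) and the helper name `fit` are hypothetical, and the simulated ground truth is a pure time-on-task effect, so the condition difference should shrink toward zero once RT is in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # trials per condition (hypothetical)

# Hypothetical RTs in seconds: condition B is slower on average than A.
rt = np.concatenate([rng.normal(0.6, 0.15, n), rng.normal(0.9, 0.15, n)])
condition = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = A, 1 = B

# Assumed ground truth: activation depends on RT only (pure time-on-task).
activation = 2.0 * rt + rng.normal(0.0, 0.3, 2 * n)

def fit(columns, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones_like(y)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_without = fit([condition], activation)
beta_with = fit([condition, rt - rt.mean()], activation)

print("condition effect without RT covariate:", beta_without[1])
print("condition effect with RT covariate:   ", beta_with[1])
print("RT slope:                             ", beta_with[2])
```

If the simulated activation also contained a condition effect that did not depend on RT, that effect would survive in the second model. That is the qualitative-difference prediction described above.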

The last suggestion I have to offer is probably obvious, and not very helpful, but for what it’s worth: be cautious about how you interpret differences in activation any time there are sizable differences in task difficulty and/or mean response time. It’s tempting to think that if you always analyze only trials with correct responses and follow the suggestions above to explicitly model RT, you’ve done all you need in order to perfectly control for the various tradeoffs and relationships between speed, accuracy, and cognitive effort. It really would be nice if we could all sleep well knowing that our data have unambiguous interpretations. But the truth is that all of these techniques for “controlling” for confounds like difficulty and reaction time are imperfect, and in some cases have known deficiencies (for instance, it’s not really true that throwing out error trials eliminates all error-related activation from analysis–sometimes when people don’t know the answer, they guess right!). That’s not to say we should stop using the tools we have–which offer an incredibly powerful way to peer inside our gourds–just that we should use them carefully.


Yarkoni T, Barch DM, Gray JR, Conturo TE, & Braver TS (2009). BOLD correlates of trial-by-trial reaction time variability in gray and white matter: a multi-study fMRI analysis. PLoS ONE, 4 (1) PMID: 19165335

Grinband J, Wager TD, Lindquist M, Ferrera VP, & Hirsch J (2008). Detection of time-varying signals in event-related fMRI designs. NeuroImage, 43 (3), 509-20 PMID: 18775784

12 thoughts on “time-on-task effects in fMRI research: why you should care”

  1. Thanks very much for this enlightening post. As a layman interested in neuroscience, I’m always interested in learning about possible pitfalls in the research, and this is a very clear discussion of one of them. I’ll be keeping it in mind when reading about further papers.

  2. This is a wonderfully written post about a complex topic. I think your third suggestion is the most important one. I’ve reviewed a number of papers in which the authors think that because they have, for example, included RT as a covariate, they don’t need to consider issues like task difficulty any more. The bottom line is that we don’t know what “task difficulty” means, or what can be interpreted from longer or shorter RTs. There are many different possible explanations. Thus, exercising caution in the inferences that one tries to draw from between-condition differences is very valuable advice.

  3. Thanks for the comments; glad you enjoyed the post! Coronal, I agree that, in a sense, a general call for caution is always the most important recommendation. The trouble though is that it’s kind of a vague directive that can be difficult to unpack in many situations, so I think it’s nice to be able to make specific recommendations whenever possible.

  4. Your blog is incredible – and this post obviously strikes close to home. I hope that you just wanted to write it, and that you don’t feel like I’ve brushed aside, misunderstood, or not adequately addressed your perspective in the work I’ve done that you’re familiar with. Of course, I don’t even remember how much I told you about what I did (the shitstorm you started continued with Greg for a while…) and I’m probably just being narcissistic here.

    In case you care, we modeled RT along with our best estimates of the covert process occurring even when there is no response; this RT/covert-process estimate was modeled as a regressor with variable duration, and the contrasts of interest were solid both with and without this regressor.

    Anyway, cheers to you and your truly amazing blog.

  5. Hi Chris, thanks for the kind words! No no, it certainly wasn’t directed at you (or anyone else). It’s a post I’ve meant to write for a long time (shameless self-promotion and all that) but just never got around to. Hope you keep up your renewed blogging burst!

  6. Very nice post, and a nice paper. I have a plug and one or two comments.

    We recently published a paper on a dot-motion perceptual decision-making task that examines “difficulty” and RT in tandem. See Kayser et al., Journal of Neurophysiology, 2010. We examined RT effects that occur within a given level of (perceptual decision-making) difficulty.

    A couple comments.

    There is a kind of “cart before the horse” issue when we use the phrase “time on task” or “regress out RT”. Yes, harder tasks take longer to complete than easier ones, but “regressing out” RT doesn’t always strike me as a good solution. (I wonder what Saul Sternberg would think of the analysis practice of “controlling for RT as a nuisance variable”?)

    If one adopts the attitude that time on task effects are trivial and simply require statistical correction, then what do we make of behavioral experiments that use RT as their primary dependent measure? Suppose as a reviewer of such an article I wrote: “this is all very nice. The linear relationship between RT and set size is compelling. But I fear this is a time on task effect. Please rerun the analysis with RT as a nuisance covariate”.

    For neuroimaging studies, I think RT should be modeled — just not as a nuisance variable — i.e. something unimportant to be swept away, controlled for, etc. Rather, I see it as an important source of information that can be used to gain further insight into the neural processes and their temporal dynamics. See our JNP paper for an attempt at this.

    In short, I would agree we need to pay more attention to RT but my emphasis would be more on using it to our advantage, making the most of it, asking what it can tell us, rather than seeing how best to control for it or regress it out.

  7. Hi Brad,

    Thanks for the comment! Your paper’s now on my reading list; looking forward to reading it.

    I completely agree the solution isn’t to just model out RT, and one of the points I tried to emphasize in the paper as well as this post is that ideally we should really do it both ways and directly compare models with and without RT regressors.

    The critical point, I think, is to figure out what prediction one’s hypothesis makes with respect to RT, and then test that. If you think that it’s just fine for the effects you’re interested in to scale with RT (i.e., you think it’s a difference in quantity and not quality), then your prediction is that controlling for RT will eliminate the effect, which is easily tested. Conversely, if you think that the differences you care about should be qualitative and not just quantitative, then you’re predicting they’ll stick around after regressing out RT. I think either prediction is fine as long as you’re explicit about it and do the work to test it.

    That said, it does seem to me that the (vast?) majority of fMRI studies implicitly favor a qualitative-difference hypothesis, and wouldn’t maintain the same interpretation of the results if it turned out that the only difference between two conditions was how long it took participants to respond. So in that sense, I think it’s legitimate for a reviewer to ask if a particular difference in activation might simply reflect a time-on-task effect. And the appropriate response then could be either “yes, absolutely, and that’s still fine, we’re just using RT as another window into the processes we care about”, or “well, you’re right, so we included an RT regressor, and here are the results…”.

    Beyond that, I completely agree–RT effects are of interest unto themselves, and can be used in all sorts of interesting ways to help gain traction on what the brain is doing.

  8. Hi Tal – I totally agree with you and Brad that RT should be modeled as a covariate of interest, not as a throwaway. I have a big problem with the Grinband approach, though: it seems to me that you really want to have a constant-duration regressor and a parametric regressor that models RT. Since the parametric regressor will be orthogonal to the constant duration regressor, adding it to the model should not change the parameter estimates for the constant duration regressor. The Grinband approach (i.e. only including an RT-modulated regressor) will improve fit for things that scale with RT, but will decrease fit for regions that do not scale with RT, and it makes it impossible to tell which is which.

  9. Hi Russ,

    Yeah, I agree that you don’t want to lose the ability to decouple the constant-duration regressor from the RT regressor. I think what Jack would argue though (I had a long email exchange with him about this; perhaps he’ll chime in here) is that it generally makes more sense to model RT differences using a variable-duration model than a variable-amplitude model, because you generally assume that a trial with longer RT reflects a longer duration of neuronal firing, and not necessarily a higher firing rate over the same duration. At short intervals (e.g., < 3 seconds), the two models are virtually indistinguishable (he has a nice figure in his paper showing this), but once you get over that, modeling RT differences by varying the amplitude of the regressor rather than its duration can cause serious misestimation (I think this cuts the other way too, of course; if there really are differences in firing rate, it's the duration-based regressor that'll produce poor estimates). Personally, I think it's really difficult to make a principled determination as to whether a duration-based or amplitude-based model is more appropriate (I suspect RT differences reflect both firing rates and durations most of the time), which is why I usually opt for an FIR approach. I think an FIR approach also takes care of another issue I discuss in the paper, which is that not all RT differences reflect time-on-task effects; many probably reflect attentional lapses--cases where processing initiates later in the trial. Lapses have the effect of pushing activation back in time but generally don't alter the shape of the response (at least in my view--Dan Weissman might disagree), so a parametric regressor that scales with amplitude isn't going to fit them properly. You could account for lapses by including temporal derivatives, but since you generally don't look at those independently either, that just brings up the same point you raise. 
    So ultimately I like plotting the empirically-estimated RT timecourse (modeled parametrically at each time point) and interpreting that--though of course all of the standard FIR caveats apply when it comes to statistical analysis.

  10. Right, I agree that if you are talking about big differences in RT then it makes sense to use duration rather than amplitude modulation – but again you can do that using a parametric modulator (a variable duration regressor that is orthogonalized to the constant-height regressor). The FIR approach sounds reasonable too, though I generally shy away from FIR due to the problems with overfitting.

  11. From a mathematical perspective, this is a little surprising: convolution is an invertible operation, so there should be no ambiguity, in theory. With sufficient precision in the scanner, a sufficiently linear response, and a sufficiently precise HRF, deconvolution should be unambiguous. Which of these shortcomings contributes most to the problem? Is it likely to be overcome in the near future?

    Great post!

  12. Hi A P,

    Yes, in principle there shouldn’t be any ambiguity, but as you suggest, there are practical limitations that make it virtually impossible to dissociate duration from amplitude (at short durations) right now. The major limitation is poor estimation; the acquisition sequences in widespread use typically have sampling rates of ~ 0.5 Hz, and even so, activation estimates tend to be very noisy (at least in comparison to the level of precision you’d need to pull apart the minute differences predicted by duration vs. amplitude effects). Beyond that, non-linearities in the BOLD response also become an issue. For most practical purposes, we can get away with pretending the BOLD response is linear, but that’s only because we’re usually satisfied with drawing crude inferences like “there’s more activation in condition X than Y”. Once you start trying to make very fine-grained discriminations in predicted responses (e.g., see Panel D in the first figure above), I think all bets are off, and frankly, I don’t know how you would ever feel comfortable concluding you know exactly what the shape of the underlying impulse is. So I doubt this problem is going to be solved any time soon–but I hope I’m wrong!
