UPDATE 10/13: a number of commenters left interesting comments below addressing some of the issues raised in this post. I expand on some of them here.
The ADHD-200 Global Competition, announced earlier this year, was designed to encourage researchers to develop better tools for diagnosing mental health disorders on the basis of neuroimaging data:
The competition invited participants to develop diagnostic classification tools for ADHD diagnosis based on functional and structural magnetic resonance imaging (MRI) of the brain. Applying their tools, participants provided diagnostic labels for previously unlabeled datasets. The competition assessed diagnostic accuracy of each submission and invited research papers describing novel, neuroscientific ideas related to ADHD diagnosis. Twenty-one international teams, from a mix of disciplines, including statistics, mathematics, and computer science, submitted diagnostic labels, with some trying their hand at imaging analysis and psychiatric diagnosis for the first time.
Data for the contest came from several research labs around the world, who donated brain scans from participants with ADHD (both inattentive and hyperactive subtypes) as well as healthy controls. The data were made openly available through the International Neuroimaging Data-sharing Initiative, and nicely illustrate the growing movement towards openly sharing large neuroimaging datasets and promoting their use in applied settings. It is, in virtually every respect, a commendable project.
Well, the results of the contest are now in–and they’re quite interesting. The winning team, from Johns Hopkins, came up with a method that performed substantially above chance and showed particularly high specificity (i.e., it made few false diagnoses, though it missed a lot of true ADHD cases). And all but one team performed above chance, demonstrating that the imaging data have at least some utility (though currently not a huge amount) in diagnosing ADHD and ADHD subtype. There are some other interesting results on the page worth checking out.
But here’s hands-down the most entertaining part of the results, culled from the “Interesting Observations” section:
The team from the University of Alberta did not use imaging data for their prediction model. This was not consistent with intent of the competition. Instead they used only age, sex, handedness, and IQ. However, in doing so they obtained the most points, outscoring the team from Johns Hopkins University by 5 points, as well as obtaining the highest prediction accuracy (62.52%).
…or to put it differently, if you want to predict ADHD status using the ADHD-200 data, your best bet is to not really use the ADHD-200 data! At least, not the brain part of it.
I say this with tongue embedded firmly in cheek, of course; the fact that the Alberta team didn’t use the imaging data doesn’t mean imaging data won’t ultimately be useful for diagnosing mental health disorders. It remains quite plausible that ten or twenty years from now, structural or functional MRI scans (or some successor technology) will be the primary modality used to make such diagnoses. And the way we get from here to there is precisely by releasing these kinds of datasets and promoting this type of competition. So on the whole, I think this should actually be seen as a success story for the field of human neuroimaging–especially since virtually all of the teams performed above chance using the imaging data.
That said, there’s no question this result also serves as an important and timely reminder that we’re still in the very early days of brain-based prediction. Right now anyone who claims they can predict complex real-world behaviors better using brain imaging data than using (much cheaper) behavioral data has a lot of ‘splainin to do. And there’s a good chance that they’re trying to sell you something (like, cough, neuromarketing ‘technology’).
Two words: incremental validity. This kind of contest is valuable, but I’d like to see imaging pitted against behavioral data routinely. The fact that they couldn’t beat a prediction model built on such basic information should be humbling to advocates of neurodiagnosis (and shows what a low bar “better than chance” is). The real question is how an imaging-based diagnosis compares to a clinician using standard diagnostic procedures: both “which is better” (which accounts for more variance as a standalone prediction?) and “do they contain non-overlapping information?” (if you put both predictions into a regression, does one or both contribute unique variance?).
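For what it’s worth, the second question is simple to check once you have out-of-sample predictions from each model in hand. A minimal sketch in Python, using simulated stand-ins for the behavioral-only and imaging-only scores (all variable names here are hypothetical, not anything from the actual submissions):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Simulated stand-ins: true diagnoses plus each model's out-of-sample score
diagnosis = rng.integers(0, 2, n)                        # 0 = control, 1 = ADHD
behavioral_pred = 0.8 * diagnosis + rng.normal(0, 1, n)  # behavioral-only model's score
imaging_pred = 0.3 * diagnosis + rng.normal(0, 1, n)     # imaging-only model's score

def fit_logit(*predictors):
    X = sm.add_constant(np.column_stack(predictors))
    return sm.Logit(diagnosis, X).fit(disp=0)

behav_only = fit_logit(behavioral_pred)
both = fit_logit(behavioral_pred, imaging_pred)

# Likelihood-ratio test: does the imaging prediction explain anything
# the behavioral prediction doesn't already capture?
lr_stat = 2 * (both.llf - behav_only.llf)
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"LR = {lr_stat:.2f}, p = {p_value:.3g}")
```

A significant imaging term here would be evidence of non-overlapping information; a non-significant one means the imaging prediction is redundant with the behavioral one.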
Sanjay, yeah, I agree. Although most of these classification/prediction applications now use sophisticated machine learning techniques rather than OLS regression (which usually performs much more poorly). So in theory, as long as the behavioral indicators are included as features alongside the imaging data, the model should use the behavioral and neuroimaging information in optimal (and often non-linear) ways. Unfortunately, the models sometimes have a black box flavor, and figuring out exactly what they’re doing after the fact is not always so easy.
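In practice that usually amounts to concatenating the feature sets and letting a flexible model sort out the weighting. A rough sketch with simulated data (the feature counts and classifier choice are arbitrary illustrations, not any team’s actual pipeline):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects = 200
behavioral = rng.normal(size=(n_subjects, 4))   # age, sex, handedness, IQ (simulated)
imaging = rng.normal(size=(n_subjects, 500))    # e.g. connectivity features (simulated noise here)
diagnosis = (behavioral[:, 3] + rng.normal(size=n_subjects) > 0).astype(int)

# Concatenate behavioral and imaging columns into one feature matrix
X_combined = np.hstack([behavioral, imaging])

clf = GradientBoostingClassifier(random_state=0)
print("behavioral only:     ", cross_val_score(clf, behavioral, diagnosis, cv=5).mean())
print("behavioral + imaging:", cross_val_score(clf, X_combined, diagnosis, cv=5).mean())
```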
Yeah, I probably should have said prediction errors instead of variance explained. (I assume they didn’t do just an OLS regression – actually you couldn’t with a 3-category classification problem). But terminology aside, with only those 4 predictors, how complicated could the prediction model be? (at least in comparison to what you might do with imaging data)
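Presumably not very complicated at all. With four predictors and three diagnostic categories, the entire model could be a multinomial logistic regression along these lines (simulated data; the coding here is purely hypothetical and is not the Alberta team’s actual method):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.uniform(7, 18, n),      # age
    rng.integers(0, 2, n),      # sex
    rng.integers(0, 2, n),      # handedness
    rng.normal(108, 14, n),     # IQ estimate
])
y = rng.integers(0, 3, n)       # 0 = control, 1 = inattentive, 2 = combined type

model = LogisticRegression(max_iter=1000)   # handles multi-class targets directly
print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())
```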
Also, if the 4 variables were available to all of the researchers, it must not even have occurred to the other teams to include them alongside the neuro data in their models. The Alberta team didn’t include the neuro data in their models, which suggests to me either that the imaging data did not make the model better, or that they submitted their 4-predictor behavioral model to make a point. I’d like to know which one it is.
Brilliant! But I bet the Alberta team doesn’t get their paper published in Journal of Neuroscience (or indeed anywhere).
I recently had an email conversation with a researcher doing similar work in autism. What I took from it was that basically “MRI-diagnosis” wasn’t what they were interested in, but it was a useful meme to get their paper in a high-impact journal and themselves lots of press coverage.
My view is that it might ultimately be useful for identifying subtypes of ADHD or autism, but for (extremely broadly defined) behaviourally defined disorders, the chances of there being a “brain signature” that maps onto DSM IV (or DSM any number) criteria is remote to say the least.
Incidentally, were the diagnostic algorithms evaluated based on replication samples or leave-one-out? I’m interested to know whether leave-one-out really is a good indicator of generalisability.
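One way to probe that on existing data is to compare a leave-one-out estimate against performance on a genuinely untouched replication half; if the two diverge, leave-one-out was optimistic. A sketch with simulated data (all names hypothetical):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = rng.integers(0, 2, 120)

# Split into a "discovery" half and an untouched "replication" half
X_disc, X_rep, y_disc, y_rep = train_test_split(X, y, test_size=0.5, random_state=0)

clf = SVC()
loo_acc = cross_val_score(clf, X_disc, y_disc, cv=LeaveOneOut()).mean()

clf.fit(X_disc, y_disc)
rep_acc = clf.score(X_rep, y_rep)
print(f"leave-one-out: {loo_acc:.2f}   replication sample: {rep_acc:.2f}")
```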
What is amazing is that the Alberta team only used age, sex, handedness, and IQ. That suggests to me that any successful imaging-based decoding could have been relying upon correlates of those variables rather than truly decoding a correlate of the disease.
I’m the project manager on the University of Alberta team.
Tal, we met briefly at Human Brain Mapping in Montreal, incidentally.
For the record, we tried a pile of imaging-based approaches. As a control, we also did classification with age, gender, etc. but no imaging data. It was actually very frustrating for us that none of our imaging-based methods did better than the no imaging results. It does raise some very interesting issues.
I second your statement that we’ve only scratched the surface with fMRI-based diagnosis (and prognosis!). There’s a lot of unexplored potential here. For example, the ADHD-200 fMRI scans are resting state scans. I suspect that fMRI using an attention task could work better for diagnosing ADHD. Resting state fMRI has also shown promise for diagnosing other conditions (eg: schizophrenia – see Shen et al.).
I think big, multi-centre datasets like the ADHD-200 are the future though the organizational and political issues with this approach are non-trivial. I’m extremely impressed with the ADHD-200 organizers and collaborators for having the guts and perseverance to put this data out there. I intend to follow their example with the MR data that we collect in future.
I was on the UCLA/Yale team; we came in 4th place with 54.87% accuracy. We did include all the demographic measures in our classifier, along with a boatload of neuroimaging measures. The breakdown of demographics by site in the training data did show some pretty strong differences in ADHD vs. TD. These differences were somewhat site dependent, e.g. girls from OHSU with high IQ are very likely TD. We even considered using only demographics at some point (or at least one of our wise team members did), but I thought that was preposterous. I think we ultimately figured that the imaging data might generalize better to unseen examples, particularly for sites that only had data in the test dataset (Brown). I guess one lesson is to listen to the data and not to fall in love with your method. Not yet, anyway.
You could argue that the fact that demographics was able to predict diagnosis just means the dataset is poorly matched, and it’s therefore cheating to use it.
On the other hand, in the real world, datasets are not matched e.g. we know that ADHD is more common in males in the general population.
So it’s not clear that such data should be matched, but as Russ points out, if they’re not, then any neuroimaging based diagnosis could just be picking up on demographics.
So I would say they should be.
Congrats to the Alberta group! What was the formula (some function of age, sex, handedness, and IQ) that they used for their predictive diagnoses, and by which methods did they arrive at it? I can’t seem to find this info on the ADHD-200 website, or anywhere else. It would be interesting to know! I hope that they publish a paper about this. – Raj
I agree that a functional MRI experiment should be more useful.
But there is one thing I have difficulty understanding: why would one use MRI to make a diagnosis one can make much more cheaply without MRI? I don’t really see the point, except maybe as a weak proof of concept. Or am I missing something here?
Predicting response to (drug) treatment, for instance, seems much more worthwhile.
– guido
The Alberta team’s results draw attention to a major challenge faced by the field in our efforts to generate predictive tools using existent data: group differences in base phenotypic variables that reflect population characteristics. ADHD in clinically referred samples is much more frequently recognized in boys than girls. As such, many studies have different M:F ratios for ADHD and typically developing control (TDC) groups. In the ADHD-200 sample, this was the case (% male in training set: TDC – 53%; ADHD- 79%; in test set: TDC – 48%; ADHD – 71%). Similarly, estimates of performance IQ in ADHD are lower than that of various types of comparisons, reflecting the inefficiency and inconsistency of responding on timed tasks. This difference is typically about 7-10 points. In the ADHD-200 sample, IQ estimates differed between the TDC and ADHD groups (Training Set: 114 for TDC vs. 106 for ADHD, p < 0.001; Test Set: 113 for TDC vs. 103 for ADHD, p < 0.001). These and other subtler baseline differences clearly provided sufficient statistical power in the naturalistic/artificial context of a contest for predictive power.
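For anyone who wants to verify this kind of baseline difference themselves, the checks are simple to run on the released phenotypic data. A sketch using a simulated stand-in for the phenotypic table (the column names ‘dx’, ‘sex’, and ‘iq’ are hypothetical; the figures quoted above come from the ADHD-200 release itself, not from this code):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated stand-in for the phenotypic file (column names hypothetical)
rng = np.random.default_rng(0)
n = 400
pheno = pd.DataFrame({
    "dx": rng.choice(["TDC", "ADHD"], n),
    "sex": rng.choice(["M", "F"], n, p=[0.6, 0.4]),
    "iq": rng.normal(110, 14, n),
})

# Chi-square test on the sex ratio across diagnostic groups
chi2, p_sex, _, _ = stats.chi2_contingency(pd.crosstab(pheno["dx"], pheno["sex"]))

# t-test on IQ estimates between groups
t, p_iq = stats.ttest_ind(pheno.loc[pheno["dx"] == "TDC", "iq"],
                          pheno.loc[pheno["dx"] == "ADHD", "iq"])

print(f"sex ratio: chi2 = {chi2:.1f}, p = {p_sex:.3g}")
print(f"IQ:        t = {t:.1f}, p = {p_iq:.3g}")
```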
In the real world, it would be less likely that factors such as sex and IQ would be sufficient to predict diagnosis from among the myriad possibilities. In this case, with only two options, such clues can be usefully exploited, as the Alberta group beautifully demonstrated. Russ Poldrack has highlighted an even more significant concern – what if the imaging-based approaches are detecting the neural correlates of these phenotypic variables rather than neural correlates of the disorder itself? This was one of my first thoughts too when I saw the Alberta results. When we characterize differences in these populations, factors such as age, sex and IQ can be treated as nuisance covariates to improve the likelihood that we will focus on the underlying brain correlates. For prediction efforts we will need to do the same – take these variables into account as confounding variables, rather than use them as features for prediction. When a lab test is used to look at blood cell counts, there are different norms for men and women, as well as for collection sites. These are treated as confounds, not as predictive factors when engaged in a clinical process. Perhaps more importantly, these results point out the need to create large-scale samples that include multiple clinical populations; these base variables will be less prominent if we have multiple disorders present in a sample, thereby increasing the focus on the pathophysiology.
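One common way to treat such variables as confounds rather than features is to residualize each imaging feature on them (fitting the confound model on the training subjects only) and then classify the residuals. A rough sketch on simulated data; this is illustrative only, not the organizers’ or any team’s pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
confounds = rng.normal(size=(n, 3))                        # age, sex, IQ (simulated)
imaging = rng.normal(size=(n, 500)) + confounds[:, [0]]    # imaging partly driven by a confound
y = rng.integers(0, 2, n)

train, test = np.arange(0, 150), np.arange(150, n)

# Fit the confound model on training subjects only, then residualize both sets
confound_model = LinearRegression().fit(confounds[train], imaging[train])
imaging_resid_train = imaging[train] - confound_model.predict(confounds[train])
imaging_resid_test = imaging[test] - confound_model.predict(confounds[test])

clf = SVC().fit(imaging_resid_train, y[train])
print("accuracy on confound-cleaned features:", clf.score(imaging_resid_test, y[test]))
```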
With respect to why attempt to diagnose a psychiatric illness such as ADHD with an MRI – there are a few considerations. First, we envision imaging having a role in diagnostic clarification, not as a first-line tool. In the scenario of a common viral infection, a doctor takes the history, makes the initial diagnosis and proceeds with a treatment plan. If the symptoms do not resolve, or the presentation is too complex to be confident of what is going on, lab tests are ordered for diagnostic clarification. There are also times when one might order a diagnostic test to guide treatment selection, or obtain objective measures of treatment response. In child psychiatry, information comes from many sources regarding treatment response, and they are often in conflict. So, even with its substantial costs, MRI could one day attain a reasonable level of utility for specific complex cases. Still, our hope is that what is learned from MRI will be translated into more readily available tools, such as EEG or NIRS.
Finally, the primary goal of the competition was not to develop a diagnostic test for ADHD, as the necessary data samples and approaches do not yet exist. Rather, it was to promote an open science model to foster competitive collaboration among members of the imaging community and invite the broader scientific community to join us in confronting the challenges of psychiatric imaging research. In this regard, even this initial effort can already be characterized as a success.
Curse of dimensionality and overfitting: the underpinnings of high-dimensional statistics. In other words, adding variables does not make a prediction task easier, even if they contain information. On the contrary, it makes it harder: finding the relevant information in massive datasets is a “needle in a haystack” problem. As a result, given a dataset where the number of subjects is small compared to the number of variables, you are bound to pick up the wrong ones. This is called overfitting.
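A quick simulation makes the point concrete: with far more variables than subjects, a classifier can fit pure noise perfectly in-sample while remaining at chance out of sample.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_features = 60, 10000
X = rng.normal(size=(n_subjects, n_features))   # pure noise features
y = rng.integers(0, 2, n_subjects)              # labels unrelated to X

clf = SVC(kernel="linear")
clf.fit(X, y)
print("training accuracy:       ", clf.score(X, y))                            # close to 1.0
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())    # around 0.5
```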
Concerning the question “Why would one use MRI to make a diagnosis one can make much cheaper without MRI?”. The aim of MRI diagnosis in my opinion is not to replace behavioural testing, but to get us closer to the underlying mechanisms involved in the disorder. That’s the interest of endophenotypes: to be closer to the biological etiology than the behavioural test…
Russ Poldrack’s comment is quite interesting, and it should still be possible to evaluate as far as signal detection goes. Did the teams that didn’t control for, say, sex see disproportionately more false negatives among females, and more false positives among males?
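That check only requires a team’s predicted labels plus the phenotypic file: split the errors by sex and compare false-negative and false-positive rates. A sketch with simulated stand-ins for the real arrays:

```python
import numpy as np

def error_rates(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.mean(y_pred[y_true == 1] == 0)   # missed ADHD cases
    fp = np.mean(y_pred[y_true == 0] == 1)   # controls labeled ADHD
    return fn, fp

# In practice y_true, y_pred, and sex would come from a submission + phenotypic data
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
y_pred = rng.integers(0, 2, 200)
sex = rng.choice(["F", "M"], 200)

for group in ["F", "M"]:
    fn, fp = error_rates(y_true[sex == group], y_pred[sex == group])
    print(f"{group}: false negatives = {fn:.2f}, false positives = {fp:.2f}")
```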
Beyond it not being the real point of this competition, I would be surprised if beating behavioural diagnosis is even technically possible under the current conventions: behavioural questionnaires are pretty much the general definition of the disorder, stated in the form of a question.
Bravo. Totally agree.