Several people left enlightening comments on my last post about the ADHD-200 Global Competition results, so I thought I'd bump some of them up here and save you the trip back (though the others are worth reading too!), since they speak to several of the issues raised in that post.
Matthew Brown, the project manager on the Alberta team that was disqualified on a minor technicality (they *cough* didn't use any of the imaging data), pointed out that they actually did use the imaging data at first (answering Sanjay's question in another comment); it just didn't work very well:
For the record, we tried a pile of imaging-based approaches. As a control, we also did classification with age, gender, etc. but no imaging data. It was actually very frustrating for us that none of our imaging-based methods did better than the no imaging results. It does raise some very interesting issues.
He also pointed out that the (relatively) poor performance of the imaging-based classifiers isn’t cause for alarm:
I second your statement that we've only scratched the surface with fMRI-based diagnosis (and prognosis!). There's a lot of unexplored potential here. For example, the ADHD-200 fMRI scans are resting state scans. I suspect that fMRI using an attention task could work better for diagnosing ADHD. Resting state fMRI has also shown promise for diagnosing other conditions (e.g., schizophrenia; see Shen et al.).
I think big, multi-centre datasets like the ADHD-200 are the future, though the organizational and political issues with this approach are non-trivial. I'm extremely impressed with the ADHD-200 organizers and collaborators for having the guts and perseverance to put this data out there. I intend to follow their example with the MR data that we collect in future.
I couldn’t agree more!
Jesse Brown explains why his team didn’t use the demographics alone:
I was on the UCLA/Yale team; we came in 4th place with 54.87% accuracy. We did include all the demographic measures in our classifier, along with a boatload of neuroimaging measures. The breakdown of demographics by site in the training data did show some pretty strong differences in ADHD vs. TD. These differences were somewhat site dependent, e.g., girls from OHSU with high IQ are very likely TD. We even considered using only demographics at some point (or at least one of our wise team members did), but I thought that was preposterous. I think we ultimately figured that the imaging data may generalize better to unseen examples, particularly for sites that only had data in the test dataset (Brown). I guess one lesson is to listen to the data and not to fall in love with your method. Not yet anyway.
I imagine some of the other groups had a similar experience of trying the demographic measures alone and finding that they outperformed the imaging data, but sticking with the latter anyway. That seems like a reasonable decision, though ultimately I still think it's a good thing the Alberta team used only the demographic variables, since their results provided an excellent benchmark against which to compare the performance of the imaging-based models. Sanjay Srivastava captured this sentiment nicely:
Two words: incremental validity. This kind of contest is valuable, but I’d like to see imaging pitted against behavioral data routinely. The fact that they couldn’t beat a prediction model built on such basic information should be humbling to advocates of neurodiagnosis (and shows what a low bar “better than chance” is). The real question is how an imaging-based diagnosis compares to a clinician using standard diagnostic procedures. Both “which is better” (which accounts for more variance as a standalone prediction) and “do they contain non-overlapping information” (if you put both predictions into a regression, does one or both contribute unique variance).
And Russ Poldrack raised a related question about what it is that the imaging-based models are actually doing:
What is amazing is that the Alberta team only used age, sex, handedness, and IQ. That suggests to me that any successful imaging-based decoding could have been relying upon correlates of those variables rather than truly decoding a correlate of the disease.
This seems quite plausible, inasmuch as age, sex, and IQ are pretty powerful variables, and there are enormous literatures on their structural and functional correlates. While the imaging data probably add at least some incremental information (and very possibly a lot of it), it's currently unclear just how much, and whether that incremental variance might also be picked up by (different) behavioral variables. Ultimately, time (and more competitions like this one!) will tell.
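To make the incremental-validity idea concrete, here's a minimal sketch in Python of the nested-regression comparison Sanjay describes: put both predictions into a logistic regression and ask, via likelihood-ratio tests, whether each contributes unique variance. The data are simulated and the variable names (`behavioral_pred`, `imaging_pred`) are placeholders of my own; nothing here reflects any team's actual pipeline.

```python
# Incremental-validity sketch: do behavioral and imaging-based predictions
# each carry unique information about diagnosis?
# All data below are simulated; in practice, behavioral_pred and imaging_pred
# would be each model's (cross-validated) predictions for held-out subjects.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 195
diagnosis = rng.integers(0, 2, n)                        # 1 = ADHD, 0 = TD (simulated)
behavioral_pred = 0.8 * diagnosis + rng.normal(0, 1, n)  # toy predictor scores
imaging_pred = 0.5 * diagnosis + rng.normal(0, 1, n)

def fit_logit(*predictors):
    X = sm.add_constant(np.column_stack(predictors))
    return sm.Logit(diagnosis, X).fit(disp=0)

full = fit_logit(behavioral_pred, imaging_pred)
behavioral_only = fit_logit(behavioral_pred)
imaging_only = fit_logit(imaging_pred)

# Likelihood-ratio tests of the nested models: a significant result means the
# added predictor explains variance the other one doesn't.
for label, reduced in [("imaging over behavioral", behavioral_only),
                       ("behavioral over imaging", imaging_only)]:
    lr_stat = 2 * (full.llf - reduced.llf)
    p_value = stats.chi2.sf(lr_stat, df=1)
    print(f"Unique contribution of {label}: LR = {lr_stat:.2f}, p = {p_value:.3f}")
```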
Although it didn’t happen for this ADHD data set, neural measures can indeed produce better predictions than behavioural ones. This study of dyslexia by Hoeft et al. is a nice example: http://www.pnas.org/content/108/1/361.short
As for what the problem might have been with the ADHD data, I think that Matthew Brown’s comment nails it: “There’s a lot of unexplored potential here. For example, the ADHD-200 fMRI scans are resting state scans. I suspect that fMRI using an attention task could work better for diagnosing ADHD.”
Hey Tal, I worked with Jesse and he pointed me to your nice blog. Here’s an interesting point that I haven’t seen brought up yet:
The TD prevalence in the test set was 55%, which means that you could have received 195*0.55 = 107 or 108 points (and beaten half the teams!) just by calling *everybody* TD. Going with the highest-prevalence group in the training set like this is usually called the "no-information rate" in machine learning, and in my opinion it can be a more useful benchmark of "chance" (in terms of the actual positive predictive value of your classifier in a real population) than assuming prior probabilities of ~1/3 per class.
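To make that baseline concrete, here's a small Python sketch of the no-information rate versus a uniform 1/3-per-class notion of "chance". The labels are simulated: the ~55% TD figure comes from the comment above, but the subtype split (30%/15%) is invented purely for illustration.

```python
# No-information-rate sketch: the accuracy you get by always predicting the
# most common class, compared to a naive uniform-prior "chance" level.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
# Simulated labels: ~55% TD; the ADHD subtype proportions are made up.
labels = rng.choice(["TD", "ADHD-combined", "ADHD-inattentive"],
                    size=195, p=[0.55, 0.30, 0.15])

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

no_information_rate = majority_count / len(labels)  # always guess the majority class
uniform_chance = 1 / len(counts)                    # ~1/3 per class

print(f"Majority class: {majority_class} ({majority_count}/{len(labels)} subjects)")
print(f"No-information rate: {no_information_rate:.2%}")
print(f"Uniform-prior 'chance': {uniform_chance:.2%}")
```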
I was the organizer of the Hopkins team. Here are some of our thoughts on the competition:
https://docs.google.com/document/d/1dNV__iwXLLQiM7B7XUgj19tFEzrYcbGjdJYEHG_t-nY/edit
Hello all,
Just wanted to thank all of you for your contributions to this excellent discussion. It has been quite enlightening for an up-and-coming student of neuroscience and machine learning. I’m really impressed with all of your hard work and thought.
One comment I found interesting that wasn't addressed here: Gael Varoquaux noted that there was a high likelihood of overfitting the classification models. This was my intuition as well: given the potentially massive number of variables in a data set like this, the risk of overfitting to the training data could be very high. Could anyone who participated in the competition comment (just generally) on any techniques they used to remove or attenuate the influence of possibly non-predictive parameters?
Hi John,
I led the Hopkins team. We were very concerned about overfitting. Our approach used data splitting of the training data, and we then used only our internal test data to compare between algorithms.
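For readers wondering what that general strategy looks like, here's a rough Python sketch of the kind of internal split described above: develop on one portion of the labeled training data and compare candidate algorithms only on the held-out portion. The features, labels, and models here are placeholders of my own, not the Hopkins team's actual pipeline.

```python
# Sketch: guard against overfitting by splitting the labeled training data
# and comparing candidate algorithms only on an internal held-out portion.
# Features and labels are simulated stand-ins for the real ADHD-200 data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))    # placeholder feature matrix (e.g., demographics + imaging summaries)
y = rng.integers(0, 2, size=400)  # placeholder diagnosis labels

# Internal split: the internal test set is never touched during model development.
X_dev, X_internal, y_dev, y_internal = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

candidates = {
    "logistic regression (L2)": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_dev, y_dev)
    acc = accuracy_score(y_internal, model.predict(X_internal))
    print(f"{name}: internal-test accuracy = {acc:.3f}")

# Only after picking a winner on the internal test set would one refit on all
# labeled data and predict the competition's held-out test set.
```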