the neuroinformatics of Neopets

In the process of writing a short piece for the APS Observer, I was fiddling around with Google Correlate earlier this evening. It’s a very neat toy, but if you think neuroimaging or genetics have a big multiple comparisons problem, playing with Google Correlate for a few minutes will put things in perspective. Here’s a line graph displaying the search term most strongly correlated (over time) with searches for “neuroinformatics”:

That’s right, the search term that covaries most strongly with “neuroinformatics” is none other than “Illinois film office” (which, to be fair, has a pretty appealing website). Other top matches include “wma support”, “sim codes”, “bed-in-a-bag”, “neopets secret”, “neopets guild”, and “neopets secret avatars”.

I may not have learned much about neuroinformatics from this exercise, but I did get a pretty good sense of how neuroinformaticians like to spend their free time…

 

p.s. I was pretty surprised to find that normalized search volume for just about every informatics-related term has fallen sharply in the last 10 years. I went in expecting the opposite! Maybe all the informaticians were early search adopters, and the rest of the world caught up? No, probably not. Anyway, enough of this; Neopia is calling me!

p.p.s. Seriously though, this is why data fishing expeditions are dangerous. Any one of these correlations is significant at p-less-than-point-whatever-you-like. And if your publication record depended on it, you could probably tell yourself a convincing story about why neuroinformaticians need to look up Garmin eMaps…

Attention publishers: the data in your tables want to be free! Free!

The Neurosynth database is getting an upgrade over the next couple of weeks; it’s going to go from 4,393 neuroimaging studies to around 5,800. Unfortunately, updating the database is kind of a pain, because academic publishers like to change the format of their full-text HTML articles, which has a nasty habit of breaking the publisher-specific HTML parsers I’ve written. When you expect ScienceDirect to give you <table cellspacing=10>, but you get <table> with no cellspacing attribute (the horror!), bad things happen in XPath land. And then those bad things need to be repaired. And I hate repairing stuff! So I don’t do it very often. Like, once every 6 to 9 months.

In an ideal world, there would be no need to write (and fix) custom filters for different publishers, because the publishers would all simultaneously make XML representations of their articles available (in addition to HTML, PDF, etc.), and then people who have legitimate data mining reasons for regularly downloading hundreds of articles at a time wouldn’t have to cry themselves to sleep every night. But as it stands, only one major publisher of neuroimaging articles (PLoS) provides XML versions of all articles. A minority of articles from other publishers are available in XML from BioMed Central, but that’s still just a fraction of the existing literature.

Anyway, the HTML thing is annoying, but it’s possible to work around it. What’s much more problematic is that some publishers lock up the data in the tables of their articles. To make Neurosynth work, I have to be able to identify rows in tables that look like brain activations. That is, things that look roughly like this:

Most publishers are nice enough to format article tables as HTML tables; which is to say, I can look for tags like <table> and then work down the XPath tree to identify all the the rows, and then scan each rows for values that look activation-like. Then those values go into the database, and poof, next thing you know, you have meta-analytic brain activation maps from hundreds of studies. But some publishers–most notably, Frontiers–throw a wrench in the works by failing to format tables in HTML; instead, they present the tables as images (see for instance this JPEG table, pulled from this article). Which means I can’t really extract any data from them, and as a result, you’re not going to see activations from articles published in Frontiers journals in Neurosynth any time soon. So if you publish fMRI articles in Frontiers in Human Neuroscience regularly, and are wondering why I’ve been ignoring you (I like you! I promise!), now you know.

Anyway, on the remote chance that anyone reading this has any sway with people high up at Frontiers, could you please ask them to release their data? Pretty please? Lack of access to data in tables seems to be a pretty common complaint in the data mining community; I’ve talked to other people in the neuroinformatics world who’ve also expressed frustration about it, and I imagine the same is true of people in other disciplines. It’s particularly surprising given that Frontiers is, in theory, an open access publisher. I can see the data in your tables, Frontiers; why won’t you also let me read it?

Okay, I know this kind of stuff doesn’t really interest anyone; I’m just venting. The main point is, Neurosynth is going to be bigger and (very slightly) better in the near future.

in which Discover Card decides that my wife is also my daughter

Ever since I opted out of receiving preapproved credit card offers, I’ve stopped getting credit card spam in the mail (yay!). But companies I have an existing relationship with still have the right to send me various offers and updates, and there’s nothing I can do about that (except throw said offers in the trash after inspecting them and deciding that, no, I do not want to purchase the premium yacht travel insurance policy that comes with a bonus free set of matching lawn gnomes and a voucher for a buy-one-get-one-free meal at the Olive Garden). Discover Card is one of these companies, and the clever devils regularly take advantage of my amicable nature by sending me all kinds of wonderful offers. Take for instance the one I received yesterday, which starts like this:

Dear Tal,

You’ve worked for years to provide a better life for your children and prepare them for a successful future. Now that they’re in college, the overwhelming cost of higher education shouldn’t stand in the way of their success. We’re ready to help.

This is undoubtedly a very generous offer, but it comes at an inconvenient time for me, because, as it so happens, I don’t have any children right now–let alone college-aged children who need their father to front them some money. Somewhere, somehow, it seems Discover Card took a left turn at Albuquerque, when all along they were trying to get to Pismo Beach:

http://www.youtube.com/watch?v=v-s-_ME8Qns#t=1m24s

Of course, this isn’t a case of human error; I very much doubt that an overworked analyst is putting in long nights at Discover combing through random customers’ accounts looking for purchases diagnostic of college attendance (you know, like Ritalin receipts). The blame almost certainly rests with an over-inclusive algorithm that combed through my purchase history and automagically decided that I fit the profile of a middle-aged man who’s worked hard for years to provide a better life for his children. (I suppose I can take solace in the fact that while Discover probably knows what brand of toothpaste I like, it must not know my age, given that there aren’t many 31-year-old men with college-aged children.)

Anyway, I spent some time pondering what purchases I’ve made that could have tripped up Discover’s parental alarm system. And after scanning several months of statements, I’m proud to report it almost certainly has something to do with the giant monthly rent charge from “CU Residence Halls” (my wife and I live in on-campus housing). Either that or the many book-and-coffee-related charges from places with names like “University of Colorado Bookstore” and “Pretentious Coffeehouse on CU Campus”.

So that’s easy enough, right? It’s the on-campus purchases, stupid! Ah, but wait! That’s only one part of the mystery! The other, perhaps more interesting, part is this: who exactly does Discover think my college-aged child is, seeing as they clearly think I’m not the one caffeinating myself at the altar of higher education? Well, after thinking about that for a while, another clear answer emerges: it’s my wife! Discover thinks I have a college-aged daughter who also happens to be my wife! There’s no other explanation; to my knowledge, I don’t live with anyone else besides my wife (though, admittedly, I don’t check the storage closet very often).

Now, setting aside the fact that such a thing would be illegal in all fifty states, my wife and I are not very amused by this. We’re mildly amused, but we’re not very amused. But we’re refraining from making too big a fuss about it, because we’re still hoping we can get our hands on some of those sweet, sweet college loans.

In the interim, here are some questions I find myself pondering:

  • Who writes the logic that does this kind of thing? I’m not asking for names; no need to rat out your best friend who works in Discover’s data mining department. I’m just curious to know what kind of background the people who come up with these things have. Artificial intelligence? Marketing research? Dental surgery?
  • How sophisticated are the rules used to screen customers for these mailings? Is there some serious business logic operating behind the scenes that happened to go wrong here, or is a well-meaning Discover employee just running SQL queries like “SELECT name, address FROM members WHERE description LIKE ‘%residence hall%'” on their lunch break?
  • Do credit card companies that do this kind of thing (which I imagine is pretty much all of them) actually validate their logic against test datasets (in this case, a large group of Discover members whose parental status has been independently verified), or do they just pick some criteria that seem to make sense and immediately start blanketing the United States with flyers?
  • What proportion of false positives is considered reasonable? Clearly, with any kind of program like this, some small number of customers is almost invariably going to get a letter that makes some very bad lifestyle assumptions. At what point does the risk of a backlash start to outweigh the potential for increased revenue? Obviously, the vast majority of people are probably going to chalk this type of thing down to a harmless error, but I imagine some small proportion of people are going to get upset and call up Discover to rant and rave about how they don’t have any children at all, and how dare Discover mine their records like this, and doesn’t Discover have any respect for them as loyal long-standing cardholders, and what’s that, why yes, of course, they’d be quite happy to accept Discover’s apology for this tragic error if it came with a two-for-one gift certificate to the Olive Garden.
  • Most importantly: is it considered fraud if I knowingly fill out an application for student loans in my lovely wife-daughter’s name?