I’m moving to Austin!

The title pretty much says it. After spending four great years in Colorado, I’m happy to say that I’ll be moving to Austin at the end of the month. I’ll be joining the Department of Psychology at UT-Austin as a Research Associate, where I plan to continue dabbling in all things psychological and informatic, but with less snow and more air conditioning.

While my new position nominally has the same title as my old one, the new one’s a bit unusual in that the funding is coming from two quite different sources. Half of it comes from my existing NIH grant for development of the Neurosynth framework, which means that half of my time will be spent more or less the same way I’m spending it now–namely, on building tools to improve and automate the large-scale synthesis of functional MRI data. (Incidentally, I’ll be hiring a software developer and/or postdoc in the very near future, so drop me a line if you think you might be interested.)

The other half of the funding is tied to the PsyHorns course developed by Jamie Pennebaker and Sam Gosling over the past few years. PsyHorns is a synchronous massive online course (SMOC) that lets anyone in the world with an internet connection (okay, and $550 in loose change lying around) take an introductory psychology class via the internet and officially receive credit for it from the University of Texas (this recent WSJ article on PsyHorns provides some more details). My role will be to serve as a bridge between the psychologists and the developers–which means I’ll have an eclectic assortment of duties like writing algorithms to detect cheating, developing tools to predict how well people are doing in the class, mining the gigantic reams of data we’re acquiring, developing ideas for new course features, and, of course, publishing papers.

Naturally, the PILab will be joining me in my southern adventure. Since the PILab currently only has one permanent member (guess who?), and otherwise consists of a single Mac Pro workstation, this latter move involves much less effort than you might think (though it does mean I’ll have to change the lab website’s URL, logo, and–horror of horrors–color scheme). Unfortunately, all the wonderful people of the PILab will be staying behind, as they all have various much more important ties to Boulder (by which I mean that I’m not actually currently paying any of their salaries, and none of them were willing to subsist on the stipend of baked beans, love, and high-speed internet I offered them).

While I’m super excited about moving to Austin, I’m not at all excited to leave Colorado. Boulder is a wonderful place to live*–it’s sunny all the time, it has a compact, walkable core and a surprising amount of stuff to do, and there are these gigantic mountain things you can walk all over. My wife and I have made many incredible friends here, and after four years in Colorado, it’s come to feel very much like home. So leaving will be difficult. Still, I’m excited to move on to new things. As great as the past four years have been, a number of factors precipitated this move:

  • The research fit is better. This isn’t in any way a knock against the environment here at Colorado, which has been great (hey, they’re hiring! If you do computational cognitive neuroscience, you should apply!). I had great colleagues here who work on some really interesting questions–particularly Tor Wager, my postdoc advisor for my first 3 years here, who’s an exceptional scientist and stellar human being. But every department necessarily has to focus on some areas at the expense of others, and much of the research I do (or would ideally like to do) wasn’t well-represented here. In particular, my interests in personality and individual differences have languished during my time in Boulder, as I’ve had trouble finding collaborators for most of the project ideas I’ve had. UT-Austin, by contrast, has one of the premier personality and individual differences groups anywhere. I’m delighted to be working a few doors down from people like Sam Gosling, Jamie Pennebaker, Elliot Tucker-Drob, and David Buss. On top of that, UT-Austin still has major strengths in most of my other areas of interest, most notably neuroimaging (I expect to continue to collaborate frequently with Russ Poldrack) and data mining (a world-class CS department with an expanding focus on Big Data). So, purely in terms of fit, it’s hard for me to imagine a better place than UT.
  • I’m excited to work on a project with immediate real-world impact. While I’d love to believe that most of the work I currently do is making the world better in some very small way, the reality most scientists engaged in basic research face is that at the end of the day, we don’t actually know what impact we’re having. There’s nothing inherently wrong with that, mind you; as a general rule, I’m a big believer in the idea of doing science just because it’s interesting and exciting, without worrying about the consequences (or lack thereof). You know, knowledge for its own sake and all that. Still, on a personal level, I find myself increasingly wanting to do something that I feel confers some clear and measurable benefit on the world right now–however small. In that respect, online education strikes me as an excellent area to pour my energy into. And PsyHorns is a particularly unusual (and, to my mind, promising) experiment in online education. The preliminary data from previous iterations of the course suggest that students who take the course synchronously online do better academically–not just in this particular class (as compared to an in-class section), but in other courses as well. While I’m not hugely optimistic about the malleability of the human mind as a general rule–meaning, I don’t think there are as-yet undiscovered teaching approaches that are going to radically improve the learning experience–I do believe strongly in the cumulative impact of many small nudges in the right direction. I think this is the right platform for that kind of nudging.
  • Data. Lots and lots of data. Enrollment in PsyHorns this year is about 1,500 students, and previous iterations have seen comparable numbers. As part of their introduction to psychology, the students engage in a wide range of activities: they have group chats about the material they’re learning; they write essays about a range of topics; they fill out questionnaires and attitude surveys; and, for the first time this year, they use a mobile app that assesses various aspects of their daily experience. Aside from the feedback we provide to the students (some of which is potentially actionable right away), the data we’re collecting provides a unique opportunity to address many questions at the intersection of personality and individual differences, health and subjective well-being, and education. It’s not Big Data by, say, Google or Amazon standards (we’re talking thousands of rows rather than billions), but it’s a dataset with few parallels in psychology, and I’m thrilled to be able to work on it.
  • I like doing research more than I like teaching** or doing service work. Like my current position, the position I’m assuming at UT-Austin is 100% research-focused, with very little administrative or teaching overhead. Obviously, it doesn’t have the long-term security of a tenure-track position, but I’m okay with that. I’m still selectively applying for tenure-track positions (and turned one down this year in favor of the UT position), so it’s not as though I have any principled objections to the tenure stream. But short of a really amazing opportunity, I’m very happy with my current arrangement.
  • Austin seems like a pretty awesome place to live. Boulder is too, but after four years of living in a relatively small place (population: ~100,000), my wife and I are looking forward to living somewhere more city-like. We’ve opted to take the (expensive) plunge and live downtown–where we’ll be within walking distance of just about everything we need. By which of course I mean the chocolate fountain at the Whole Foods mothership.
  • The tech community in Austin is booming. Given that most of my work these days lies at the interface of psychology and informatics, and there are unprecedented opportunities for psychology-related data mining in industry these days, I’m hoping to develop better collaborations with people in industry–at both startups and established companies. While I have no intention of leaving academia in the near future, I do think psychologists have collectively failed to take advantage of the many opportunities to collaborate with folks in industry on interesting questions about human behavior–often at an unprecedented scale. I’ve done a terrible job of that myself, and fixing that is near the top of my agenda. So, hey, if you work at a tech company in Austin and have some data lying around that you think might shed new insights on what people feel, think, and do, let’s chat!
  • I guess sometimes you just get the itch to move on to something new. For me, this is that.

University of Texas at Austin campus at dusk (aerial view)

* Okay, it was an amazing place to live until the massive floods this past week rearranged rivers, roads, and lives. My wife and I were fortunate enough to escape any personal or material damage, but many others were not so lucky. If you’d like to help, please consider making a donation.

** Actually, I love teaching. What I don’t love is all the stuff surrounding teaching.

…and then there were two!

Last year when I launched my lab (which, full disclosure, is really just me, plus some of my friends who were kind enough to let me plaster their names and faces on my website), I decided to call it the Psychoinformatics Lab (or PILab for short and pretentious), because, well, why not. It seemed to nicely capture what my research is about: psychology and informatics. But it wasn’t an entirely comfortable decision, because a non-trivial portion of my brain was quite convinced that everyone was going to laugh at me. And even now, after more than a year of saying I’m a “psychoinformatician” whenever anyone asks me what I do, I still feel a little bit fraudulent each time–as if I’d just said I was a member of the Estonian Cosmonaut program, or the president of the Build-a-Bear fan club*.

But then… just last week… everything suddenly changed! All in one fell swoop–in one tiny little nudge of a shove-this-on-the-internet button, things became magically better. And now colors are vibrating**, birds are chirping merry chirping songs–no, wait, those are actually cicadas–and the world is basking in a pleasant red glow of humming monitors and five-star Amazon reviews. Or something like that. I’m not so good with the metaphors.

Why so upbeat, you ask? Well, because as of this writing, there is no longer just the one lone Psychoinformatics Lab. No! Now there are not one, not three, not seven Psychoinformatics Labs, but… two! There are two Psychoinformatics Labs. The good Dr. Michael Hanke (of PyMVPA and NeuroDebian fame) has just finished putting the last coat of paint on the inside of his brand new cage–er, Psychoinformatics Lab–at the Otto-von-Guericke University Magdeburg in Germany. No, really***: his startup package didn’t include any money for paint, so he had to barter his considerable programming skills for three buckets of Going to the Chapel (yes, that’s a real paint color).

The good Dr. Hanke drifts through interstellar space in search of new psychoinformatic horizons.

Anyway, in case you can’t tell, I’m quite excited about this. Not because it’s a sign that informatics approaches are making headway in psychology, or that pretty soon every psychology lab will have a high-performance computing cluster hiding in its closet (one can dream, right?). No sir. I’m excited for two much more pedestrian reasons. First, because from now on, any time anyone makes fun of me for calling myself a psychoinformatician, I’ll be able to say, with a straight face, well it’s not just me, you know–there are multiple ones of us doing this here research-type thing with the data and the psychology and the computers. And second, because Michael is such a smart and hardworking guy that I’m pretty sure he’s going to legitimize this whole enterprise and drag me along for the ride with him, so I won’t have to do anything else myself. Which is good, because if laziness were an Olympic sport, I’d never leave the starting block.

No, but in all seriousness, Michael is an excellent scientist and an exceptional human being, and I couldn’t be happier for him in his new job as Lord Director of All Things Psychoinformatic (Eastern Division). You might think I’m only saying this because he just launched the world’s second PILab, complete with quote from yours truly on said lab’s website front page. Well, you’d be right. But still. He’s a pretty good guy, and I’m sure we’re going to see amazing things coming out of Magdeburg.

Now if anyone wants to launch PILab #3 (maybe in Asia or South America?), just let me know, and I’ll make you the same offer I made Michael: an envelope full of $1 bills (well, you know, I’m an academic–I can’t afford Benjamins just yet) and a blog post full of ridiculous superlatives.

* Perhaps that’s not a good analogy, because that one may actually exist.

** But seriously, in real life, colors should not vibrate. If you ever notice colors vibrating, drive to the nearest emergency room and tell them you’re seeing colors vibrating.

*** No, not really.

bio-, chemo-, neuro-, eco-informatics… why no psycho-?

The latest issue of the APS Observer features a special section on methods. I contributed a piece discussing the need for a full-fledged discipline of psychoinformatics:

Scientific progress depends on our ability to harness and apply modern information technology. Many advances in the biological and social sciences now emerge directly from advances in the large-scale acquisition, management, and synthesis of scientific data. The application of information technology to science isn’t just a happy accident; it’s also a field in its own right — one commonly referred to as informatics. Prefix that term with a Greek root or two and you get other terms like bioinformatics, neuroinformatics, and ecoinformatics — all well-established fields responsible for many of the most exciting recent discoveries in their parent disciplines.

Curiously, following the same convention also gives us a field called psychoinformatics — which, if you believe Google, doesn’t exist at all (a search for the term returns only 500 hits as of this writing; Figure 1). The discrepancy is surprising, because labels aside, it’s clear that psychological scientists are already harnessing information technology in powerful and creative ways — often reshaping the very way we collect, organize, and synthesize our data.

Here’s the picture that’s worth, oh, at least ten or fifteen words:

Figure 1. Number of Google search hits for informatics-related terms, by prefix.

You can read the rest of the piece here if you’re so inclined. Check out some of the other articles too; I particularly like Denny Borsboom’s piece on network analysis. EDIT: and Anna Mikulak’s piece on optogenetics! I forgot the piece on optogenetics! How can you not love optogenetics!

tracking replication attempts in psychology–for real this time

I’ve written a few posts on this blog about how the development of better online infrastructure could help address and even solve many of the problems psychologists and other scientists face (e.g., the low reliability of peer review, the ‘fudge factor’ in statistical reporting, the sheer size of the scientific literature, etc.). Actually, that general question–how we can use technology to do better science–occupies a good chunk of my research these days (see e.g., Neurosynth). One question I’ve been interested in for a long time is how to keep track not only of ‘successful’ studies (i.e., those that produce sufficiently interesting effects to make it into the published literature), but also replication failures (or successes of limited interest) that wind up in researchers’ file drawers. A couple of years ago I went so far as to build a prototype website for tracking replication attempts in psychology. Unfortunately, it never went anywhere, partly (okay, mostly) because the site really sucked, and partly because I didn’t really invest much effort in drumming up interest (mostly due to lack of time). But I still think the idea is a valuable one in principle, and a lot of other people have independently had the same idea (which means it must be right, right?).

Anyway, it looks like someone finally had the cleverness, time, and money to get this right. Hal Pashler, Sean Kang*, and colleagues at UCSD have been developing an online database for tracking attempted replications of psychology studies for a while now, and it looks like it’s now in beta. PsychFileDrawer is a very slick, full-featured platform that really should–if there’s any justice in the world–provide the kind of service everyone’s been saying we need for a long time now. If it doesn’t work, I think we’ll have some collective soul-searching to do, because I don’t think it’s going to get any easier than this to add and track attempted replications. So go use it!

*Full disclosure: Sean Kang is a good friend of mine, so I’m not completely impartial in plugging this (though I’d do it anyway). Sean also happens to be amazingly smart and in search of a faculty job right now. If I were you, I’d hire him.

Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review it. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes,  coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
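Just to make that concrete, here’s roughly the sort of thing I have in mind–a throwaway Python sketch, where the term list, the function names, and the scoring rule are all things I made up for illustration, not a validated detector:

```python
import re

# Hypothetical list of phrases suggesting the authors thought about
# multiple-comparison correction. The patterns and the scoring rule are
# illustrative only, not validated.
CORRECTION_PATTERNS = [
    r"multiple comparisons?",
    r"bonferroni",
    r"false discovery rate",
    r"\bFDR\b",
    r"family-?wise error",
    r"\buncorrected\b",
]

def correction_mentions(text):
    """Count how often each correction-related phrase appears in a paper's text."""
    return {p: len(re.findall(p, text, flags=re.IGNORECASE))
            for p in CORRECTION_PATTERNS}

def mentions_correction(text):
    """Crude boolean flag: does the paper mention correction at all?"""
    return any(n > 0 for n in correction_mentions(text).values())

if __name__ == "__main__":
    sample = ("All reported p values are Bonferroni-corrected for "
              "multiple comparisons across the six outcome measures.")
    print(correction_mentions(sample))
    print(mentions_correction(sample))  # True
```

A real version would obviously need to handle negation (“we did not correct for multiple comparisons”), section boundaries, and so on; the point is just that a crude first pass is cheap.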

Once we’d extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance (a rough Python sketch of a couple of these follows the list):

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results.
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.
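As a sketch of what that kind of aggregation might look like–again with made-up metric names and weights, nothing calibrated against any real criterion:

```python
import statistics

def zscore(values):
    """Standardize one metric's values across a corpus of papers."""
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

def composite_scores(metric_table, weights):
    """metric_table maps metric name to a list of values (one per paper);
    weights maps metric name to a signed weight (negative means 'more is worse').
    Returns one aggregate score per paper."""
    n_papers = len(next(iter(metric_table.values())))
    totals = [0.0] * n_papers
    for name, values in metric_table.items():
        for i, z in enumerate(zscore(values)):
            totals[i] += weights.get(name, 0.0) * z
    return totals

# Hypothetical numbers for three papers, using two of the metrics sketched above.
table = {
    "near_threshold_rate": [0.9, 0.1, 0.3],
    "sample_size_dispersion": [0.6, 0.05, 0.2],
}
weights = {"near_threshold_rate": -1.0, "sample_size_dispersion": -1.0}
print(composite_scores(table, weights))  # the first paper comes out worst
```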

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.
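That particular flag would take about five lines–something like this toy version, where the cutoff of three mentions is completely arbitrary:

```python
def flag_missing_main_effects(text, min_interaction_mentions=3):
    """Flag papers that discuss interactions repeatedly but never mention
    main effects. The cutoff of three mentions is arbitrary."""
    t = text.lower()
    return (t.count("interaction") >= min_interaction_mentions
            and "main effect" not in t)
```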

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?
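If we did have even a small set of original findings coded for whether they subsequently replicated, the modeling step itself would be pretty mundane–something like the following, where the feature values and replication labels are entirely fabricated for illustration, and scikit-learn is just one obvious library choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Fabricated feature matrix, one row per original finding. Columns:
# [near_threshold_rate, sample_size_dispersion, mentions_correction (0/1)].
X = np.array([
    [0.9, 0.60, 0],
    [0.1, 0.10, 1],
    [0.5, 0.30, 0],
    [0.0, 0.05, 1],
    [0.8, 0.50, 0],
    [0.2, 0.10, 1],
])
# Fabricated labels: did a later attempt replicate the finding?
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression()
# Cross-validated accuracy; a real analysis would need far more than six rows.
print(cross_val_score(model, X, y, cv=3).mean())
```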

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?