building better platforms for evaluating science: a request for feedback

UPDATE 4/20/2012: a revised version of the paper mentioned below is now available here.

A couple of months ago I wrote about a call for papers for a special issue of Frontiers in Computational Neuroscience focusing on “Visions for Open Evaluation of Scientific Papers by Post-Publication Peer Review”. I wrote a paper for the issue, the gist of which is that many of the features scientists should want out of a next-generation open evaluation platform are already implemented all over the place in social web applications, so building platforms for evaluating scientific output should be more a matter of adapting existing techniques than of inventing brilliant new approaches. I’m talking about features like recommendation engines, APIs, and reputation systems, which you can find everywhere from Netflix to Pandora to Stack Overflow to Amazon, but (unfortunately) virtually nowhere in the world of scientific publishing.

Since the official deadline for submission is two months away (no, I’m not so conscientious that I habitually finish my writing assignments two months ahead of time–I just failed to notice that the deadline had been pushed way back), I figured I may as well use the opportunity to make the paper openly accessible right now in the hopes of soliciting some constructive feedback. This is a topic that’s kind of off the beaten path for me, and I’m not convinced I really know what I’m talking about (well, fine, I’m actually pretty sure I don’t know what I’m talking about), so I’d love to get some constructive criticism from people before I submit a final version of the manuscript. Not only from scientists, but ideally also from people with experience developing social web applications–or actually, just about anyone with good ideas about how to implement and promote next-generation evaluation platforms. I mean, if you use Netflix or reddit regularly, you’re pretty much a de facto expert on collaborative filtering and recommendation systems, right?
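Just to make it concrete: the simplest flavor of collaborative filtering really does amount to “people whose ratings look like yours liked X, so you’ll probably like X too”. Here’s a toy sketch in Python, with made-up ratings for papers instead of movies (real systems are vastly more sophisticated, but the principle is the same):

```python
# Toy user-based collaborative filtering, the kind of thing Netflix-style
# recommenders are built on. Ratings are invented; rows are users,
# columns are papers, and 0 means "hasn't rated it yet".
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    mask = (u > 0) & (v > 0)
    if not mask.any():
        return 0.0
    return np.dot(u[mask], v[mask]) / (
        np.linalg.norm(u[mask]) * np.linalg.norm(v[mask]))

def predict(user, item):
    """Predict a rating as a similarity-weighted average of other users'."""
    sims = np.array([cosine_sim(ratings[user], other) for other in ratings])
    sims[user] = 0.0                    # don't count yourself
    rated = ratings[:, item] > 0        # only users who rated this item
    if sims[rated].sum() == 0:
        return ratings[ratings > 0].mean()
    return np.dot(sims[rated], ratings[rated, item]) / sims[rated].sum()

print(predict(user=0, item=2))  # ~2.2: user 0 probably shouldn't bother
```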

Anyway, here’s the abstract:

Traditional pre-publication peer review of scientific output is a slow, inefficient, and unreliable process. Efforts to replace or supplement traditional evaluation models with open evaluation platforms that leverage advances in information technology are slowly gaining traction, but remain in the early stages of design and implementation. Here I discuss a number of considerations relevant to the development of such platforms. I focus particular attention on three core elements that next-generation evaluation platforms should strive to emphasize, including (a) open and transparent access to accumulated evaluation data, (b) personalized and highly customizable performance metrics, and (c) appropriate short-term incentivization of the userbase. Because all of these elements have already been successfully implemented on a large scale in hundreds of existing social web applications, I argue that development of new scientific evaluation platforms should proceed largely by adapting existing techniques rather than engineering entirely new evaluation mechanisms. Successful implementation of open evaluation platforms has the potential to substantially advance both the pace and the quality of scientific publication and evaluation, and the scientific community has a vested interest in shifting towards such models as soon as possible.

You can download the PDF here (or grab it from SSRN here). It features a cameo by Archimedes and borrows concepts liberally from sites like reddit, Netflix, and Stack Overflow (with attribution, of course). I’d love to hear your comments; you can either leave them below or email me directly. Depending on what kind of feedback I get (if any), I’ll try to post a revised version of the paper here in a month or so that works in people’s comments and suggestions.

(fanciful depiction of) Archimedes, renowned ancient Greek mathematician and co-inventor (with Al Gore) of the open access internet repository

what aspirin can tell us about the value of antidepressants

There’s a nice post on Science-Based Medicine by Harriet Hall pushing back (kind of) against the increasingly popular idea that antidepressants don’t work. For context, there have been a couple of large recent meta-analyses that used comprehensive FDA data on clinical trials of antidepressants (rather than only published studies, which are biased toward larger, statistically significant effects) to argue that antidepressants are of little or no use in mildly or moderately depressed people, and achieve a clinically meaningful benefit only in the severely depressed.

Hall points out that whether you think antidepressants have a clinically meaningful benefit or not depends on how you define clinically meaningful (okay, this sounds vacuous, but bear with me). Most meta-analyses of antidepressant efficacy reveal an effect size of somewhere between 0.3 and 0.5 standard deviations. Historically, psychologists consider effect sizes of 0.2, 0.5, and 0.8 standard deviations to be small, medium, and large, respectively. But as Hall points out:

The psychologist who proposed these landmarks [Jacob Cohen] admitted that he had picked them arbitrarily and that they had “no more reliable a basis than my own intuition.” Later, without providing any justification, the UK’s National Institute for Health and Clinical Excellence (NICE) decided to turn the 0.5 landmark (why not the 0.2 or the 0.8 value?) into a one-size-fits-all cut-off for clinical significance.

She goes on to explain why this ultimately leaves the efficacy of antidepressants open to interpretation:

In an editorial published in the British Medical Journal (BMJ), Turner explains with an elegant metaphor: journal articles had sold us a glass of juice advertised to contain 0.41 liters (0.41 being the effect size Turner et al. derived from the journal articles); but the truth was that the “glass” of efficacy contained only 0.31 liters. Because these amounts were lower than the (arbitrary) 0.5 liter cut-off, NICE standards (and Kirsch) consider the glass to be empty. Turner correctly concludes that the glass is far from full, but it is also far from empty. He also points out that patients’ responses are not all-or-none and that partial responses can be meaningful.

I think this pretty much hits the nail on the head; no one really doubts that antidepressants work at this point; the question is whether they work well enough to justify their side effects and the social and economic costs they impose. I don’t have much to add to Hall’s argument, except that I think she doesn’t sufficiently emphasize how big a role scale plays when trying to evaluate the utility of antidepressants (or any other treatment). At the level of a single individual, a change of one-third of a standard deviation may not seem very big (then again, if you’re currently depressed, it might!). But on a societal scale, even canonically ‘small’ effects can add up to very large consequences in the aggregate.

The example I’m most fond of here is Robert Rosenthal’s famous illustration of the effects of aspirin on heart attack. The correlation between taking aspirin daily and decreased risk of heart attack is, at best, .03 (I say at best because the estimate is based on a large 1988 study, but my understanding is that more recent studies have moderated even this small effect). In most domains of psychology, a correlation of .03 is so small as to be completely uninteresting. Most psychologists would never seriously contemplate running a study to try to detect an effect of that size. And yet, at a population level, even an r of .03 can have serious implications. Cast in a different light, what this effect means is that 3% of people who would be expected to have a heart attack without aspirin would be saved from that heart attack given a daily aspirin regimen. Needless to say, this isn’t trivial. It amounts to a potentially life-saving intervention for 30 out of every 1,000 people. At a public policy level, you’d be crazy to ignore something like that (which is why, for a long time, many doctors recommended that people take an aspirin a day). And yet, by the standards of experimental psychology, this is a tiny, tiny effect that probably isn’t worth getting out of bed for.
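Incidentally, the translation from r = .03 to “30 out of every 1,000 people” is Rosenthal’s binomial effect size display, which recasts a correlation as the difference between two success rates centered on 50%. The arithmetic fits in a few lines of Python:

```python
# Rosenthal's binomial effect size display (BESD): recast a correlation
# r as the difference between two "success" rates centered on 50%.
def besd(r):
    return 0.5 + r / 2, 0.5 - r / 2

aspirin, placebo = besd(0.03)
print(f"no heart attack, aspirin group: {aspirin:.1%}")   # 51.5%
print(f"no heart attack, placebo group: {placebo:.1%}")   # 48.5%
# A 3-percentage-point gap: 30 people per 1,000 with a different outcome.
```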

The point of course is that when you consider how many people are currently on antidepressants (millions), even small effects–and certainly an effect of one-third of a standard deviation–are going to be compounded many times over. Given that antidepressants demonstrably reduce the risk of suicide (according to Hall, by about 20%), there’s little doubt that tens of thousands of lives have been saved by antidepressants. That doesn’t necessarily justify their routine use, of course, because the side effects and costs also scale up to the societal level (just imagine how many millions of bouts of nausea could be prevented by eliminating antidepressants from the market!). The point is just that, if you think the benefits of antidepressants outweigh their costs even slightly at the level of the average depressed individual, you’re probably committing yourself to thinking that they have a hugely beneficial impact at a societal level–and that holds true irrespective of whether the effects are ‘clinically meaningful’ by conventional standards.

in praise of self-policing

It’s IRB week over at The Hardest Science; Sanjay has an excellent series of posts (1, 2, 3) discussing some proposed federal rule changes to the way IRBs oversee research. The short of it is that the proposed changes are mostly good news for people who do minimal risk-type research with human subjects (i.e., stuff that doesn’t involve poking people with needles); if the changes pass as written, most of us will no longer have to file any documents with our IRBs before running our studies. We’ll just put in a short note saying we’ve determined that our studies are excused from review, and then we can start collecting data right away. It’ll work something like this*:

This doesn’t mean federal oversight of human subjects research will cease, of course. There will still be guidelines we all have to follow. But instead of making researchers jump through flaming hoops preemptively, enforcement will take place on an ad-hoc basis and via random audits. For the most part, the important decisions will be left to investigators rather than IRBs. For more details, see Sanjay’s excellent breakdown.

I also agree with Sanjay’s sentiment in his latest post that this is the right way to do things; researchers should police themselves, rather than employing an entire staff of people whose job it is to tell researchers how to safely and ethically do their research. In principle, the idea of having trained IRB analysts go over every study sounds nice; the problem is that it takes a very long time, generates a lot of extra work for everyone, and perhaps most problematically, sets up all sorts of perverse incentives. Namely, IRB analysts have an incentive to be pedantic (since they rarely lose their jobs if they ask for too much detail, but could be liable if they give too much leeway and something bad happens), and investigators have an incentive to off-load their conscience onto the IRB rather than actually having to think about the impact of their experiment on subjects. I catch myself doing this more often than I’d like, and I’m not really happy about it. (For instance, I recently found myself telling someone it was okay for them to present gruesome pictures to subjects “because the IRB doesn’t mind that”, and not because I thought the psychological impact was negligible. I gave myself twenty lashes for that one**.) I suspect that, aside from saving everyone a good deal of time and effort, placing the responsibility of doing research on researchers’ shoulders would actually lead them to give more, and not less, consideration to ethical issues.

Anyway, it remains to be seen whether the proposed rules actually pass in their current form. One of the interesting features of the situation is that IRBs may now have a perverse incentive to fight against these rules going into effect, since they’d almost certainly need to lay off staff if we move to a system where most studies are entirely excused from review. I don’t really think that this will be much of an issue, and on balance I’m sure university administrations recognize how much IRBs slow down research; but it still can’t hurt for those of us who do research with human subjects to stick our heads past the Department of Health and Human Services’ doors and affirm that excusing most non-invasive human subjects research from review is the right thing to do.


* I know, I know. I managed to go two whole years on this blog without a single lolcat appearance, and now I throw it all away for this. Sorry.

** With a feather duster.

in which I suffer a minor setback due to hyperbolic discounting

I wrote a paper with some collaborators that was officially published today in Nature Methods (though it’s been available online for a few weeks). I spent a year of my life on this (a YEAR! That’s like 30 years in opossum years!), so go read the abstract, just to humor me. It’s about large-scale automated synthesis of human functional neuroimaging data. In fact, it’s so about that that that’s the title of the paper*. There’s also a companion website over here, which you might enjoy playing with if you like brains.

I plan to write a long post about this paper at some point in the near future, but not today. What I will do today is tell you all about why I didn’t write anything about the paper much earlier (i.e., 4 weeks ago, when it appeared online), because you seem very concerned. You see, I had grand plans for writing a very detailed and wonderfully engaging multi-part series of blog posts about the paper, starting with the background and motivation for the project (that would have been Part 1), then explaining the methods we used (Part 2), then the results (III; let’s switch to Roman numerals for effect), then some of the implications (IV), then some potential applications and future directions (V), then some stuff that didn’t make it into the paper (VI), and then, finally, a behind-the-science account of how it really all went down (VII; complete with filmed interviews with collaborators who left the project early due to creative differences). A seven-part blog post! All about one paper! It would have been longer than the article itself! And all the supplemental materials! Combined! Take my word for it, it would have been amazing.

Unfortunately, like most everyone else, I’m a much better person in the future than I am in the present; things that would take me a week of full-time work in the Now apparently take me only five to ten minutes when I plan them three months ahead of time. If you plotted my temporal discounting curve for intellectual effort, it would look like this:
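For the morbidly curious, the standard hyperbolic form is V = A/(1 + kD), where A is the magnitude of the thing being discounted and D is the delay. Here’s a rough sketch of my curve, with a discount rate invented purely for comic accuracy:

```python
# Hyperbolic discounting: the subjective size V of an effort A that is
# D days away falls off as V = A / (1 + k*D). The value of k below is
# invented for comic accuracy; mine is presumably much larger.
import numpy as np
import matplotlib.pyplot as plt

effort_hours = 40.0                    # a week of full-time work
k = 3.0                                # my (hypothetical) discount rate
delay_days = np.linspace(0, 90, 200)   # planning horizon: three months

perceived_hours = effort_hours / (1 + k * delay_days)

plt.plot(delay_days, perceived_hours)
plt.xlabel("days until I actually have to do the work")
plt.ylabel("how big the job feels (hours)")
plt.show()
# At D = 90, a 40-hour job feels like about 9 minutes. As advertised.
```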

So that’s why my seven-part series of blog posts didn’t debut at the same time the paper was published online a few weeks ago. In fact, it hasn’t debuted at all. At this point, my much more modest goal is just to write a single much shorter post, which will no longer be able to DEBUT, but can at least slink into the bar unnoticed while everyone else is out on the patio having a smoke. And really, I’m only doing it so I can look myself in the eye again when I look in the mirror. Because it turns out it’s very hard to shave your face safely if you’re not allowed to look yourself in the eye. And my labmates are starting to call me PapercutMan, which isn’t really a superpower worth having.

So yeah, I’ll write something about this paper soon. But just to play it safe, I’m not going to operationally define ‘soon’ right now.

 

* Three “that”s in a row! What are the odds! Good luck parsing that sentence!

sunbathers in America

This is fiction. Kind of. Science left for a few days and asked fiction to care for the house.


I ran into my friend, Cornelius Kipling, at the grocery store. He was ahead of me in line, holding a large eggplant and a copy of the National Enquirer. I didn’t ask about it.

I hadn’t seen Kip in six months, so we went for a walk along Boulder Creek to catch up. Kip has a Ph.D. in molecular engineering from Ben-Gurion University of the Negev, and an MBA from an online degree mill. He’s the only person I know who combines an earnest desire to save the world with the scruples of a small-time mafia don. He’s an interesting person to talk to as long as you remember that he gets most of his ideas out of mail-order catalogs.

“What are you working on these days?” I asked him after I’d stashed my groceries in the fridge and retrieved my wallet from his pocket. Last I’d heard Kip was involved in a minor arson case and couldn’t come within three thousand feet of any Monsanto office.

“Saving lives,” he said, in the same matter-of-fact way that a janitor will tell you he cleans bathrooms. “Small lives. Fireflies. I’m making miniature organic light-emitting diodes that save fireflies from certain death at the hands of the human industrial-industrial complex.”

“The industrial human what?”

“Exactly,” he said, ignoring the question. “We’re developing new LEDs that mimic the light fireflies give off. The purpose of the fire in fireflies, you see, is to attract mates. Bigger light, better mate. The problem is, humans have much bigger lights than fireflies. So fireflies end up trying to mate with incandescents. You turn on a light bulb outside, and pffftttt there go a dozen bugs. It’s genocide, only on a larger scale. Whereas the LEDs we’re building attract fireflies like crazy but aren’t hot enough to harm them. At worst, you’ve got a device guaranteed to start a firefly orgy when it turns on.”

“Well, that absolutely sounds like another winning venture,” I said. “Oh, hey, what happened to the robot-run dairy you were going to start?”

“The cow drowned,” he said wistfully. We spent a few moments in silence while I waited for conversational manna to rain down on my head. It didn’t.

“I didn’t mean to mock you,” I said finally. “I mean, yes, of course I meant to mock you. But with love. Not like an asshole. You know.”

“S’okay. Your sarcasm is an ephemeral, transient thing–like summer in the Yukon–but the longevity of the firefly is a matter of life and death.”

“Sure it is,” I said. “For the fireflies.”

“This is the potential impact of my work right now,” Kip said, holding his hands a foot apart, as if he were cupping a large balloon. “The oldest firefly in captivity just turned forty-one. That’s eleven years older than us. But in the wild, the average firefly only lives six weeks. Mostly because of contact with the residues of the industrial-industrial complex. Compact fluorescents, parabolic aluminized reflectors, MR halogens, Rizzuto globes, and regular old incandescents. Historically, the common firefly stood no chance against us. But now, I am its redress. I am the Genghis Khan of the Lampyridae Mongol horde. Prepare to be pillaged.”

“I think you just make this stuff up,” I said, wincing at the analogy. “I mean, I’m not one hundred percent sure. But I’m very close to one hundred percent sure.”

“Your envy of other people’s imagination is your biggest problem,” said Kip, rubbing his biceps in lazy circles through his shirt. “And my biggest problem is: I need more imaginative friends. Just this morning, in the shower, this question popped into my head, and it’s been bugging me ever since: if you could be any science fiction character, who would you be? But I can’t ask you what you think; you have no vision. You didn’t even ask me why I was checking out with nothing but an eggplant when you saw me at the grocery store.”

“It’s not a vision problem,” I said. “It’s strictly a science fiction problem. I’m just no good at it. I’ll sit down to read a Ben Bova book, and immediately my egg timer will go off, or I’ll remember I need to renew my annual subscription to Vogue. That stuff never happens when I read Jane Austen or Asterix. Plus, I have this long-standing fear that if I read a lot of sci-fi, I’ll learn too much about the future; more than is healthy for any human being to know. There are like three hundred thousand science fiction novels in print, but we only have one future between all of us. The odds are good that at least one of those novels is basically right about what will happen. I won’t even watch a ninety-minute slasher film if someone tells me ahead of time that the killer is the girl from Ipanema with the dragon tattoo; why would I want to read all that science fiction and find out that thirty years from now, sentient goats from Zorbon will land on Mt. Rushmore and enslave us all, starting with the lawyers?”

“See,” he said. “No answer. Simple question, but no answer.”

“Fine,” I said. “If I must. Hari Seldon.”

“Good. Why?”

“Because,” I said, “unlike the real world, Hari Seldon lives in a mysterious future where psychologists can actually predict people’s behavior.”

“Predicting things is not so hard,” said Kip. “Take for instance the weather. It’s like ninety-three degrees today, which means the nudists will be out in force on the rocks by the Gold Run condos. It’s the only time they have a legitimate excuse to expose their true selves.”

We walked another fifty paces.

“See?” he said, as we stepped off a bridge and rounded a corner along the path. “There they are.”

I nodded. There they were: young, old, and pantsless all over.

“Personally, I always wanted to be Superman,” Kip said as we kept walking. He traced an S through his sweat-stained shirt. “Like every other kid I guess. But then when I hit puberty, I realized being Superman is a lot of responsibility. You can’t sit naked on the rocks on a hot day. Not when you’re Superman. You can’t really do anything just for fun. You can’t punch a hole in the wall to annoy your neighbor who smokes a pack a day and makes the whole building smell like stale menthol. You can’t even use your x-ray vision to stare at his wife in the shower. You need a reason for everything you do; the citizens of Metropolis demand accountability. So instead of being Superman, I figured I’d keep the S on the chest, but make it stand for ‘Science’. And now my guiding philosophy is to go through life always performing random acts of scientific kindness but never explicitly committing to help anyone. That way I can be a fundamentally decent human being who still occasionally pops into a titty bar for a late buffet-style lunch.”

I stared at him in awe, amazed that so much light and air could stream out of one man’s ego. I think in his mind, Kip really believed that spending all of his time on personal science projects put him on the side of the angels. That St. Peter himself would one day invite him through the Pearly Gates just to hang out and compare notes on fireflies. And then of course Kip would get to tell St. Peter, “no thanks,” and march right past him into a strip club.

My mental cataloging of Kip’s character flaws was broken up by an American White Pelican growling loudly somewhere in the sky above us. It spun around a few times before divebombing into the creek–an ambivalently graceful entrance reminiscent of Greg Louganis at the ’88 Olympics. American White Pelicans aren’t supposed to plunge-dive for food, but I guess that’s the beauty of America; anyone can exercise their individuality at any given moment. You can get Superman, floating above Metropolitan landmarks, eyeing anonymous bathrooms and wishing he could use his powers for evil instead of good; Cornelius Kipling, with ideas so grand and unattainable they crush out every practical instinct in his body; and me, with my theatrical vision of myself–starring myself, as Hari Seldon, the world’s first useful psychologist!

And all of us just here for a brief flash in the goldpan of time; just temporary sunbathers in America.

“You’re overthinking things again,” Kip said from somewhere outside my head. “I can tell. You’ve got that dumb look on your face that says you think you have a really deep thought on your face. Well, you don’t. You know what, forget the books; the nudists have the right idea. Go lie on the grass and pour some goddamn sunshine on your skin. You look even whiter than I remembered.”

we, the people, who make mistakes–economists included

Andrew Gelman discusses a “puzzle that’s been bugging [him] for a while”:

Pop economists (or, at least, pop micro-economists) are often making one of two arguments:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-schools teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

The trick is knowing whether you’re gonna get 1 or 2 above. They’re complete opposites!

Personally what I find puzzling isn’t really how to reconcile these two strands (which do seem to somehow coexist quite peacefully in pop economists’ writings); it’s how anyone–economist or otherwise–still manages to believe people are rational in any meaningful sense (and I’m not saying Andrew does; in fact, see below).

There are at least two non-trivial ways to define rationality. One is in terms of an ideal agent’s actions–i.e., rationality is what a decision-maker would choose to do if she had unlimited cognitive resources and knew all the information relevant to a given decision. Well, okay, maybe not an ideal agent, but at the very least a very smart one. This is the sense of rationality in which you might colloquially remark to your neighbor that buying lottery tickets is an irrational thing to do, because the odds are stacked against you. The expected value of buying a lottery ticket (i.e., the amount you would expect to end up with in the long run) is generally negative, so in some normative sense, you could say it’s irrational to buy lottery tickets.
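If you want the arithmetic made explicit, here’s a rough calculation with Powerball-ish numbers (jackpot only; I’m ignoring the smaller prizes, taxes, and split jackpots, all of which matter in real life):

```python
# Rough expected value of a lottery ticket: a $2 ticket and a
# 1-in-292-million shot at a $100M jackpot (Powerball-ish numbers;
# smaller prizes, taxes, and split jackpots are ignored here).
p_jackpot    = 1 / 292_000_000
jackpot      = 100_000_000
ticket_price = 2.00

ev = p_jackpot * jackpot - ticket_price
print(f"expected value per ticket: ${ev:.2f}")  # about -$1.66
```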

This definition of irrationality is probably quite close to the colloquial usage of the term, but it’s not really interesting from an academic standpoint, because nobody (economists included) really believes we’re rational in this sense. It’s blatantly obvious to everyone that none of us really make normatively correct choices much of the time. If for no other reason than we are all somewhat lacking in the omniscience department.

What economists mean when they talk about rationality is something more technical; specifically, it’s that people manifest stationary preferences. That is, given any set of preferences an individual happens to have (which may seem completely crazy to everyone else), rationality implies that that person expresses those preferences in a consistent manner. If you like dark chocolate more than milk chocolate, and milk chocolate more than Skittles, you shouldn’t like Skittles more than dark chocolate. If you do, you’re violating the principle of transitivity, which would effectively make it impossible to model your preferences formally (since we’d have no way of telling what you’d prefer in any given situation). And that would be a problem for standard economic theory, which is based on the assumption that people are fundamentally rational agents (in this particular sense).
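One nice thing about consistency, incidentally, is that it’s trivial to check mechanically. Here’s a toy sketch that sniffs out transitivity violations in a set of pairwise choices (the items and the offending choice are invented, obviously):

```python
# Toy transitivity check: given pairwise choices, look for cycles like
# A > B, B > C, C > A that make preferences impossible to model.
from itertools import permutations

# (a, b) means "a was chosen over b"; invented data
choices = {("dark chocolate", "milk chocolate"),
           ("milk chocolate", "skittles"),
           ("skittles", "dark chocolate")}   # <- the offending choice

def transitivity_violations(prefs):
    items = {x for pair in prefs for x in pair}
    return [(a, b, c)
            for a, b, c in permutations(items, 3)
            if (a, b) in prefs and (b, c) in prefs and (c, a) in prefs]

print(transitivity_violations(choices))  # prints the chocolate/skittles cycle
```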

The reason I say it’s puzzling that anyone still believes people are rational in even this narrower sense is that decades of behavioral economics and psychology research have repeatedly demonstrated that people just don’t have consistent preferences. You can radically influence and alter decision-makers’ behavior in all sorts of ways that simply aren’t predicted or accounted for by Rational Choice Theory (RCT). I’ll give just two examples here, but there are any number of others, as many excellent books attest (e.g., Dan Ariely’s Predictably Irrational, or Thaler and Sunstein’s Nudge).

The first example stems from famous work by Madrian and Shea (2001) investigating the effects of savings plan designs on employees’ 401(k) choices. By pretty much anyone’s account, decisions about savings plans should be a pretty big deal for most employees. The difference between opting into a 401(k) and opting out of one can easily amount to several hundred thousand dollars over the course of a lifetime, so you would expect people to have a huge incentive to make the choice that’s most consistent with their personal preferences (whether those preferences happen to be for splurging now or saving for later). Yet what Madrian and Shea convincingly showed was that most employees simply go with the default plan option. When companies switch from opt-in to opt-out (i.e., instead of calling up HR and saying you want to join the plan, you’re enrolled by default, and have to fill out a form if you want to opt out), nearly 50% more employees end up enrolled in the 401(k).

This result (and any number of others along similar lines) makes no sense under rational choice theory, because it’s virtually impossible to conceive of a consistent set of preferences that would explain this type of behavior. Many of the same employees who won’t take ten minutes out of their day to opt in or out of their 401(k) will undoubtedly drive across town to save a few dollars on their groceries; like most people, they’ll look for bargains, buy cheaper goods rather than more expensive ones, worry about leaving something for their children after they’re gone, and so on and so forth. And one can’t simply attribute the discrepancy in behavior to ignorance (i.e., “no one reads the fine print!”), because the whole point of massive incentives is that they’re supposed to incentivize you to do things like look up information that could be relevant to, oh, say, having hundreds of thousands of extra dollars in your bank account in forty years. If you’re willing to look for coupons in the Sunday paper to save a few dollars, but aren’t willing to call up HR and ask about your savings plan, there is, to put it frankly, something mildly inconsistent about your preferences.

The other example stems from the enormous literature on risk aversion. The classic risk aversion finding is that most people require a higher nominal payoff on risky prospects than on safe ones before they’re willing to accept the risky prospect. For instance, most people would rather have $10 for sure than $50 with 25% probability, even though the expected value of the latter is 25% higher (an amazing return!). Risk aversion is a pervasive phenomenon, and crops up everywhere, including in financial investments, where it gives rise to the equity premium puzzle (the puzzle being that many investors prefer bonds to stocks even though the historical record suggests a massively higher rate of return for stocks over the long term).
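In case you don’t trust my arithmetic:

```python
# The two prospects, in expectation:
sure_thing = 1.00 * 10   # $10 with certainty
gamble     = 0.25 * 50   # $50 with 25% probability (and nothing otherwise)

print(sure_thing)               # 10.0
print(gamble)                   # 12.5
print(gamble / sure_thing - 1)  # 0.25 -> the gamble's EV is 25% higher
```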

From a naive standpoint, you might think the challenge risk aversion poses to rational choice theory is that risk aversion is just, you know, stupid. Meaning, if someone keeps offering you $10 with 100% probability or $50 with 25% probability, it’s stupid to keep making the former choice (which is what most people do when you ask them) when you’re going to make much more money by making the latter choice. But again, remember, economic rationality isn’t about preferences per se, it’s about consistency of preferences. Risk aversion may violate a simplistic theory under which people are supposed to simply maximize expected value at all times; but then, no one’s really believed that for several hundred years (not since Daniel Bernoulli, anyway). The standard economist’s response to the observation that people are risk averse is to observe that people aren’t maximizing expected value, they’re maximizing utility. Utility is a non-linear function of objective value, so people assign a different weight to the (N+1)th dollar earned than to the Nth dollar earned. For instance, the classical value function identified by Kahneman and Tversky in their seminal work (for which Kahneman won the Nobel prize in part) looks like this:

The idea here is that the average person overvalues small gains relative to larger gains; i.e., you may be more satisfied when you receive $200 than when you receive $100, but you’re not going to be twice as satisfied.
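If you’d rather draw the curve than squint at it, here’s a sketch using the functional form and the median parameter estimates from Tversky and Kahneman’s 1992 follow-up work (an exponent of 0.88 for curvature, and a loss-aversion coefficient of 2.25):

```python
# Prospect theory value function: v(x) = x**a for gains and
# v(x) = -lam * (-x)**a for losses, with Tversky & Kahneman's (1992)
# median estimates a = 0.88 and lam = 2.25.
import numpy as np
import matplotlib.pyplot as plt

def value(x, a=0.88, lam=2.25):
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0, -lam) * np.abs(x) ** a

x = np.linspace(-200, 200, 401)
plt.plot(x, value(x))
plt.axhline(0, linewidth=0.5)
plt.axvline(0, linewidth=0.5)
plt.xlabel("objective gain or loss ($)")
plt.ylabel("subjective value")
plt.show()

# $200 is better than $100, but nowhere near twice as good:
print(value(200) / value(100))  # ~1.84
```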

This seemed like a sufficient response for a while, since it appeared to preserve consistency as the hallmark of rationality. The idea is that you can have people who have more or less curvature in their value and probability weighting functions (i.e., some people are more risk averse than others), and that’s just fine as long as those preferences are consistent. Meaning, it’s okay if you prefer $50 with 25% probability to $10 with 100% probability just as long as you also prefer $50 with 25% probability to $8 with 100% probability, or to $7 with 100% probability, and so on. So long as your preferences are consistent, your behavior can be explained by RCT.

The problem, as many people have noted, is that in actuality there isn’t any set of consistent preferences that can explain most people’s risk-averse behavior. A succinct and influential summary of the problem was provided by Rabin (2000), who showed formally that the choices people make when dealing with small amounts of money imply such an absurd level of risk aversion that the only way for people to be consistent would be to also reject gambles offering an infinitely large payoff whenever they carried any more than a modest potential loss. Put differently,

if a person always turns down a 50-50 lose $100/gain $110 gamble, she will always turn down a 50-50 lose $800/gain $2,090 gamble. … Somebody who always turns down 50-50 lose $100/gain $125 gambles will turn down any gamble with a 50% chance of losing $600.

The reason for this is simply that any concave utility function consistent with the small-stakes choices (e.g., a refusal to take a 50-50 bet with lose $100/gain $110 outcomes, at any level of wealth) has to flatten out very quickly. So for people to have internally consistent preferences, they would literally have to be turning down infinite but uncertain payoffs in favor of certain but modest ones. Which of course is absurd; in practice, you would have a hard time finding many people who would refuse a coin toss where they lose $600 on heads and win $$$infinity dollarz$$$ on tails. Though you might have a very difficult time convincing them you’re serious about the bet. And an even more difficult time finding infinity trucks with which to haul in those infinity dollarz in the event you win.
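For the skeptics, here’s a back-of-the-envelope version of Rabin’s argument. The constant-relative-risk-aversion utility function and the $20,000 wealth level below are illustrative assumptions on my part (Rabin’s actual theorem assumes nothing about utility beyond concavity), but they make the point concrete: curvature steep enough to refuse the small gamble also refuses a $600 loss paired with a ten-billion-dollar gain:

```python
# A back-of-the-envelope illustration of Rabin's point. The CRRA utility
# u(w) = w**(1-rho) / (1-rho) and the $20,000 wealth level are my own
# illustrative assumptions; Rabin's theorem assumes only concavity.
def u(w, rho=40.0):
    return w ** (1 - rho) / (1 - rho)

def accepts(wealth, loss, gain, rho=40.0):
    """True if a 50-50 lose-`loss`/gain-`gain` gamble beats standing pat."""
    expected = 0.5 * u(wealth - loss, rho) + 0.5 * u(wealth + gain, rho)
    return expected > u(wealth, rho)

w = 20_000.0
# Curvature just steep enough to refuse the small gamble...
print(accepts(w, loss=100, gain=125))             # False
# ...also refuses a modest loss paired with an astronomical gain:
print(accepts(w, loss=600, gain=10_000_000_000))  # False
```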

Anyway, these are just two prominent examples; there are literally hundreds of other similar examples in the behavioral economics literature of supposedly rational people displaying wildly inconsistent behavior. And not just a minority of people; it’s pretty much all of us. Presumably including economists. Irrationality, as it turns out, is the norm and not the exception. In some ways, what’s surprising is not that we’re inconsistent, but that we manage to do so well despite our many biases and failings.

To return to the puzzle Andrew Gelman posed, though, I suspect Andrew’s being facetious, and doesn’t really see this as much of a puzzle at all. Here’s his solution:

The key, I believe, is that “rationality” is a good thing. We all like to associate with good things, right? Argument 1 has a populist feel (people are rational!) and argument 2 has an elitist feel (economists are special!). But both are ways of associating oneself with rationality. It’s almost like the important thing is to be in the same room with rationality; it hardly matters whether you yourself are the exemplar of rationality, or whether you’re celebrating the rationality of others.

This seems like a somewhat more tactful way of saying what I suspect Andrew and many other people (and probably most academic psychologists, myself included) already believe, which is that there isn’t really any reason to think that people are rational in the sense demanded by RCT. That’s not to say economics is bunk, or that it doesn’t make sense to think about incentives as a means of altering behavior. Obviously, in a great many situations, pretending that people are rational is a reasonable approximation to the truth. For instance, in general, if you offer more money to have a job done, more people will be willing to do that job. But the fact that the tenets of standard economics often work shouldn’t blind us to the fact that they also often don’t, and that they fail in many systematic and predictable ways. For instance, sometimes paying people more money makes them perform worse, not better. And sometimes it saps them of the motivation to work at all. Faced with overwhelming empirical evidence that people don’t behave as the theory predicts, the appropriate response should be to revisit the theory, or at least to recognize which situations it should be applied in and which it shouldn’t.

Anyway, that’s a long-winded way of saying I don’t think Andrew’s puzzle is really a puzzle. Economists simply don’t express their own preferences and views about consistency consistently, and it’s not surprising, because neither does anyone else. That doesn’t make them (or us) bad people; it just makes us all people.

amusing evidence of a lazy cut and paste job

In the course of a literature search, I came across the following abstract, from a 1990 paper titled “Taking People at Face Value: Evidence for the Kernel of Truth Hypothesis”, and taken directly from the publisher’s website:

Two studies examined the validity of impressions based on static facial appearance. In Study 1, the content of previously unacquainted classmates’ impressions of one another was assessed during the 1st, 5th, and 9th weeks of the semester. These impressions were compared with ratings of facial photographs of the participants that were provided by a separate group of unacquainted judges. Impressions based on facial appearance alone predicted impressions provided by classmates after up to 9 weeks of acquaintance. Study 2 revealed correspondences between self ratings provided by stimulus persons, and ratings of their faces provided by unacquainted judges. Mechanisms by which these links may develop are discussed.

Now fully revealed by the fire and candlelight, I was amazed more than ever to behold the transformation of Heathcliff. His countenance was much older in expression and decision of feature than Mr. Linton’s; it looked intelligent and retained no marks of former degradation. A half civilized ferocity lurked yet in the depressed brows and eyes full of black fire, but it was subdued.

 

Apparently social psychology was a much more interesting place in 1990.

Some more investigation revealed the source of the problem. Here’s the first page of the PDF:

 

So it looks to be a lazy cut and paste job on the publisher’s part rather than a looking glass into the creative world of scientific writing in the early 1990s. Which I guess is for the best, otherwise Diane S. Berry would be on the hook for plagiarizing from Wuthering Heights. And not in a subtle way either.

the APS likes me!

Somehow I wound up profiled in this month’s issue of the APS Observer as a “Rising Star”. I’d like to believe this means I’m a really big deal now, but I suspect what it actually means is that someone on the nominating committee at APS has extraordinarily bad judgment. I say this in no small part because I know some of the other people who were named Rising Stars quite well (congrats to Karl Szpunar, Jason Chan, and Alan Castel, among many other people!), so I’m pretty sure I can distinguish people who actually deserve this from, say, me.

Of course, I’m not going to look a gift horse in the mouth. And I’m certainly thrilled to be picked for this. I know these things are kind of a crapshoot, but it still feels really nice. So while the part of my brain that understands measurement error is saying “meh, luck of the draw,” that other part of my brain that likes to be told it’s awesome is in the middle of a three-day coke bender right now*. The only regret both parts of the brain have is that there isn’t any money attached to the award–or even a token prize like, say, a free statistician for a year. But I don’t think I’m going to push my luck by complaining to APS about it.

One thing I like a lot about the format of the Rising Star awards is they give you a full page to talk about yourself and your research. If there’s one thing I like to talk about, it’s myself. Usually, you can’t talk about yourself for very long before people start giving you dirty looks. But in this case, it’s sanctioned, so I guess it’s okay. In any case, the kind folks at the Observer sent me a series of seven questions to answer. And being an upstanding gentleman who likes to be given fancy awards, I promptly obliged. I figured they would just run what I sent them with minor edits… but I WAS VERY WRONG. They promptly disassembled nearly all of my brilliant observations and advice and replaced them with some very tame ramblings. So if you actually bother to read my responses, and happen to fall asleep halfway through, you’ll know who to blame. But just to set the record straight, I figured I would run through each of the boilerplate questions I was asked, and show you the answer that was printed in the Observer as compared to what I actually wrote**:

What does your research focus on?

What they printed: Most of my current research focuses on what you might call psychoinformatics: the application of information technology to psychology, with the aim of advancing our ability to study the human mind and brain. I’m interested in developing new ways to acquire, synthesize, and share data in psychology and cognitive neuroscience. Some of the projects I’ve worked on include developing new ways to measure personality more efficiently, adapting computer science metrics of string similarity to visual word recognition, modeling fMRI data on extremely short timescales, and conducting large-scale automated synthesis of published neuroimaging findings. The common theme that binds these disparate projects together is the desire to develop new ways of conceptualizing and addressing psychological problems; I believe very strongly in the transformative power of good methods.

What I actually said: I don’t know! There’s so much interesting stuff to think about! I can’t choose!

What drew you to this line of research? Why is it exciting to you?

What they printed: Technology enriches and improves our lives in every domain, and science is no exception. In the biomedical sciences in particular, many revolutionary discoveries would have been impossible without substantial advances in information technology. Entire subfields of research in molecular biology and genetics are now synonymous with bioinformatics, and neuroscience is currently also experiencing something of a neuroinformatics revolution. The same trend is only just beginning to emerge in psychology, but we’re already able to do amazing things that would have been unthinkable 10 or 20 years ago. For instance, we can now collect data from thousands of people all over the world online, sample people’s inner thoughts and feelings in real time via their phones, harness enormous datasets released by governments and corporations to study everything from how people navigate their spatial world to how they interact with their friends, and use high-performance computing platforms to solve previously intractable problems through large-scale simulation. Over the next few years, I think we’re going to see transformative changes in the way we study the human mind and brain, and I find that a tremendously exciting thing to be involved in.

What I actually said: I like psychology a lot, and I like technology a lot. Why not combine them!

Who were/are your mentors or psychological influences?

What they printed: I’ve been fortunate to have outstanding teachers and mentors at every stage of my training. I actually started my academic career quite disinterested in science and owe my career trajectory in no small part to two stellar philosophy professors (Rob Stainton and Chris Viger) who convinced me as an undergraduate that engaging with empirical data was a surprisingly good way to discover how the world really works. I can’t possibly do justice to all the valuable lessons my graduate and postdoctoral mentors have taught me, so let me just pick a few out of a hat. Among many other things, Todd Braver taught me how to talk through problems collaboratively and keep recursively questioning the answers to problems until a clear understanding materializes. Randy Larsen taught me that patience really is a virtue, despite my frequent misgivings. Tor Wager has taught me to think more programmatically about my research and to challenge myself to learn new skills. All of these people are living proof that you can be an ambitious, hard-working, and productive scientist and still be extraordinarily kind and generous with your time. I don’t think I embody those qualities myself right now, but at least I know what to shoot for.

What I actually said: Richard Feynman, Richard Hamming, and my mother. Not necessarily in that order.

To what do you attribute your success in the science?

What they printed: Mostly to blind luck. So far I’ve managed to stumble from one great research and mentoring situation to another. I’ve been fortunate to have exceptional advisors who’ve provided me with the perfect balance of freedom and guidance and amazing colleagues and friends who’ve been happy to help me out with ideas and resources whenever I’m completely out of my depth — which is most of the time.

To the extent that I can take personal credit for anything, I think I’ve been good about pursuing ideas I’m passionate about and believe in, even when they seem unlikely to pay off at first. I’m also a big proponent of exploratory research; I think pure exploration is tremendously undervalued in psychology. Many of my projects have developed serendipitously, as a result of asking, “What happens if we try doing it this way?”

What I actually said: Mostly to blind luck.

What’s your future research agenda?

What they printed: I’d like to develop technology-based research platforms that improve psychologists’ ability to answer existing questions while simultaneously opening up entirely new avenues of research. That includes things like developing ways to collect large amounts of data more efficiently, tracking research participants over time, automatically synthesizing the results of published studies, building online data repositories and collaboration tools, and more. I know that all sounds incredibly vague, and if you have some ideas about how to go about any of it, I’d love to collaborate! And by collaborate, I mean that I’ll brew the coffee and you’ll do the work.

What I actually said: Trading coffee for publications?

Any advice for even younger psychological scientists? What would you tell someone just now entering graduate school or getting their PhD?

What they printed: The responsible thing would probably be to say “Don’t go to graduate school.” But if it’s too late for that, I’d recommend finding brilliant mentors and colleagues and serving them coffee exactly the way they like it. Failing that, find projects you’re passionate about, work with people you enjoy being around, develop good technical skills, and don’t be afraid to try out crazy ideas. Leave your office door open, and talk to everyone you can about the research they’re doing, even if it doesn’t seem immediately relevant. Good ideas can come from anywhere and often do.

What I actually said: “Don’t go to graduate school.”

What publication you are most proud of or feel has been most important to your career?

What they printed: Yarkoni, T., Poldrack, R. A., Nichols, T. E., Van Essen, D. C., & Wager, T. D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Manuscript submitted for publication.

In this paper, we introduce a highly automated platform for synthesizing data from thousands of published functional neuroimaging studies. We used a combination of text mining, meta-analysis, and machine learning to automatically generate maps of brain activity for hundreds of different psychological concepts, and we showed that these results could be used to “decode” cognitive states from brain activity in individual human subjects in a relatively open-ended way. I’m very proud of this work, and I’m quite glad that my co-authors agreed to make me first author in return for getting their coffee just right. Unfortunately, the paper isn’t published yet, so you’ll just have to take my word for it that it’s really neat stuff. And if you’re thinking, “Isn’t it awfully convenient that his best paper is unpublished?”… why, yes. Yes it is.

What I actually said: …actually, that’s almost exactly what I said. Except they inserted that bit about trading coffee for co-authorship. Really all I had to do was ask my co-authors nicely.

Anyway, like I said, it’s really nice to be honored in this way, even if I don’t really deserve it (and that’s not false modesty–I’m generally the first to tell other people when I think I’ve done something awesome). But I’m a firm believer in regression to the mean, so I suspect the run of good luck won’t last. In a few years, when I’ve done almost no new original work, failed to land a tenure-track job, and dropped out of academia to ride horses around the racetrack***, you can tell people that you knew me back when I was a Rising Star. Right before you tell them you don’t know what the hell happened.

———————————-

* But not really.

** Totally lying. Pretty much every word is as I wrote it. And the Observer staff were great.

*** Hopefully none of these things will happen. Except the jockey thing; that would be awesome.

CNS 2011: a first-person shorthand account in the manner of Rocky Steps

Friday, April 1

4 pm. Arrive at SFO International on bumpy flight from Denver.

4:45 pm. Approach well-dressed man downtown and open mouth to ask for directions to Hyatt Regency San Francisco. “Sorry,” says well-dressed man, “No change to give.” Back off slowly, swinging bags, beard, and poster tube wildly, mumbling “I’m not a panhandler, I’m a neuroscientist.” Realize that difference between the two may be smaller than initially suspected.

6:30 pm. Hear loud knocking on hotel room door. Open door to find roommate. Say hello to roommate. Realize roommate is extremely drunk from East Coast flight. Offer roommate bag of coffee and orange tic-tacs. Roommate is confused, asks, “are you drunk?” Ignore roommate’s question. “You’re drunk, aren’t you.” Deny roommate’s unsubstantiated accusations. “When you write about this on your blog, you better not try to make it look like I’m the drunk one,” roommate says. Resolve to ignore roommate’s crazy talk for next 4 days.

6:45 pm. Attempt to open window of 10th floor hotel room in order to procure fresh air for face. Window refuses to open. Commence nudging of, screaming at, and bargaining with window. Window still refuses to open. Roommate points out sticker saying window does not open. Ignore sticker, continue berating window. Window still refuses to open, but now has low self-esteem.

8 pm. Have romantic candlelight dinner at expensive french restaurant with roommate. Make jokes all evening about ideal location (San Francisco) for start of new intimate relationship. Suspect roommate is uncomfortable, but persist in faux wooing. Roommate finally turns tables by offering to put out. Experience heightened level of discomfort, but still finish all of steak tartare and order creme brulee. Dessert appetite is immune to off-color humor!

11 pm – 1 am. Grand tour of seedy SF bars with roommate and old grad school friend. New nightlife low: denied entrance to seedy dance club because shoes insufficiently classy. Stupid Teva sandals.

Saturday, April 2

9:30 am. Wake up late. Contemplate running downstairs to check out ongoing special symposium for famous person who does important research. Decide against. Contemplate visiting hotel gym to work off creme brulee from last night. Decide against. Contemplate reading conference program in bed and circling interesting posters to attend. Decide against. Contemplate going back to sleep. Consult with self, make unanimous decision in favor.

1 pm. Have extended lunch meeting with collaborators at Ferry Building to discuss incipient top-secret research project involving diesel generator, overstock beanie babies, and apple core. Already giving away too much!

3:30 pm. Return to hotel. Discover hotel is now swarming with name badges attached to vaguely familiar faces. Hug vaguely familiar faces. Hugs are met with startled cries. Realize that vaguely familiar faces are actually completely unfamiliar faces. Wrong conference: Young Republicans, not Cognitive Neuroscientists. Make beeline for elevator bank, pursued by angry middle-aged men dressed in American flags.

5 pm. Poster session A! The sights! The sounds! The lone free drink at the reception! The wonders of yellow 8-point text on black 6′ x 4′ background! Too hard to pick a favorite thing, not even going to try. Okay, fine: free schwag at the exhibitor stands.

5 pm – 7 pm. Chat with old friends. Have good time catching up. Only non-fictionalized bullet point of entire piece.

8 pm. Dinner at belly dancing restaurant in lower Haight. Great conversation, good food, mediocre dancing. Towards end of night, insist on demonstrating own prowess in fine art of torso shaking; climb on table and gyrate body wildly, alternately singing Oompa-Loompa song and yelling “get in my belly!” at other restaurant patrons. Nobody tips.

12:30 am. Take the last train to Clarksville. Take last N train back to Hyatt Regency hotel.

Sunday, April 3

7 am. Wake up with amazing lack of hangover. Celebrate amazing lack of hangover by running repeated victory laps around 10th floor of Hyatt Regency, Rocky Steps style. Quickly realize initial estimate of hangover absence off by order of magnitude. Revise estimate; collapse in puddle on hotel room floor. Refuse to move until first morning session.

8:15 am. Wander the eight Caltech aisles of morning poster session in search of breakfast. Fascinating stuff, but this early in morning, only value signals of interest are smell and sight of coffee, muffins, and bagels.

10 am. Terrific symposium includes excellent talks about emotion, brain-body communication, and motivation, but favorite moment is still when friend arrives carrying bucket of aspirin.

1 pm. Bump into old grad school friend outside; decide to grab lunch on pier behind Ferry Building. Discuss anterograde amnesia and dating habits of mutual friends. Chicken and tofu cake is delicious. Sun is out, temperature is mild; perfect day to not attend poster sessions.

1:15 – 2 pm. Attend poster session.

2 pm – 5 pm. Presenting poster in 3 hours! Have full-blown panic attack in hotel room. Not about poster, about General Hospital. Why won’t Lulu take Dante’s advice and call support group number for alcoholics’ families?!?! Alcohol is Luke’s problem, Lulu! Call that number!

5 pm. Present world’s most amazing poster to three people. Launch into well-rehearsed speech about importance of work and great glory of sophisticated technical methodology before realizing two out of three people are mistakenly there for coffee and cake, and third person mistook presenter for someone famous. Pause to allow audience to mumble excuses and run to coffee bar. When coast is clear, resume glaring at anyone who dares to traverse poster aisle. Believe strongly in marking one’s territory.

8 pm. Lab dinner at House of Nanking. Food is excellent, despite unreasonably low tablespace-to-floorspace ratio. Conversation revolves around fainting goats, ‘relaxation’ in Thailand, and, occasionally, science.

10 pm. Karaoke at The Mint. Compare performance of CNS attendees with control group of regulars; establish presence of robust negative correlation between years of education and singing ability. Completely wreck voice performing whitest rendition ever of Shaggy’s “Oh Carolina”. Crowd jeers. No, wait, crowd gyrates. In wholesome scientific manner. Crowd is composed entirely of people with low self-monitoring skills; what luck! DJ grimaces through entire song and most of previous and subsequent songs.

2 am. Take cab back to hotel with graduate students and Memory Professor. Memory Professor is drunk; manages to nearly fall out of cab while cab in motion. In-cab conversation revolves around merits of dynamic programming languages. No consensus reached, but civility maintained. Arrival at hotel: all cab inhabitants below professorial rank immediately slip out of cab and head for elevators, leaving Memory Professor to settle bill. In elevator, Graduate Student A suggests that attempt to push Memory Professor out of moving cab was bad idea in view of Graduate Student A’s impending post-doc with Memory Professor. Acknowledge probable wisdom of Graduate Student A’s observation while simultaneously resolving to not adjust own degenerate behavior in the slightest.

2:15 am. Drink at least 24 ounces of water before attaining horizontal position. Fall asleep humming bars of Elliott Smith’s Angeles. Wrong city, but close enough.

Monday, April 4

8 am. Wake up hangover free again! For real this time. No Rocky Steps dance. Shower and brush teeth. Delicately stroke roommate’s cheek (he’ll never know) before heading downstairs for poster session.

8:30 am. Bagels, muffin, coffee. Not necessarily in that order.

9 am – 12 pm. Skip sessions, spend morning in hotel room working. While trying to write next section of grant proposal, experience strange sensation of time looping back on itself, like a snake eating its own tail, but also eating grant proposal at same time. Awake from unexpected nap with ‘Innovation’ section in mouth.

12:30 pm. Skip lunch; for some reason, not very hungry.

1 pm. Visit poster with screaming purple title saying “COME HERE FOR FREE CHOCOLATE.” Am impressed with poster title and poster, but disappointed by free chocolate selection: Dove eggs and purple Hershey’s kisses–worst chocolate in the world! Resolve to show annoyance by disrupting presenter’s attempts to maintain conversation with audience. Quickly knocked out by chocolate eggs thrown by presenter.

5 pm. Wake up in hotel room with headache and no recollection of day’s events. Virus or hangover? Unclear. For some reason, hair smells like chocolate.

7:30 pm. Dinner at Ferry Building with Brain Camp friends. Have now visited Ferry Building at least one hundred times in seventy-two hours. Am now compulsively visiting Ferry Building every fifteen minutes just to feel normal.

9:30 pm. Party at Americano Restaurant & Bar for Young Investigator Award winner. Award comes with $500 and strict instructions that it be spent on drinks for total strangers. Strange tradition, but no one complains.

11 pm. Bar is crowded with neuroscientists having great time at Young Investigator’s expense.

11:15 pm. Drink budget runs out.

11:17 pm. Neuroscientists mysteriously vanish.

1 am. Stroll through San Francisco streets in search of drink. Three false alarms, but finally arrive at open pub 10 minutes before last call. Have extended debate with friend over whether hotel room can be called ‘home’. Am decidedly in No camp; ‘home’ is for long-standing attachments, not 4-day hotel hobo runs.

2 am. Walk home.

Tuesday, April 5

9:05 am. Show up 5 minutes late for bagels and muffins. All gone! Experience Apocalypse Now moment on inside, but manage not to show it–except for lone tear. Drown sorrows in Tazo Wild Sweet Orange tea. Tea completely fails to live up to name; experience second, smaller, Apocalypse Now moment. Roommate walks over and asks if everything okay, then gently strokes cheek and brushes away lone tear (he knew!!!).

9:10 am – 1 pm. Intermittently visit poster and symposium halls. Not sure why. Must be force of habit learning system.

1:30 pm. Lunch with friends at Thai restaurant near Golden Gate Park. Fill belly up with coconut, noodles, and crab. About to get on table to express gratitude with belly dance, but notice that friends have suddenly disappeared.

2 – 5 pm. Roam around Golden Gate Park and Haight-Ashbury. Stop at Whole Foods for friend to use bathroom. Get chased out of Whole Foods for using bathroom without permission. Very exciting; first time feeling alive on entire trip! Continue down Haight. Discuss socks, ice cream addiction (no such thing), and funding situation in Europe. Turns out it sucks there too.

5:15 pm. Take BART to airport with lab members. Watch San Francisco recede behind train. Sink into slightly melancholic state, but recognize change of scenery is for the best: constitution couldn’t handle more Rocky Steps mornings.

7:55 pm. Suddenly rediscover pronouns as airplane peels away from gate.

8 pm PST – 11:20 pm MST. The flight's almost completely empty; I get to stretch out across the entire emergency exit aisle. The sun goes down as we cross the Sierra Nevada; the last of the ice in my cup melts into water somewhere between Provo and Grand Junction. As we start our descent into Denver, the lights come out in force, and I find myself preemptively bored at the thought of the long shuttle ride home. For a moment, I wish I was back in my room at the Hyatt at 8 am–about to run Rocky Steps around the hotel, or head down to the poster hall to find someone to chat with over a bagel and coffee. For some reason, I still feel like I didn't get quite enough time to hang out with all the people I wanted to see, despite barely sleeping in 4 days. But then sanity returns, and the thought quickly passes.

what Paul Meehl might say about graduate school admissions

Sanjay Srivastava has an excellent post up today discussing the common belief among many academics (or at least psychologists) that graduate school admission interviews aren’t very predictive of actual success, and should be assigned little or no weight when making admissions decisions:

The argument usually goes something like this: “All the evidence from personnel selection studies says that interviews don’t predict anything. We are wasting people’s time and money by interviewing grad students, and we are possibly making our decisions worse by substituting bad information for good.”

I have been hearing more or less that same thing for years, starting when I was in grad school myself. In fact, I have heard it often enough that, not being familiar with the literature myself, I accepted what people were saying at face value. But I finally got curious about what the literature actually says, so I looked it up.

I confess that I must have been drinking from the kool-aid spigot, because until I read Sanjay’s post, I’d long believed something very much like this myself, and for much the same reason. I’d never bothered to actually, you know, look at the data myself. Turns out the evidence and the kool-aid are not compatible:

A little Google Scholaring for terms like “employment interviews” and “incremental validity” led me to a bunch of meta-analyses that concluded that in fact interviews can and do provide useful information above and beyond other valid sources of information (like cognitive ability tests, work sample tests, conscientiousness, etc.). One of the most heavily cited is a 1998 Psych Bulletin paper by Schmidt and Hunter (link is a pdf; it’s also discussed in this blog post). Another was this paper by Cortina et al., which makes finer distinctions among different kinds of interviews. The meta-analyses generally seem to agree that (a) interviews correlate with job performance assessments and other criterion measures, (b) interviews aren’t as strong predictors as cognitive ability, (c) but they do provide incremental (non-overlapping) information, and (d) in those meta-analyses that make distinctions between different kinds of interviews, structured interviews are better than unstructured interviews.

This seems entirely reasonable, and I agree with Sanjay that it clearly shows that admissions interviews aren’t useless, at least in an actuarial sense. That said, after thinking about it for a while, I’m not sure these findings really address the central question admissions committees care about. When deciding which candidates to admit as students, the relevant question isn’t really “what factors predict success in graduate school?”; it’s “what factors should the admissions committee attend to when making a decision?” These may seem like the same thing, but they’re not. And the reason they’re not is that knowing which factors are predictive of success is no guarantee that faculty are actually going to be able to use that information in an appropriate way. Knowing what predicts performance is only half the story, as it were; you also need to know how to weight the different factors in order to generate an optimal prediction.

In practice, humans turn out to be incredibly bad at predicting outcomes based on multiple factors. An enormous literature on mechanical (or actuarial) prediction, which Sanjay mentions in his post, has repeatedly demonstrated that in many domains, human judgments are consistently and often substantially outperformed by simple regression equations. There are several reasons for this gap, but one of the biggest ones is that people are just shitty at quantitatively integrating multiple continuous variables. When you visit a car dealership, you may very well be aware that your long-term satisfaction with any purchase is likely to depend on some combination of horsepower, handling, gas mileage, seating comfort, number of cupholders, and so on. But the odds that you’ll actually be able to combine that information in an optimal way are essentially nil. Our brains are simply not designed to work that way; you can’t internally compute the value you’d get out of a car using an equation like 1.03*cupholders + 0.021*horsepower + 0.3*mileage. Some of us try to do it that way–e.g., by making very long pro and con lists detailing all the relevant factors we can possibly think of–but it tends not to work out very well (e.g., you total up the numbers and realize, hey, that’s not the answer I wanted! And then you go buy that antique ’68 Cadillac you had your eye on the whole time you were pretending to count cupholders in the Nissan Maxima).
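For what it’s worth, the “mechanical” part of mechanical prediction really is this trivial. Here’s a minimal Python sketch of the idea, using the made-up weights from the equation above; the car specs are equally fictional:

# Mechanical prediction is just a weighted linear sum of the relevant factors.
# The weights come from the (made-up) equation in the paragraph above; the
# car attributes below are hypothetical numbers, not real specs.
WEIGHTS = {"cupholders": 1.03, "horsepower": 0.021, "mileage": 0.3}

def predicted_satisfaction(car):
    """Combine a car's attributes into a single predicted satisfaction score."""
    return sum(WEIGHTS[attr] * value for attr, value in car.items())

maxima = {"cupholders": 6, "horsepower": 290, "mileage": 24}
cadillac_68 = {"cupholders": 2, "horsepower": 340, "mileage": 12}

for name, car in [("Nissan Maxima", maxima), ("'68 Cadillac", cadillac_68)]:
    print(f"{name}: {predicted_satisfaction(car):.2f}")

# Prints 19.47 for the Maxima and 12.80 for the Cadillac; the equation picks
# the sensible car, which is exactly why most of us quietly ignore it.

The point isn’t that these particular numbers mean anything; it’s that a three-line function does effortlessly what our brains demonstrably can’t.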

Admissions committees face much the same problem. The trouble lies not so much in determining which factors predict graduate school success (or, for that matter, many other outcomes we care about in daily life), but in determining how to best combine them. Knowing that interview performance incrementally improves predictions is only useful if you can actually trust decision-makers to weight that variable very lightly relative to other more meaningful predictors like GREs and GPAs. And that’s a difficult proposition, because I suspect that admissions discussions rarely go like this:

Faculty Member 1: I think we should accept Candidate X. Her GREs are off the chart, great GPA, already has two publications.
Faculty Member 2: I didn’t like X at all. She didn’t seem very excited to be here.
FM1: Well, that doesn’t matter so much. Unless you really got a strong feeling that she wouldn’t stick it out in the program, it probably won’t make much of a difference, performance-wise.
FM2: Okay, fine, we’ll accept her.

And more often go like this:

FM1: Let’s take Candidate X. Her GREs are off the chart, great GPA, already has two publications.
FM2: I didn’t like X at all. She didn’t seem very excited to be here.
FM1: Oh, you thought so too? That’s kind of how I felt too, but I didn’t want to say anything.
FM2: Okay, we won’t accept X. We have plenty of other good candidates with numbers that are nearly as good and who seemed more pleasant.

Admittedly, I don’t have any direct evidence to back up this conjecture. Except that I think it would be pretty remarkable if academic faculty departed from experts in pretty much every other domain that’s been tested (clinical practice, medical diagnosis, criminal recidivism, etc.) and were actually able to do as well (or even close to as well) as a simple regression equation. For what it’s worth, in many of the studies of mechanical prediction, the human experts are explicitly given all of the information passed to the prediction equation, and still do relatively poorly. In other words, you can hand a clinical psychologist a folder full of quantitative information about a patient, tell them to weight it however they want, and even the best clinicians are still going to be outperformed by a mechanical prediction (if you doubt this to be true, I second Sanjay in directing you to Paul Meehl’s seminal body of work–truly some of the most important and elegant work ever done in psychology, and if you haven’t read it, you’re missing out). And in some sense, faculty members aren’t really even experts about admissions, since they only do it once a year. So I’m pretty skeptical that admissions committees actually manage to weight their firsthand personal experience with candidates appropriately when making their final decisions. It seems much more likely that any personality impressions they come away with will just tend to drown out prior assessments based on (relatively) objective data.

That all said, I couldn’t agree more with Sanjay’s ultimate conclusion, so I’ll just end with this quote:

That, of course, is a testable question. So if you are an evidence-based curmudgeon, you should probably want some relevant data. I was not able to find any studies that specifically addressed the importance of rapport and interest-matching as predictors of later performance in a doctoral program. (Indeed, validity studies of graduate admissions are few and far between, and the ones I could find were mostly for medical school and MBA programs, which are very different from research-oriented Ph.D. programs.) It would be worth doing such studies, but not easy.

Oh, except that I do want to add that I really like the phrase “evidence-based curmudgeon”, and I’m totally stealing it.