the truth is not optional: five bad reasons (and one mediocre one) for defending the status quo

You could be forgiven for thinking that academic psychologists have all suddenly turned into professional whistleblowers. Everywhere you look, interesting new papers are cropping up purporting to describe this or that common-yet-shady methodological practice, and telling us what we can collectively do to solve the problem and improve the quality of the published literature. In just the last year or so, Uri Simonsohn introduced new techniques for detecting fraud, and used those tools to identify at least three cases of high-profile, unabashed data forgery. Simmons and colleagues reported simulations demonstrating that standard exploitation of researcher degrees of freedom in analysis can produce extremely high rates of false positive findings. Pashler and colleagues developed the “PsychFileDrawer” repository for tracking replication attempts. Several researchers raised trenchant questions about the veracity and/or magnitude of many high-profile psychological findings such as John Bargh’s famous social priming effects. Wicherts and colleagues showed that authors of psychology articles who are less willing to share their data upon request are more likely to make basic statistical errors in their papers. And so on and so forth. The flood shows no signs of abating; just last week, the APS journal Perspectives on Psychological Science announced that it’s introducing a new “Registered Replication Report” section that will commit to publishing pre-registered high-quality replication attempts, irrespective of their outcome.

Personally, I think these are all very welcome developments for psychological science. They’re solid indications that we psychologists are going to be able to police ourselves successfully in the face of some pretty serious problems, and they bode well for the long-term health of our discipline. My sense is that the majority of other researchers–perhaps the vast majority–share this sentiment. Still, as with any zeitgeist shift, there are always naysayers. In discussing these various developments and initiatives with other people, I’ve found myself arguing, with somewhat surprising frequency, with people who for various reasons think it’s not such a good thing that Uri Simonsohn is trying to catch fraudsters, or that social priming findings are being questioned, or that the consequences of flexible analyses are being exposed. Since many of the arguments I’ve come across tend to recur, I thought I’d summarize the most common ones here–along with the rebuttals I usually offer for why, with one possible exception, the arguments for giving a pass to sloppy-but-common methodological practices are not very compelling.

“But everyone does it, so how bad can it be?”

We typically assume that long-standing conventions must exist for some good reason, so when someone raises doubts about some widespread practice, it’s quite natural to question the person raising the doubts rather than the practice itself. Could it really, truly be (we say) that there’s something deeply strange and misguided about using p values? Is it really possible that the reporting practices converged on by thousands of researchers in tens of thousands of neuroimaging articles might leave something to be desired? Could failing to correct for the many researcher degrees of freedom associated with most datasets really inflate the false positive rate so dramatically?

The answer to all these questions, of course, is yes–or at least, we should allow that it could be yes. It is, in principle, entirely possible for an entire scientific field to regularly do things in a way that isn’t very good. There are domains where appeals to convention or consensus make perfect sense, because there are few good reasons to do things a certain way except inasmuch as other people do them the same way. If everyone else in your country drives on the right side of the road, you may want to consider driving on the right side of the road too. But science is not one of those domains. In science, there is no intrinsic benefit to doing things just for the sake of convention. In fact, almost by definition, major scientific advances are ones that tend to buck convention and suggest things that other researchers may not have considered possible or likely.

In the context of common methodological practice, it’s no defense at all to say but everyone does it this way, because there are usually relatively objective standards by which we can gauge the quality of our methods, and it’s readily apparent that there are many cases where the consensus approach leaves something to be desired. For instance, you can’t really justify failing to correct for multiple comparisons when you report a single test that’s just barely significant at p < .05 on the grounds that nobody else corrects for multiple comparisons in your field. That may be a valid explanation for why your paper successfully got published (i.e., reviewers didn’t want to hold your feet to the fire for something they themselves are guilty of in their own work), but it’s not a valid defense of the actual science. If you run a t-test on randomly generated data 20 times, you will, on average, get a significant result, p < .05, once. It does no one any good to argue that because the convention in a field is to allow multiple testing–or to ignore statistical power, or to report only p values and not effect sizes, or to omit mention of conditions that didn’t ‘work’, and so on–it’s okay to ignore the issue. There’s a perfectly reasonable question as to whether it’s a smart career move to start imposing methodological rigor on your work unilaterally (see below), but there’s no question that the mere presence of consensus or convention surrounding a methodological practice does not make that practice okay from a scientific standpoint.
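If you want to see the arithmetic for yourself, here’s a quick simulation (a sketch in Python, assuming NumPy and SciPy are available; the parameter values are illustrative). Both groups are drawn from the same distribution, so every “significant” result is a false positive by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_experiments = 10_000  # simulated "papers"
n_tests = 20            # uncorrected tests per paper
n = 30                  # subjects per group

# Both groups come from the SAME normal distribution, so the true
# effect is exactly zero and every p < .05 is a false positive.
a = rng.standard_normal((n_experiments, n_tests, n))
b = rng.standard_normal((n_experiments, n_tests, n))
_, p = stats.ttest_ind(a, b, axis=-1)

per_test = (p < .05).mean()                # rate per individual test
familywise = (p < .05).any(axis=1).mean()  # rate of >= 1 hit per "paper"

print(f"false positives per test:        {per_test:.3f}")    # ~ .05
print(f"papers with >= 1 false positive: {familywise:.3f}")  # ~ 1 - .95**20 = .64
```

Each individual test behaves itself–about 1 in 20 comes out “significant”–but roughly two thirds of the simulated papers contain at least one spurious p < .05, which is exactly why reporting only the test that ‘worked’ is so misleading.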

“But psychology would break if we could only report results that were truly predicted a priori!”

This is a defense that has some plausibility at first blush. It’s certainly true that if you force researchers to correct for multiple comparisons properly, and report the many analyses they actually conducted–and not just those that “worked”–a lot of stuff that used to get through the filter will now get caught in the net. So, by definition, it would be harder to detect unexpected effects in one’s data–even when those unexpected effects are, in some sense, ‘real’. But the important thing to keep in mind is that raising the bar for what constitutes a believable finding doesn’t actually prevent researchers from discovering unexpected new effects; all it means is that it becomes harder to report post-hoc results as pre-hoc results. It’s not at all clear why forcing researchers to put in more effort validating their own unexpected finding is a bad thing.

In fact, forcing researchers to go the extra mile in this way would have one exceedingly important benefit for the field as a whole: it would shift the onus of determining whether an unexpected result is plausible enough to warrant pursuing away from the community as a whole, and towards the individual researcher who discovered the result in the first place. As it stands right now, if I discover an unexpected result (p < .05!) that I can make up a compelling story for, there’s a reasonable chance I might be able to get that single result into a short paper in, say, Psychological Science, and reap all the benefits that attend getting a paper into a “high-impact” journal. So in practice there’s very little penalty for publishing questionable results, even if I myself am not entirely (or even mostly) convinced that those results are reliable. This state of affairs is, to put it mildly, not A Good Thing.

In contrast, if you as an editor or reviewer start insisting that I run another study that directly tests and replicates my unexpected finding before you’re willing to publish my result, I now actually have something at stake. Because it takes time and money to run new studies, I’m probably not going to bother to follow up on my unexpected finding unless I really believe it. Which is exactly as it should be: I’m the guy who discovered the effect, and I know about all the corners I have or haven’t cut in order to produce it; so if anyone should make the decision about whether to spend more taxpayer money chasing the result, it should be me. You, as the reviewer, are not in a great position to know how plausible the effect truly is, because you have no idea how many different types of analyses I attempted before I got something to ‘work’, or how many failed studies I ran that I didn’t tell you about. Given the huge asymmetry in information, it seems perfectly reasonable for reviewers to say, You think you have a really cool and unexpected effect that you found a compelling story for? Great; go and directly replicate it yourself and then we’ll talk.

“But mistakes happen, and people could get falsely accused!”

Some people don’t like the idea of a guy like Simonsohn running around and busting people’s data fabrication operations for the simple reason that they worry that the kind of approach Simonsohn used to detect fraud is just not that well-tested, and that if we’re not careful, innocent people could get swept up in the net. I think this concern stems from fundamentally good intentions, but once again, I think it’s also misguided.

For one thing, it’s important to note that, despite all the press, Simonsohn hasn’t actually done anything qualitatively different from what other whistleblowers or skeptics have done in the past. He may have suggested new techniques that improve the efficiency with which cheating can be detected, but it’s not as though he invented the ability to report or investigate other researchers for suspected misconduct. Researchers suspicious of other researchers’ findings have always used qualitatively similar arguments to raise concerns. They’ve said things like, hey, look, this is a pattern of data that just couldn’t arise by chance, or, the numbers are too similar across different conditions.

More to the point, perhaps, no one is seriously suggesting that independent observers shouldn’t be allowed to raise their concerns about possible misconduct with journal editors, professional organizations, and universities. There really isn’t any viable alternative. Naysayers who worry that innocent people might end up ensnared by false accusations presumably aren’t suggesting that we do away with all of the existing mechanisms for ensuring accountability; but since the role of people like Simonsohn is only to raise suspicion and provide evidence (and not to do the actual investigating or firing), it’s clear that there’s no way to regulate this type of behavior even if we wanted to (which I would argue we don’t). If I wanted to spend the rest of my life scanning the statistical minutiae of psychology articles for evidence of misconduct and reporting it to the appropriate authorities (and I can assure you that I most certainly don’t), there would be nothing anyone could do to stop me, nor should there be. Remember that accusing someone of misconduct is something anyone can do, but establishing that misconduct has actually occurred is a serious task that requires careful internal investigation. No one–certainly not Simonsohn–is suggesting that a routine statistical test should be all it takes to end someone’s career. In fact, Simonsohn himself has noted that he identified a 4th case of likely fraud that he dutifully reported to the appropriate authorities only to be met with complete silence. Given all the incentives universities and journals have to look the other way when accusations of fraud are made, I suspect we should be much more concerned about the false negative rate than the false positive rate when it comes to fraud.

“But it hurts the public’s perception of our field!”

Sometimes people argue that even if the field does have some serious methodological problems, we still shouldn’t discuss them publicly, because doing so is likely to instill a somewhat negative view of psychological research in the public at large. The unspoken implication is that, if the public starts to lose confidence in psychology, fewer students will enroll in psychology courses, fewer faculty positions will be created to teach students, and grant funding to psychologists will decrease. So, by airing our dirty laundry in public, we’re only hurting ourselves. I had an email exchange to exactly this effect with a well-known researcher a few years back in the aftermath of the Vul et al “voodoo correlations” paper–a paper I commented on to the effect that the problem was even worse than suggested. The argument my correspondent raised was, in effect, that we (i.e., neuroimaging researchers) are all at the mercy of agencies like NIH to keep us employed, and if it starts to look like we’re clowning around, the unemployment rate for people with PhDs in cognitive neuroscience might start to rise precipitously.

While I obviously wouldn’t want anyone to lose their job or their funding solely because of a change in public perception, I can’t say I’m very sympathetic to this kind of argument. The problem is that it places short-term preservation of the status quo above both the long-term health of the field and the public’s interest. For one thing, I think you have to be quite optimistic to believe that some of the questionable methodological practices that are relatively widespread in psychology (data snooping, selective reporting, etc.) are going to sort themselves out naturally if we just look the other way and let nature run its course. The obvious reason for skepticism in this regard is that many of the same criticisms have been around for decades, and it’s not clear that anything much has improved. Maybe the best example of this is Sedlmeier and Gigerenzer’s 1989 paper entitled “Do studies of statistical power have an effect on the power of studies?”, in which the authors convincingly showed that despite decades of work by luminaries like Jacob Cohen advocating power analyses, statistical power had not risen appreciably in psychology studies. The presence of such unwelcome demonstrations suggests that sweeping our problems under the rug in the hopes that someone (the mice?) will unobtrusively take care of them for us is wishful thinking.
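For readers unfamiliar with the term: statistical power is just the probability of detecting an effect of a given size with a given sample. Here’s a rough sketch of the kind of calculation Cohen advocated, using the normal approximation to the two-sample t-test (the exact numbers are illustrative, and a proper power analysis would use the noncentral t distribution):

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test,
    using the normal approximation to the test statistic."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)  # noncentrality under the alternative
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

# A "medium" effect (Cohen's d = 0.5) with 30 subjects per group,
# a fairly typical design, gives you roughly a coin flip's chance
# of detecting a perfectly real effect.
print(round(two_sample_power(0.5, 30), 2))  # ~0.49
```

By this approximation, reaching the conventional 80% power for d = 0.5 takes just over 60 subjects per group–precisely the sort of arithmetic Cohen urged people to do before running a study rather than after.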

In any case, even if problems did tend to solve themselves when hidden away from the prying eyes of the media and public, the bigger problem with what we might call the “saving face” defense is that it is, fundamentally, an abuse of taxpayers’ trust. As with so many other things, Richard Feynman summed up the issue eloquently in his famous Cargo Cult Science commencement address:

For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of this work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing–and if they don’t want to support you under those circumstances, then that’s their decision.

The fact of the matter is that our livelihoods as researchers depend directly on the goodwill of the public. And the taxpayers are not funding our research so that we can “discover” interesting-sounding but ultimately unreplicable effects. They’re funding our research so that we can learn more about the human mind and hopefully be able to fix it when it breaks. If a large part of the profession is routinely employing practices that are at odds with those goals, it’s not clear why taxpayers should be footing the bill. From this perspective, it might actually be a good thing for the field to revise its standards, even if (in the worst-case scenario) that causes a short-term contraction in employment.

“But unreliable effects will just fail to replicate, so what’s the big deal?”

This is a surprisingly common defense of sloppy methodology, maybe the single most common one. It’s also an enormous cop-out, since it pre-empts the need to think seriously about what you’re doing in the short term. The idea is that, since no single study is definitive, and a consensus about the reality or magnitude of most effects usually doesn’t develop until many studies have been conducted, it’s reasonable to impose a fairly low bar on initial reports and then wait and see what happens in subsequent replication efforts.

I think this is a nice ideal, but things just don’t seem to work out that way in practice. For one thing, there doesn’t seem to be much of a penalty for publishing high-profile results that later fail to replicate. The reason, I suspect, is that we’re inclined to give researchers the benefit of the doubt: surely (we say to ourselves), Jane Doe did her best, and we like Jane, so why should we question the work she produces? If we’re really so skeptical about her findings, shouldn’t we go replicate them ourselves, or wait for someone else to do it?

While this seems like an agreeable and fair-minded attitude, it isn’t actually a terribly good way to look at things. Granted, if you really did put in your best effort–dotted all your i’s and crossed all your t’s–and still ended up reporting a false result, we shouldn’t punish you for it. I don’t think anyone is seriously suggesting that researchers who inadvertently publish false findings should be ostracized or shunned. On the other hand, it’s not clear why we should continue to celebrate scientists who ‘discover’ interesting effects that later turn out not to replicate. If someone builds a career on the discovery of one or more seemingly important findings, and those findings later turn out to be wrong, the appropriate attitude is to update our beliefs about the merit of that person’s work. As it stands, we rarely seem to do this.

In any case, the bigger problem with appeals to replication is that the delay between initial publication of an exciting finding and subsequent consensus disconfirmation can be very long, and often spans entire careers. Waiting decades for history to prove an influential idea wrong is a very bad idea if the available alternative is to nip the idea in the bud by requiring stronger evidence up front.

There are many notable examples of this in the literature. A well-publicized recent one is John Bargh’s work on the motor effects of priming people with elderly stereotypes–namely, that priming people with words related to old age makes them walk away from the experiment more slowly. Bargh’s original paper was published in 1996, and according to Google Scholar, has now been cited over 2,000 times. It has undoubtedly been hugely influential in directing many psychologists’ research programs in certain directions (in many cases, in directions that are equally counterintuitive and also now seem open to question). And yet it’s taken over 15 years for a consensus to develop that the original effect is at the very least much smaller in magnitude than originally reported, and potentially so small as to be, for all intents and purposes, “not real”. I don’t know who reviewed Bargh’s paper back in 1996, but I suspect that if they ever considered the seemingly implausible size of the effect being reported, they might well have thought to themselves, well, I’m not sure I believe it, but that’s okay–time will tell. Time did tell, of course; but time is kind of lazy, so it took fifteen years for it to tell. In an alternate universe, a reviewer might have said, well, this is a striking finding, but the effect seems implausibly large; I would like you to try to directly replicate it in your lab with a much larger sample first. I recognize that this is onerous and annoying, but my primary responsibility is to ensure that only reliable findings get into the literature, and inconveniencing you seems like a small price to pay. Plus, if the effect is really what you say it is, people will be all the more likely to believe you later on.

Or take the actor-observer asymmetry, which appears in just about every introductory psychology textbook written in the last 20-30 years. It states that people are relatively more likely to attribute their own behavior to situational factors, and relatively more likely to attribute other agents’ behaviors to those agents’ dispositions. When I slip and fall, it’s because the floor was wet; when you slip and fall, it’s because you’re dumb and clumsy. This putative asymmetry was introduced and discussed at length in a book by Jones and Nisbett in 1971, and hundreds of studies have investigated it at this point. And yet a 2006 meta-analysis by Malle suggested that the cumulative evidence for the actor-observer asymmetry is actually very weak. There are some specific circumstances under which you might see something like the postulated effect, but what is quite clear is that it’s nowhere near strong enough an effect to justify being routinely invoked by psychologists and even laypeople to explain individual episodes of behavior. Unfortunately, at this point it’s almost impossible to dislodge the actor-observer asymmetry from the psyche of most researchers–a reality underscored by the fact that the Jones and Nisbett book has been cited nearly 3,000 times, whereas the 2006 meta-analysis has been cited only 96 times (a very low rate for an important and well-executed meta-analysis published in Psychological Bulletin).

The fact that it can take many years–whether 15 or 45–for a literature to build up to the point where we’re even in a position to suggest with any confidence that an initially exciting finding could be wrong means that we should be very hesitant to appeal to long-term replication as an arbiter of truth. Replication may be the gold standard in the very long term, but in the short and medium term, appealing to replication is a huge cop-out. If you can see problems with an analysis right now that cast doubt on a study’s results, it’s an abdication of responsibility to downplay your concerns and wait for someone else to come along and spend a lot more time and money trying to replicate the study. You should point out now why you have concerns. If the authors can address them, the results will look all the better for it. And if the authors can’t address your concerns, well, then, you’ve just done science a service. If it helps, don’t think of it as a matter of saying mean things about someone else’s work, or of asserting your own ego; think of it as potentially preventing a lot of very smart people from wasting a lot of time chasing down garden paths–and also saving a lot of taxpayer money. Remember that our job as scientists is not to make other scientists’ lives easy in the hopes they’ll repay the favor when we submit our own papers; it’s to establish and apply standards that produce convergence on the truth in the shortest amount of time possible.

“But it would hurt my career to be meticulously honest about everything I do!”

Unlike the other considerations listed above, I think the concern that being honest carries a price when it comes to doing research has a good deal of merit to it. Given the aforementioned delay between initial publication and later disconfirmation of findings (which even in the best case is usually longer than the delay between obtaining a tenure-track position and coming up for tenure), researchers have many incentives to emphasize expediency and good story-telling over accuracy, and it would be disingenuous to suggest otherwise. No malevolence or outright fraud is implied here, mind you; the point is just that if you keep second-guessing and double-checking your analyses, or insist on routinely collecting more data than other researchers might think is necessary, you will very often find that results that could have made a bit of a splash given less rigor are actually not particularly interesting upon careful cross-examination. Which means that researchers who have, shall we say, less of a natural inclination to second-guess, double-check, and cross-examine their own work will, to some degree, be more likely to publish results that make a bit of a splash (it would be nice to believe that pre-publication peer review filters out sloppy work, but empirically, it just ain’t so). So this is a classic tragedy of the commons: what’s good for a given individual, career-wise, is clearly bad for the community as a whole.

I wish I had a good solution to this problem, but I don’t think there are any quick fixes. The long-term solution, as many people have observed, is to restructure the incentives governing scientific research in such a way that individual and communal benefits are directly aligned. Unfortunately, that’s easier said than done. I’ve written a lot both in papers (1, 2, 3) and on this blog (see posts linked here) about various ways we might achieve this kind of realignment, but what’s clear is that it will be a long and difficult process. For the foreseeable future, it will continue to be an understandable though highly lamentable defense to say that the cost of maintaining a career in science is that one sometimes has to play the game the same way everyone else plays the game, even if it’s clear that the rules everyone plays by are detrimental to the communal good.

 

Anyway, this may all sound a bit depressing, but I really don’t think it should be taken as such. Personally I’m actually very optimistic about the prospects for large-scale changes in the way we produce and evaluate science within the next few years. I do think we’re going to collectively figure out how to do science in a way that directly rewards people for employing research practices that are maximally beneficial to the scientific community as a whole. But I also think that for this kind of change to take place, we first need to accept that many of the defenses we routinely give for using iffy methodological practices are just not all that compelling.

the seedy underbelly

This is fiction. Science will return shortly.


Cornelius Kipling doesn’t take No for an answer. He usually takes several of them–several No’s strung together in rapid sequence, each one louder and more adamant than the last one.

“No,” I told him over dinner at the Rhubarb Club one foggy evening. “No, no, no. I won’t bankroll your efforts to build a new warp drive.”

“But the last one almost worked,” Kip said pleadingly. “I almost had it down before the hull gave way.”

I conceded that it was a clever idea; everyone before Kip had always thought of warp drives as something you put on spaceships. Kip decided to break the mold by placing one on a hydrofoil. Which, naturally, made the boat too heavy to rise above the surface of the water. In fact, it made the boat too heavy to do anything but sink.

“Admittedly, the sinking thing is a small problem,” he said, as if reading my thoughts. “But I’m working on a way to adjust for the extra weight and get it to rise clear out of the water.”

“Good,” I said. “Because lifting the boat out of the water seems like a pretty important step on the road to getting it to travel through space at light speed.”

“Actually, it’s the only remaining technical hurdle,” said Kip. “Once it’s out of the water, everything’s already taken care of. I’ve got onboard fission reactors for power, and a tentative deal to use the International Space Station for supplies. Virgin Galactic is ready to license the technology as soon as we pull off a successful trial run. And there’s an arrangement with James Cameron’s new asteroid mining company to supply us with fuel as we boldly go where… well, you know.”

“Right,” I said, punching my spoon into my crème brûlée in frustration. The crème brûlée retaliated by splattering itself all over my face and jacket.

“See, this kind of thing wouldn’t happen to you if you invested in my company,” Kip helpfully suggested as he passed me an extra napkin. “You’d have so much money other people would feed you. People with ten or fifteen years of experience wielding dessert spoons.”


After dinner we headed downtown. Kip said there was a new bar called Zygote he wanted to show me.

“Actually, it’s not a new bar per se,” he explained as we were leaving the Rhubarb. “It’s new to me. Turns out it’s been here for several years, but you have to know someone to get in. And that someone has to be willing to sponsor you. They review your biography, look up your criminal record, make sure you’re the kind of person they want at the bar, and so on.”

“Sounds like an arranged marriage.”

“You’re not too far off. When you’re first accepted as a member, you’re supposed to give Zygote a dowry of $2,000.”

“That’s a joke, right?” I asked.

“Yes. There’s no dowry. Just the fee.”

“Two thousand dollars? Really?”

“Well, more like fifty a year. But same principle.”

We walked down the mall in silence. I could feel the insoles of my shoes wrapping themselves around my feet, and I knew they were desperately warning me to get away from Kip while I still had a limited amount of sobriety and dignity left.

“How would anyone manage to keep a place like that secret?” I asked. “Especially on the mall.”

“They hire hit men,” Kip said solemnly.

I suspected he was joking, but couldn’t swear to it. I mean, if you didn’t know Kip, you would probably have thought that the idea of putting a warp drive on a hydrofoil was also a big joke.

Kip led us into one of the alleys off Pearl Street, where he quickly located an unobtrusive metal panel set into the wall just below eye level. The panel opened inwards when we pushed it. Behind the panel, we found a faint smell of old candles and a flight of stairs. At the bottom of the stairs–which turned out to run three stories down–we came to another door. This one didn’t open when we pushed it. Instead, Kip knocked on it three times. Then twice more. Then four times.

“Secret code?” I asked.

“No. Obsessive-compulsive disorder.”

The door swung open.

“Evening, Ashraf,” Kip said to the doorman as we stepped through. Ashraf was a tiny Middle Eastern man, very well dressed. Suede pants, cashmere scarf, fedora on his head. Feather in the fedora. The works. I guess when your bar is located behind a false wall three stories below grade, you don’t really need a lot of muscle to keep the peasants out; you knock them out with panache.

“Welcome to Zygote,” Ashraf said. His bland tone made it clear that, truthfully, he wasn’t at all interested in welcoming anyone anywhere. Which made him exactly the kind of person an establishment like this would want as its doorman.

Inside, the bar was mostly empty. There were twelve or fifteen patrons scattered across various booths and animal-print couches. They all took great care not to make eye contact with us as we entered.

“I have to confess,” I whispered to Kip as we made our way to the bar. “Until about three seconds ago, I didn’t really believe you that this place existed.”

“No worries,” he said. “Until about three seconds ago, it had no idea you existed either.”

He looked around.

“Actually, I’m still not sure it knows you exist,” he added apologetically.

“I feel like I’m giving everyone the flu just by standing here,” I told him.

We took a seat at the end of the bar and motioned to the bartender, who looked to be high on a designer drug chemically related to apathy. She eventually wandered over to us–but not before stopping to inspect the countertop, a stack of coasters with pictures of archaeological sites on them, a rack of brandy snifters, and the water running from the faucet.

“Two mojitos and a strawberry daiquiri,” Kip said when she finally got close enough to yell at.

“Who’s the strawberry daiquiri for,” I asked.

“Me. They’re all for me. Why, did you want a drink too?”

I did, so I ordered the special–a pink cocktail called a Flamingo. Each Flamingo came in a tall Flamingo-shaped glass that couldn’t stand up by itself, so you had to keep holding it until you finished it. Once you were done, you could lay the glass on its side on the counter and watch it leak its remaining pink guts out onto the tile. This act was, I gathered from Kip, a kind of rite of passage at Zygote.

“This is a very fancy place,” I said to no one in particular.

“You should have seen it before the gang fights,” the bartender said before walking back to the snifter rack. I had high hopes she would eventually get around to filling our order.

“Gang fights?”

“Yes,” Kip said. “Gang fights. Used to be big old gang fights in here every other week. They trashed the place several times.”

“It’s like there’s this whole seedy underbelly to Boulder that I never knew existed.”

“Oh, this is nothing. It goes much deeper than this. You haven’t seen the seedy underbelly of this place until you’ve tried to convince a bunch of old money hippies to finance your mass-produced elevator-sized vaporizer. You haven’t squinted into the sun or tasted the shadow of death on your shoulder until you’ve taken on the Bicycle Triads of North Boulder single-file in a dark alley. And you haven’t tried to scratch the dirt off your soul–unsuccessfully, mind you–until you’ve held all-night bargaining sessions with local black hat hacker groups to negotiate the purchase of mission-critical zero-day exploits.”

“Well, that may all be true,” I said. “But I don’t think you’ve done any of those things either.”

I should have known better than to question Kip’s credibility; he spent the next fifteen minutes reminding me of the many times he’d risked his life, liberty, and (nonexistent) fortune fighting to suppress the darkest forces in Northern Colorado in the service of the greater good of mankind.

After that, he launched into his standard routine of trying to get me to buy into the latest round of his inane startup ideas. He told me, in no particular order, about his plans to import, bottle and sell the finest grade Kazakh sand as a replacement for the substandard stuff currently found on American kindergarten sandlots; to run a “reverse tourism” operation that would fly in members of distant cultures to visit disabled would-be travelers in the comfort of their own living rooms (tentative slogan: if the customer can’t come to Muhammad, Muhammad must come to the customer); and to create giant grappling hooks that could pull Australia closer to the West Coast so that Kip could speculate in airline stocks and make billions of dollars once shorter flights inevitably caused Los Angeles-Sydney routes to triple in passenger volume.

I freely confess that my recollection of the finer points of the various revenue enhancement plans Kip proposed that night is not the best. I was a little bit distracted by a woman at the far end of the bar who kept gesturing towards me the whole time Kip was talking. Actually, she wasn’t so much gesturing towards me as gently massaging her neck. But she only did it when I happened to look at her. At one point, she licked her index finger and rubbed it on her neck, giving me a pointed look.

After about forty-five minutes of this, I finally worked up the courage to interrupt Kip’s explanation of how and why the federal government could solve all of America’s economic problems overnight by convincing Balinese children to invest in discarded high school football uniforms.

“Look,” I told him, pointing down to the other side of the bar. “You see? This is why I don’t go to bars any more now that I’m married. Attractive women hit on me, and I hate to disappoint them.”

I raised my left hand and deliberately stroked my wedding band in full view.

The lady at the far end didn’t take the hint. Quite the opposite; she pushed back her bar stool and came over to us.

“Christ,” I whispered.

Kip smirked quietly.

“Hi,” said the woman. “I’m Suzanne.”

“Hi,” I said. “I’m flattered. And also married.”

“I see that. I also see that you have some food in your… neckbeard. It looks like whipped cream. At least I hope that’s what it is. I was trying to let you know from down there, so you could wipe it off without embarrassing yourself any further. But apparently you’d rather embarrass yourself.”

“It’s crème brûlée,” I mumbled.

“Weak,” said Suzanne, turning around. “Very weak.”

After she’d left, I wiped my neck on my sleeve and looked at Kip. He looked back at me with a big grin on his face.

“I don’t suppose the thought crossed your mind, at any point in the last hour, to tell me I had crème brûlée in my beard.”

“You mean your neckbeard?”

“Yes,” I sighed, making a mental note to shave more often. “That.”

“It certainly crossed my mind,” Kip said. “Actually, it crossed my mind several times. But each time it crossed, it just waved hello and kept right on going.”

“You know you’re an asshole, right?”

“Whatever you say, Captain Neckbeard.”

“Alright then,” I sighed. “Let’s get out of here. It’s past my curfew anyway. Do you remember where I left my car?”

“No need,” said Kip, putting on his jacket and clapping his hand to my shoulder. “My hydrofoil’s parked in the Spruce lot around the block. The new warp drive is in. Walk with me and I’ll give you a ride. As long as you don’t mind pushing for the first fifty yards.”

the Neurosynth viewer goes modular and open source

If you’ve visited the Neurosynth website lately, you may have noticed that it looks… the same way it’s always looked. It hasn’t really changed in the last ~20 months, despite the vague promise on the front page that in the next few months, we’re going to do X, Y, Z to improve the functionality. The lack of updates is not by design; it’s because until recently I didn’t have much time to work on Neurosynth. Now that much of my time is committed to the project, things are moving ahead pretty nicely, though the changes behind the scenes aren’t reflected in any user-end improvements yet.

The github repo is now regularly updated and even gets the occasional contribution from someone other than myself; I expect that to ramp up considerably in the coming months. You can already use the code to run your own automated meta-analyses fairly easily; e.g., with everything set up right (follow the Readme and examples in the repo), the following lines of code:

dataset = cPickle.load(open('dataset.pkl', 'rb'))
studies = get_ids_by_expression('memory* &~ (wm|working|episod*)', threshold=0.001)
ma = meta.MetaAnalysis(dataset, studies)
ma.save_results('memory')

…will perform an automated meta-analysis of all studies in the Neurosynth database that use the term ‘memory’ at a frequency of 1 in 1,000 words or greater, but don’t use the terms wm or working, or words that start with ‘episod’ (e.g., episodic). You can perform queries that nest to arbitrary depths, so it’s a pretty powerful engine for quickly generating customized meta-analyses, subject to all of the usual caveats surrounding Neurosynth (i.e., that the underlying data are very noisy, that terms aren’t mental states, etc.).
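For concreteness, the frequency cutoff works roughly like this toy sketch (a deliberate simplification in plain Python, not the actual Neurosynth code; the function name is made up):

```python
# Simplified illustration of the frequency criterion described above
# (NOT the real Neurosynth implementation): a study 'uses' a term if
# the term makes up at least `threshold` of its words.
def uses_term(text, term, threshold=0.001):
    words = text.lower().split()
    return bool(words) and words.count(term.lower()) / len(words) >= threshold

# One occurrence of 'memory' in a 10-word abstract is a frequency of 0.1,
# comfortably above the 1-in-1,000 cutoff:
print(uses_term("working memory tasks recruit prefrontal cortex and "
                "parietal regions reliably", "memory"))
```

The real engine additionally parses nested logical expressions like the one above, but the thresholding idea is the same.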

Anyway, with the core tools coming along, I’ve started to turn back to other elements of the project, starting with the image viewer. Yesterday I pushed the first commit of a new version of the viewer that’s currently on the Neurosynth website. In the next few weeks, this new version will be replacing the current version of the viewer, along with a bunch of other changes to the website.

A live demo of the new viewer is available here. It’s not much to look at right now, but behind the scenes, it’s actually a huge improvement on the old viewer in a number of ways:

  • The code is completely refactored and is all nice and object-oriented now. It’s also in CoffeeScript, which is an alternative and (if you’re coming from a Python or Ruby background) much more readable syntax for JavaScript. The source code is on github and contributions are very much encouraged. Like most scientists, I’m generally loath to share my code publicly because I think it sucks most of the time. But I actually feel pretty good about this code. It’s not good code by any stretch, but I think it rises to the level of ‘mostly sensible’, which is about as much as I can hope for.
  • The viewer now handles multiple layers simultaneously, with the ability to hide and show layers, reorder them by dragging, vary the transparency, assign different color palettes, etc. These features have been staples of offline viewers pretty much since the prehistoric beginnings of fMRI time, but they aren’t available in the current Neurosynth viewer or most other online viewers I’m aware of, so this is a nice addition.
  • The architecture is modular, so that it should be quite easy in future to drop in other alternative views onto the data without having to muck about with the app logic. E.g., adding a 3D WebGL-based view to complement the current 2D slice-based HTML5 canvas approach is on the near-term agenda.
  • The resolution of the viewer is now higher–up from 4 mm to 2 mm (which is the most common native resolution used in packages like SPM and FSL). The original motivation for downsampling to 4 mm in the prior viewer was to keep filesize to a minimum and speed up the initial loading of images. But at some point I realized, hey, we’re living in the 21st century; people have fast internet connections now. So now the files are all in 2 mm resolution, which has the unpleasant effect of increasing file sizes by a factor of about 8, but also has the pleasant effect of making it so that you can actually tell what the hell you’re looking at.
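The factor of about 8 in the last bullet is just voxel arithmetic, since halving the voxel size doubles the sampling density along each of the three spatial axes:

```python
# Going from 4 mm to 2 mm voxels doubles sampling density along each of
# the three spatial axes, so the number of voxels (and roughly the file
# size) grows by a factor of 2**3 = 8.
old_mm, new_mm = 4, 2
factor = (old_mm / new_mm) ** 3
print(factor)  # 8.0
```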

Most importantly, there’s now a clean, and near-complete, separation between the HTML/CSS content and the JavaScript code. Which means that you can now effectively drop the viewer into just about any HTML page with just a few lines of code. So in theory, you can have basically the same viewer you see in the demo just by sticking something like the following into your page:

 viewer = Viewer.get('#layer_list', '.layer_settings');
 viewer.addView('#view_axial', 2);
 viewer.addView('#view_coronal', 1);
 viewer.addView('#view_sagittal', 0);
 viewer.addSlider('opacity', '.slider#opacity', 'horizontal', 'false', 0, 1, 1, 0.05);
 viewer.addSlider('pos-threshold', '.slider#pos-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addSlider('neg-threshold', '.slider#neg-threshold', 'horizontal', 'false', 0, 1, 0, 0.01);
 viewer.addColorSelect('#color_palette');
 viewer.addDataField('voxelValue', '#data_current_value');
 viewer.addDataField('currentCoords', '#data_current_coords');
 viewer.loadImageFromJSON('data/MNI152.json', 'MNI152 2mm', 'gray');
 viewer.loadImageFromJSON('data/emotion_meta.json', 'emotion meta-analysis', 'bright lights');
 viewer.loadImageFromJSON('data/language_meta.json', 'language meta-analysis', 'hot and cold');
 viewer.paint();

Well, okay, there are some other dependencies and styling stuff you’re not seeing. But all of that stuff is included in the example folder here. And of course, you can modify any of the HTML/CSS you see in the example; the whole point is that you can now easily style the viewer however you want it, without having to worry about any of the app logic.

What’s also nice about this is that you can easily pick and choose which of the viewer’s features you want to include in your page; nothing will (or at least, should) break no matter what you do. So, for example, you could decide you only want to display a single view showing only axial slices; or to allow users to manipulate the threshold of layers but not their opacity; or to show the current position of the crosshairs but not the corresponding voxel value; and so on. All you have to do is include or exclude the various addSlider() and addData() lines you see above.

Of course, it wouldn’t be a mediocre open source project if it didn’t have some important limitations I’ve been hiding from you until near the very end of this post (hoping, of course, that you wouldn’t bother to read this far down). The biggest limitation is that the viewer expects images to be in JSON format rather than a binary format like NIFTI or Analyze. This is a temporary headache until I or someone else can find the time and motivation to adapt one of the JavaScript NIFTI readers that are already out there (e.g., Satra Ghosh’s parser for xtk), but for now, if you want to load your own images, you’re going to have to take the extra step of first converting them to JSON. Fortunately, the core Neurosynth Python package has an img_to_json() method in the imageutils module that will read in a NIFTI or Analyze volume and produce a JSON string in the expected format. I’m pretty sure it doesn’t handle orientation properly for some images, though, so don’t be surprised if your images look wonky. (And more importantly, if you fix the orientation issue, please commit your changes to the repo.)

In any case, as long as you’re comfortable with a bit of HTML/CSS/JavaScript hacking, the example/ folder in the github repo has everything you need to drop the viewer into your own pages. If you do use this code internally, please let me know! Partly for my own edification, but mostly because when I write my annual progress reports to the NIH, it’s nice to be able to truthfully say, “hey, look, people are actually using this neat thing we built with taxpayer money.”

several half-truths, and one blatant, unrepentant lie about my recent whereabouts

Apparently time does a thing that is much like flying. Seems like just yesterday I was sitting here in this chair, sipping on martinis, and pleasantly humming old show tunes while cranking out several high-quality blog posts an hour (or, let’s be honest, a mediocre blog post every week or two). But then! Then I got distracted! And blinked! And fell asleep in my chair! And then when I looked up again, 8 months had passed! With no blog posts!

Granted, on the Badness Scale, which runs from 1 to Imminent Apocalypse, this one clocks in at a solid 1.04. But still, eight months is a long time to be gone–about 3,000 internet years. So I figured I’d write a short post about the events of the past eight months before setting about the business of trying (and perhaps failing) to post here more regularly. Also, to keep things interesting, I’ve thrown in one fake bullet. See if you can spot the impostor.

  • I started my own lab! You can tell it’s a completely legitimate scientific operation because it has (a) a fancy new website, (b) other members besides me (some of whom I admittedly had to coerce into ‘joining’), and (c) weekly meetings. (As far as I can tell, these are all the necessary requirements for official labhood.) I decided to call my very legitimate scientific lab the Psychoinformatics Lab. Partly because I like how it sounds, and partly because it’s vaguely descriptive of the research I do. But mostly because it results in a catchy abbreviation: PILab. (It’s pronounced Pieeeeeeeeeee lab–the last 10 e’s are silent.)
  • I’ve been slowly writing and re-writing the Neurosynth codebase. Neurosynth is a thing made out of software that lets neuroimaging researchers very crudely stitch together one giant brain image out of other smaller brain images. It’s kind of like a collage, except that unlike most collages, in this case the sum is usually not more than its parts. In fact, the sum tends to look a lot like its parts. In any case, with some hard work and a very large serving of good luck, I managed to land an R01 grant from the NIH last summer, which will allow me to continue stitching images for a few more years. From my perspective, this is a very good thing, for two reasons. First, because it means I’m not unemployed right now (I’m a big fan of employment, you see); and second, because I’m finding the stitching surprisingly enjoyable. If you enjoy stitching software into brain images, please help out.
  • I published a bunch of papers in 2012, so, according to my CV at least, it was a good year for me professionally. Actually, I think it was a deceptively good year–meaning, I don’t think I did any more work than I did in previous years, but various factors (old projects coming to fruition, a bunch of papers all getting accepted at the same time, etc.) conspired to produce more publications in 2012. This kind of stuff has a tendency to balance out in fairly short order though, so I fully expect to rack up a grand total of zero publications in 2013.
  • I went to Iceland! And England! And France! And Germany! And the Netherlands! And Canada! And Austin, Texas! Plus some other places. I know many people spend a lot of their time on the road and think hopping across various oceans is no big deal, but, well, it is to me, so BACK OFF. Anyway, it’s been nice to have the opportunity to travel more. And to combine business and pleasure. I am not one of those people–I think you call them ‘sane’–who prefer to keep their work life and their personal life cleanly compartmentalized, and try to cram all their work into specific parts of the year and then save a few days or weeks here and there to do nothing but roll around on the beach or ski down frighteningly tall mountains. I find I’m happiest when I get to spend one part of the day giving a talk or meeting with some people to discuss the way the edges of the brain blur when you shake your head, and then another part of the day roaming around De Jordaan asking passers-by, in a stilted Dutch, “where can I find some more of those baby cheeses?”
  • On a more personal note (as the archives of this blog will attest, I have no shame when it comes to publicly divulging embarrassing personal details), my wife and I celebrated our fifth anniversary a few weeks ago. I think this one is called the congratulations, you haven’t killed each other yet! anniversary. Next up: the ten year anniversary, also known as the but seriously, how are you both still alive? decennial. Fortunately we’re not particularly sentimental people, so we celebrated our wooden achievement with some sushi, some sake, and (in lieu of 500 of our closest friends) an early bedtime (no, seriously–we went to bed early; that’s not a euphemism for anything).
  • I contracted a bad case of vampirism while doing some prospecting work in the Yukon last summer. The details are a little bit sketchy, but I have a vague suspicion it happened on that one occasion when I was out gold panning in the middle of the night under a full moon and was brutally attacked by a man-sized bat that bit me several times on the neck. At least, that’s my best guess. But, whatever–now that my disease is in full bloom, it’s not so bad any more. I’ve become mostly nocturnal, and I have to snack on the blood of an unsuspecting undergraduate student once every month or two to keep from wasting away. But it seems like a small price to pay in return for eternal life, superhuman strength, and really pasty skin.
  • Overall, I’m enjoying myself quite a bit. I recently read somewhere that people are, on average, happiest in their 30s. I also recently read somewhere else that people are, on average, least happy in their 30s. I resolve this apparent contradiction by simply opting to believe the first thing, because in my estimation, I am, on average, happiest in my 30s.

Ok, enough self-indulgent rambling. Looking over this list, it wasn’t even a very eventful eight months, so I really have no excuse for dropping the ball on this blogging thing. I will now attempt to resume posting one to two posts a month about brain imaging, correlograms, and schweizel units. This might be a good cue for you to hit the UNSUBSCRIBE button.

unconference in Leipzig! no bathroom breaks!

Südfriedhof von Leipzig [HDR]

Many (most?) regular readers of this blog have probably been to at least one academic conference. Some of you even have the misfortune of attending conferences regularly. And a still-smaller fraction of you scholarly deviants might conceivably even enjoy the freakish experience. You know, that whole thing where you get to roam around the streets of some fancy city for a few days seeing old friends, learning about exciting new scientific findings, and completely ignoring the manuscripts and reviews piling up on your desk in your absence. It’s a loathsome, soul-scorching experience. Unfortunately it’s part of the job description for most scientists, so we shoulder the burden without complaining too loudly to the government agencies that force us to go to these things.

This post, thankfully, isn’t about a conference. In fact, it’s about the opposite of a conference, which is… an UNCONFERENCE. An unconference is a social event type of thing that strips away all of the unpleasant features of a regular conference–you know, the fancy dinners, free drinks, and stimulating conversation–and replaces them with a much more authentic academic experience. An authentic experience in which you spend the bulk of your time situated in a 10′ x 10′ room (3 m x 3 m for non-Imperialists) with 10 – 12 other academics, and no one’s allowed to leave the room, eat anything, or take bathroom breaks until someone in the room comes up with a brilliant discovery and wins a Nobel prize. This lasts for 3 days (plus however long it takes for the Nobel to be awarded), and you pay $1200 for the privilege ($1160 if you’re a post-doc or graduate student). Believe me when I tell you that it’s a life-changing experience.

Okay, I exaggerate a bit. Most of those things aren’t true. Here’s one explanation of what an unconference actually is:

An unconference is a participant-driven meeting. The term “unconference” has been applied, or self-applied, to a wide range of gatherings that try to avoid one or more aspects of a conventional conference, such as high fees, sponsored presentations, and top-down organization. For example, in 2006, CNNMoney applied the term to diverse events including Foo Camp, BarCamp, Bloggercon, and Mashup Camp.

So basically, my description was accurate up until the part where I said there were no bathroom breaks.

Anyway, I’m going somewhere with this, I promise. Specifically, I’m going to Leipzig, Germany! In September! And you should come too!

The happy occasion is Brainhack 2012, an unconference organized by the creative minds over at the Neuro Bureau–coordinators of such fine projects as the Brain Art Competition at OHBM (2012 incarnation going on in Beijing right now!) and the admittedly less memorable CNS 2007 Surplus Brain Yard Sale (guess what–turns out selling human brains out of the back of an unmarked van violates all kinds of New York City ordinances!).

Okay, as you can probably tell, I don’t quite have this event promotion thing down yet. So in the interest of ensuring that more than 3 people actually attend this thing, I’ll just shut up now and paste the official description from the Brainhack website:

The Neuro Bureau is proud to announce the 2012 Brainhack, to be held from September 1-4 at the Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany.

Brainhack 2012 is a unique workshop with the goals of fostering interdisciplinary collaboration and open neuroscience. The structure builds from the concepts of an unconference and a hackathon: The term “unconference” refers to the fact that most of the content will be dynamically created by the participants — a hackathon is an event where participants collaborate intensively on science-related projects.

Participants from all disciplines related to neuroimaging are welcome. Ideal participants span in range from graduate students to professors across any disciplines willing to contribute (e.g., mathematics, computer science, engineering, neuroscience, psychology, psychiatry, neurology, medicine, art, etc.). The primary requirement is a desire to work in close collaborations with researchers outside of your specialization in order to address neuroscience questions that are beyond the expertise of a single discipline.

In all seriousness though, I think this will be a blast, and I’m really looking forward to it. I’m contributing the full Neurosynth dataset as one of the resources participants will have access to (more on that in a later post), and I’m excited to see what we collectively come up with. I bet it’ll be at least three times as awesome as the Surplus Brain Yard Sale–though maybe not quite as lucrative.

p.s. I’ll probably also be in Amsterdam, Paris, and Geneva in late August/early September; if you live in one of these fine places and want to show me around, drop me an email. I’ll buy you lunch! Well, except in Geneva. If you live in Geneva, I won’t buy you lunch, because I can’t afford lunch in Geneva. You’ll buy yourself a nice Swiss lunch made of clockwork and gold, and then maybe I’ll buy you a toothpick.

R, the master troll of statistical languages

Warning: what follows is a somewhat technical discussion of my love-hate relationship with the R statistical language, in which I somehow manage to waste 2,400 words talking about a single line of code. Reader discretion is advised.

I’ve been using R to do most of my statistical analysis for about 7 or 8 years now–ever since I was a newbie grad student and one of the senior grad students in my lab introduced me to it. Despite having spent hundreds (thousands?) of hours in R, I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of Bacon (by which I mean Google) when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.

That said, if I’ve learned one thing about R, it’s that R is all about flexibility: almost any task can be accomplished in a dozen different ways. I don’t mean that in the trivial sense that pretty much any substantive programming problem can be solved in any number of ways in just about any language; I mean that for even very simple and well-defined tasks involving just one or two lines of code there are often many different approaches.

To illustrate, consider the simple task of selecting a column from a data frame (data frames in R are basically just fancy tables). Suppose you have a dataset that looks like this:

In most languages, there would be one standard way of pulling columns out of this table. Just one unambiguous way: if you don’t know it, you won’t be able to work with data at all, so odds are you’re going to learn it pretty quickly. R doesn’t work that way. In R there are many ways to do almost everything, including selecting a column from a data frame (one of the most basic operations imaginable!). Here are four of them:

ice.cream[,1]
ice.cream[,'flavor']
ice.cream$flavor
ice.cream[['flavor']]

I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).

This type of flexibility enables incredibly powerful, terse code once you know R reasonably well; unfortunately, it also makes for an extremely steep learning curve. You might wonder why that would be–after all, at its core, R still lets you do things the way most other languages do them. In the above example, you don’t have to use anything other than the simple index-based approach (i.e., data[,1]), which is the way most other languages that have some kind of data table or matrix object (e.g., MATLAB, Python/NumPy, etc.) would prefer you to do it. So why should the extra flexibility present any problems?

The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–and nothing is more frustrating to a newbie when learning a language than trying to figure out why sometimes people select columns in a data frame by index and other times they select them by name, or why sometimes people refer to named properties with a dollar sign and other times they wrap them in a vector or double square brackets. There are good reasons to have all of these different idioms, but you wouldn’t know that if you’re new to R and your expectation, quite reasonably, is that if two expressions look very different, they should do very different things. The flexibility that experienced R users love is very confusing to a newcomer. Most other languages don’t have that problem, because there’s only one way to do everything (or at least, far fewer ways than in R).

Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all. But I was reminded of the steepness of that initial learning curve the other day while helping my wife use R to do some regression analyses for her thesis. Rather than explaining what she was doing, suffice it to say that she needed to write a function that, among other things, takes a data frame as input and retains only the numeric columns for subsequent analysis. Data frames in R are actually lists under the hood, so they can have mixed types (i.e., you can have string columns and numeric columns and factors all in the same data frame; R lists basically work like hashes or dictionaries in other loosely-typed languages like Python or Ruby). So you can run into problems if you haphazardly try to perform numerical computations on non-numerical columns (e.g., good luck computing the mean of ‘cat’, ‘dog’, and ‘giraffe’), and hence, pre-emptive selection of only the valid numeric columns is required.
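Since the post leans on the analogy between R data frames and Python dictionaries, the mixed-type situation (and the pre-emptive filtering it calls for) can be sketched in plain Python; the column names and values below are purely illustrative:

```python
# A loosely-typed 'data frame' as a dict of columns, mirroring the R
# description above (column names and values are illustrative).
ice_cream = {
    "flavor": ["chocolate", "vanilla", "strawberry"],  # string column
    "rating": [4.5, 3.9, 4.1],                         # numeric column
}

# Pre-emptively keep only the columns on which numerical computations
# (like a mean) actually make sense.
numeric_cols = {name: col for name, col in ice_cream.items()
                if all(isinstance(v, (int, float)) for v in col)}

print(sorted(numeric_cols))  # ['rating']
```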

Now, in most languages (including R), you can solve this problem very easily using a loop. In fact, in many languages, you would have to use an explicit for-loop; there wouldn’t be any other way to do it. In R, you might do it like this*:

numeric_cols = rep(FALSE, ncol(ice.cream))
for (i in 1:ncol(ice.cream)) numeric_cols[i] = is.numeric(ice.cream[,i])

We allocate memory for the result, then loop over each column and check whether or not it’s numeric, saving the result. Once we’ve done that, we can select only the numeric columns from our data frame with ice.cream[,numeric_cols].

This is a perfectly sensible way to solve the problem, and as you can see, it’s not particularly onerous to write out. But of course, no self-respecting R user would write an explicit loop that way, because R provides you with any number of other tools to do the job more efficiently. So instead of saying “just loop over the columns and check if is.numeric() is true for each one,” when my wife asked me how to solve her problem, I cleverly said “use apply(), of course!”

apply() is an incredibly useful built-in function that implicitly loops over one or more margins of a matrix; in theory, you should be able to do the same work as the above two lines of code with just the following one line:

apply(ice.cream, 2, is.numeric)

Here the first argument is the data we’re passing in, the third argument is the function we want to apply to the data (is.numeric()), and the second argument is the margin over which we want to apply that function (1 = rows, 2 = columns, etc.). And just like that, we’ve cut the length of our code in half!

Unfortunately, when my wife tried to use apply(), her script broke. It didn’t break in any obvious way, mind you (i.e., with a crash and an error message); instead, the apply() call returned a perfectly good vector. It’s just that all of the values in that vector were FALSE. Meaning, R had decided that none of the columns in my wife’s data frame were numeric–which was most certainly incorrect. And because the code wasn’t throwing an error, and the apply() call was embedded within a longer function, it wasn’t obvious to my wife–as an R newbie and a novice programmer–what had gone wrong. From her perspective, the regression analyses she was trying to run with lm() were breaking with strange messages. So she spent a couple of hours trying to debug her code before asking me for help.

Anyway, I took a look at the help documentation, and the source of the problem turned out to be the following: apply() only operates on matrices (or higher-dimensional arrays), not on data frames. So when you pass a data frame to apply() as the input, it’s implicitly converted to a matrix. Unfortunately, because matrices can only contain values of one data type, any data frame that has at least one string column will end up being converted to a string (or, in R’s nomenclature, character) matrix. And so when we apply the is.numeric() function to each column of the matrix, the answer is always going to be FALSE, because all of the columns have been converted to character vectors. So apply() is actually doing exactly what it’s supposed to; it just doesn’t deign to tell you that it’s implicitly casting your data frame to a matrix before doing anything else. The upshot is that unless you carefully read the apply() documentation and have a basic understanding of data types (which, if you’ve just started dabbling in R, you may well not), you’re hosed.
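Here’s a minimal reproduction of the problem, using a hypothetical two-column frame as a stand-in for the ice.cream data (the column names are mine, not my wife’s):

```r
# One character column is enough to poison the whole conversion
df <- data.frame(flavor = c("mint", "mocha"),
                 rating = c(4.5, 3.8),
                 stringsAsFactors = FALSE)

mode(as.matrix(df))       # "character" -- every column was coerced
apply(df, 2, is.numeric)  # FALSE FALSE, even though rating is numeric
```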

At this point I could have–and probably should have–thrown in the towel and just suggested to my wife that she use an explicit loop. But that would have dealt a mortal blow to my pride as an experienced-if-not-yet-guru-level R user. So of course I did what any self-respecting programmer does: I went and googled it. And the first thing I came across was the all.is.numeric() function in the Hmisc package which has the following description:

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values.

Perfect! So now the solution to my wife’s problem became this:

library(Hmisc)
apply(ice.cream, 2, all.is.numeric)

…which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution. So I googled some more. And came across a relevant Stack Exchange answer, which had the following simple solution to my wife’s exact problem:

sapply(ice.cream, is.numeric)

You’ll notice that this is virtually identical to the apply() approach that failed. That’s no coincidence; it turns out that sapply() is a close cousin of apply()–technically, a user-friendly wrapper around lapply()–that operates on lists rather than matrices. And since data frames are actually lists of columns, there’s no problem passing in a data frame and iterating over its columns. So just like that, we have an elegant one-line solution to the original problem that doesn’t invoke any explicit loops or third-party packages.
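Here it is in action, again with a hypothetical stand-in for the ice.cream frame; because sapply() sees the columns as-is, the numeric ones stay numeric:

```r
df <- data.frame(flavor = c("mint", "mocha"),
                 rating = c(4.5, 3.8),
                 price  = c(3.0, 3.5),
                 stringsAsFactors = FALSE)

numeric_cols <- sapply(df, is.numeric)  # flavor = FALSE, rating/price = TRUE
df[, numeric_cols]                      # just the two numeric columns
```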

Now, having used apply() a million times, I probably should have known about sapply(). And actually, it turns out I did know about sapply–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart. In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. In other words, in 2012, I’m the kind of experienced R user that you might generously call “not very good at R”, and, less generously, “dumb”.

On the plus side, the end product is undeniably cool, right? There are very few languages in which you could achieve so much functionality so compactly right out of the box. And this isn’t an isolated case; base R includes a zillion high-level functions to do similarly complex things with data in a fraction of the code you’d need to write in most other languages. Once you throw in the thousands of high-quality user-contributed packages, there’s nothing else like it in the world of statistical computing.

Anyway, this inordinately long story does have a point to it, I promise, so let me sum up:

  • If I had just ignored the desire to be efficient and clever, and had told my wife to solve the problem the way she’d solve it in most other languages–with a simple for-loop–it would have taken her a couple of minutes to figure out, and she’d probably never have run into any problems.
  • If I’d known R slightly better, I would have told my wife to use sapply(). This would have taken her 10 seconds and she’d definitely never have run into any problems.
  • BUT: because I knew enough R to be clever but not enough R to avoid being stupid, I created an entirely avoidable problem that consumed a couple of hours of my wife’s time. Of course, now she knows about both apply() and sapply(), so you could argue that in the long run, I’ve probably still saved her time. (I’d say she also learned something about her husband’s stubborn insistence on pretending he knows what he’s doing, but she’s already the world-leading expert on that topic.)

Anyway, this anecdote is basically a microcosm of my entire experience with R. I suspect many other people will relate. Basically what it boils down to is that R gives you a certain amount of rope to work with. If you don’t know what you’re doing at all, you will most likely end up accidentally hanging yourself with that rope. If, on the other hand, you’re a veritable R guru, you will most likely use that rope to tie some really fancy knots, scale tall buildings, fashion yourself a space tuxedo, and, eventually, colonize brave new statistical worlds. For everyone in between novice and guru (e.g., me), using R on a regular basis is a continual exercise in alternately thinking “this is fucking awesome” and banging your head against the wall in frustration at the sheer stupidity (either your own, or that of the people who designed this awful language). But the good news is that the longer you use R, the more of the former and the fewer of the latter experiences you have. And at the end of the day, it’s totally worth it: the language is powerful enough to make you forget all of the weird syntax, strange naming conventions, choking on large datasets, and issues with data type conversions.

Oh, except when your wife is yelling at–er, gently reprimanding–you for wasting several hours of her time on a problem she could have solved herself in 5 minutes if you hadn’t insisted that she do it the idiomatic R way. Then you remember exactly why R is the master troll of statistical languages.

 

 

* R users will probably notice that I use the = operator for assignment instead of the <- operator even though the latter is the officially prescribed way to do it in R (i.e., a <- 2 is favored over a = 2). That’s because these two idioms are interchangeable in all but one (rare) use case, and personally I prefer to avoid extra keystrokes whenever possible. But the fact that you can do even basic assignment in two completely different ways in R drives home the point about how pathologically flexible–and, to a new user, confusing–the language is.
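To make the footnote concrete, here’s the one context where the two operators genuinely diverge (a toy example, not from the post above):

```r
# Top-level assignment: the two operators are interchangeable
a <- 2
b = 2

# Inside a function call they are not. `=` names an argument; `<-`
# assigns in the calling environment and passes the value positionally.
x <- c(5, 1, 3)
median(x = 1:10)   # 5.5; the variable x above is untouched
median(x <- 1:10)  # 5.5 again, but x in the workspace is now 1:10
x                  # 1:10
```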

what I’ve learned from a failed job search

For the last few months, I’ve been getting a steady stream of emails in my inbox that go something like this:

Dear Dr. Yarkoni,

We recently concluded our search for the position of Assistant Grand Poobah of Academic Sciences in the Area of Multidisciplinary Widget Theory. We received over seventy-five thousand applications, most of them from truly exceptional candidates whose expertise and experience would have been welcomed with open arms at any institution of higher learning–or, for that matter, by the governing board of a small planet. After a very careful search process (which most assuredly did not involve a round or two on the golf course every afternoon, and most certainly did not culminate in a wild injection of an arm into a hat filled with balled-up names) we regret to inform you that we are unable to offer you this position. This should not be taken to imply that your intellectual ability or accomplishments are in any way inferior to those of the person who we ultimately did offer the position to (or rather, persons–you see, we actually offered the job to six people before someone accepted it); what we were attempting to optimize, we hope you understand, was not the quality of the candidate we hired, but a mythical thing called ‘fit’ between yourself and ourselves. Or, to put it another way, it’s not you, it’s us.

We wish you all the best in your future endeavors, and rest assured that if we have another opening in future, we will celebrate your reapplication by once again balling your name up and tossing it into a hat along with seventy-five thousand others.

These letters are typically so warm and fuzzy that it’s hard to feel bad about them. I mean, yes, they’re basically telling me I failed at something, but then, how often does anyone ever actually tell me I’m an impressive, accomplished, human being? Never! If every failure in my life was accompanied by this kind of note, I’d be much more willing to try new things. Though, truth be told, I probably wouldn’t try very hard at anything; it would be worth failing in advance just to get this kind of affirmation.

Anyway, the reason I’ve been getting these letters, as you might surmise, is that I’ve been applying for academic jobs. I’ve been doing this for two years now, and will be doing it for a third year in a row in a few months, which I’m pretty sure qualifies me as a world-recognized expert on the process. So in the interest of helping other people achieve the same prowess at failing to secure employment, I’ve decided to share some of the lessons I’ve learned here. This missive comes to you with all of the standard caveats and qualifiers–like, for example, that you should be sitting down when you read this; that you should double-check with people you actually respect to make sure any of this makes sense; and, most importantly, that you’ve completely lost your mind if you try to actually apply any of this ‘knowledge’ to your own personal situation. With that in mind, Here’s What I’ve Learned:

1. The academic job market is really, really, bad. No, seriously. I’ve heard from people at several major research universities that they received anywhere from 150 to 500 applications for individual positions (the latter for open-area positions). Some proportion of these applications come from people who have no real shot at the position, but a huge proportion are from truly exceptional candidates with many, many publications, awards, and glowing letters of recommendation. People who you would think have a bright future in research ahead of them. Except that many of them don’t actually have a bright future in research ahead of them, because these days all of that stuff apparently isn’t enough to land a tenure-track position–and often, isn’t even enough to land an interview.

Okay, to be fair, the situation isn’t quite that bad across the board. For one thing, I was quite selective about my job search this past year. I applied for 22 positions, which may sound like a lot, but there were a lot of ads last year, and I know people with similar backgrounds to mine who applied to 50 – 80 positions and could have expanded their searches still further. So, depending on what kind of position you’re aiming for–particularly if you’re interested in a teaching-heavy position at a small school–the market may actually be quite reasonable at the moment. What I’m talking about here really only applies to people looking for research-intensive positions at major research universities. And specifically, to people looking for jobs primarily in the North American market. I recognize that’s probably a minority of people graduating with PhDs in psychology, but since it’s my blog, you’ll have to live with my peculiar little biases. With that qualifier in mind, I’ll reiterate again: the market sucks right now.

2. I’m not as awesome as I thought I was. Lest you think I’ve suddenly turned humble, let me reassure you that I still think I’m pretty awesome–and I can back that up with hard evidence, because I currently have about 20 emails in my inbox from fancy-pants search committee members telling me what a wonderful, accomplished human being I am. I just don’t think I’m as awesome as I thought I was a year ago. Mind you, I’m not quite so delusional that I expected to have my choice of jobs going in, but I did think I had a decent enough record–twenty-odd publications, some neat projects, a couple of major grant proposals submitted (and one that looks very likely to get funded)–to land at least one or two interviews. I was wrong. Which means I’ve had to take my ego down a peg or two. On balance, that’s probably not a bad thing.

3. It’s hard to get hired without a conventional research program. Although I didn’t get any interviews, I did hear back informally from a couple of places (in addition to those wonderful form letters, I mean), and I’ve had hallway conversations with many people who’ve sat on search committees before. The general feedback has been that my work focuses too much on methods development and not enough on substantive questions. This doesn’t really come as a surprise; back when I was putting together my research statement and application materials, pretty much everyone I talked to strongly advised me to focus on a content area first and play down my methods work, because, they said, no one really hires people who predominantly work on methods–at least in psychology. I thought (and still think) this is excellent advice, and in fact it’s exactly the same advice I give to other people if they make the mistake of asking me for my opinion. But ultimately, I went ahead and marketed myself as a methods person anyway. My reasoning was that I wouldn’t want to show up for a new job having sold myself as a person who does A, B, and C, and then mostly did X, Y, and Z, with only a touch of A thrown in. Or, you know, to put it in more cliched terms, I want people to like me for meeeeeee.

I’m still satisfied with this strategy, even if it ends up costing me a few interviews and a job offer or two (admittedly, this is a bit presumptuous–more likely than not, I wouldn’t have gotten any interviews this time around no matter how I’d framed my application). I do the kind of work I do because I enjoy it and think it’s important; I’m pretty happy where I am, so I don’t feel compelled to–how can I put this diplomatically–fib to search committees. Which isn’t to say that I’m laboring under any illusion that you always have to be completely truthful when applying for jobs; I’m fully aware that selling yourself (er, framing your application around your strengths)–and telling people what they want to hear to some extent–is a natural and reasonable thing to do. So I’m not saying this out of any bitterness or naivete; I’m just explaining why I chose to go the honest route that was unlikely to land me a job as opposed to the slightly less honest route that was very slightly more likely to land me a job.

4. There’s a large element of luck involved in landing an academic job. Or, for that matter, pretty much any other kind of job. I’m not saying it’s all luck, of course; far from it. In practice, a single group of maybe three dozen people seems to end up filling the bulk of interview slots at major research universities in any given year. Which is to say, while the majority of applicants will go without any interviews at all, some people end up with a dozen or more of them. So it’s clearly very far from a random process; in the long run, better candidates are much more likely to get jobs. But for any given job, the odds of getting an interview and/or job offer depend on any number of factors that you have little or no control over: what particular area the department wants to shore up; what courses need to be taught; how your personality meshes with the people who interview you; which candidate a particular search committee member idiosyncratically happens to take a shining to, and so on. Over the last few months, I’ve found it useful to occasionally remind myself of this fact when my inbox doth overfloweth with rejection letters. Of course, there’s a very thin line between justifiably attributing your negative outcomes to bad luck and failing to take responsibility for things that are under your control, so it’s worth using the power of self-serving rationalization sparingly.

 

In any case, those vacuous observations–er, lessons–aside, my plan at this point is still to keep doing essentially the same thing I’ve done the last two years, which consists of (i) putting together what I hope is a strong, if somewhat unconventional, application package; (ii) applying for jobs very selectively–only to places that I think I’d be as happy or happier at than I am in my current position; and (iii) otherwise spending as little of my time as possible thinking about my future employment status, and as much of it as possible concentrating on my research and personal life.

I don’t pretend to think this is a good strategy in general; it’s just what I’ve settled on and am happy with for the moment. But ask me again a year from now and who knows, maybe I’ll be roaming around downtown Boulder fishing quarters out of the creek for lunch money. In the meantime, I hope this rather uneventful report of my rather uneventful job-seeking experience thus far is of some small use to someone else. Oh, and if you’re on a search committee and think you want to offer me a job, I’m happy to negotiate the terms of my employment in the comments below.

Big Pitch or Big Lottery? The unenviable task of evaluating the grant review system

This week’s issue of Science has an interesting article on The Big Pitch–a pilot NSF initiative to determine whether anonymizing proposals and dramatically cutting down their length (from 15 pages to 2) has a substantial impact on the results of the review process. The answer appears to be an unequivocal yes. From the article:

What happens is a lot, according to the first two rounds of the Big Pitch. NSF’s grant reviewers who evaluated short, anonymized proposals picked a largely different set of projects to fund compared with those chosen by reviewers presented with standard, full-length versions of the same proposals.

Not surprisingly, the researchers who did well under the abbreviated format are pretty pleased:

Shirley Taylor, an awardee during the evolution round of the Big Pitch, says a comparison of the reviews she got on the two versions of her proposal convinced her that anonymity had worked in her favor. An associate professor of microbiology at Virginia Commonwealth University in Richmond, Taylor had failed twice to win funding from the National Institutes of Health to study the role of an enzyme in modifying mitochondrial DNA.

Both times, she says, reviewers questioned the validity of her preliminary results because she had few publications to her credit. Some reviews of her full proposal to NSF expressed the same concern. Without a biographical sketch, Taylor says, reviewers of the anonymous proposal could “focus on the novelty of the science, and this is what allowed my proposal to be funded.”

Broadly speaking, there are two ways to interpret the divergent results of the standard and abbreviated review. The charitable interpretation is that the change in format is, in fact, beneficial, inasmuch as it eliminates prior reputation as one source of bias and forces reviewers to focus on the big picture rather than on small methodological details. Of course, as Prof-Like Substance points out in an excellent post, one could mount a pretty reasonable argument that this isn’t necessarily a good thing. After all, a scientist’s past publication record is likely to be a good predictor of their future success, so it’s not clear that proposals should be anonymous when large amounts of money are on the line (and there are other ways to counteract the bias against newbies–e.g., NIH’s approach of explicitly giving New Investigators a payline boost until they get their first R01). And similarly, some scientists might be good at coming up with big ideas that sound plausible at first blush and not so good at actually carrying out the research program required to bring those big ideas to fruition. Still, at the very least, if we’re being charitable, The Big Pitch certainly does seem like a very different kind of approach to review.

The less charitable interpretation is that the reason the ratings of the standard and abbreviated proposals showed very little correlation is that the latter approach is just fundamentally unreliable. If you suppose that it’s just not possible to reliably distinguish a very good proposal from a somewhat good one on the basis of just 2 pages, it makes perfect sense that 2-page and 15-page proposal ratings don’t correlate much–since you’re basically selecting at random in the 2-page case. Understandably, researchers who happen to fare well under the 2-page format are unlikely to see it that way; they’ll probably come up with many plausible-sounding reasons why a shorter format just makes more sense (just like most researchers who tend to do well with the 15-page format probably think it’s the only sensible way for NSF to conduct its business). We humans are all very good at finding self-serving rationalizations for things, after all.

Personally I don’t have very strong feelings about the substantive merits of short versus long-format review–though I guess I do find it hard to believe that 2-page proposals could be ranked very reliably given that some very strange things seem to happen with alarming frequency even with 12- and 15-page proposals. But it’s an empirical question, and I’d love to see relevant data. In principle, the NSF could have obtained that data by having two parallel review panels rate all of the 2-page proposals (or even 4 panels, since one would also like to know how reliable the normal review process is). That would allow the agency to directly quantify the reliability of the ratings by looking at their cross-panel consistency. Absent that kind of data, it’s very hard to know whether the results Science reports on are different because 2-page review emphasizes different (but important) things, or because a rating process based on an extended 2-page abstract just amounts to a glorified lottery.
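For what it’s worth, the “glorified lottery” intuition is easy to make concrete with a quick simulation–a sketch under entirely made-up assumptions (normally distributed proposal quality, additive rating noise, 100 proposals, 13 awards per panel):

```r
# Two hypothetical panels rate the same 100 proposals. Each panel's score
# is the proposal's true quality plus independent rating noise; noise.sd
# controls how (un)reliable a panel's ratings are.
set.seed(1)
n.proposals <- 100
n.funded    <- 13

true.quality <- rnorm(n.proposals)
rate <- function(noise.sd) true.quality + rnorm(n.proposals, sd = noise.sd)

# Proportion of proposals funded by both panels
overlap <- function(noise.sd) {
  a <- order(rate(noise.sd), decreasing = TRUE)[1:n.funded]
  b <- order(rate(noise.sd), decreasing = TRUE)[1:n.funded]
  length(intersect(a, b)) / n.funded
}

mean(replicate(500, overlap(0.2)))  # reliable ratings: panels mostly agree
mean(replicate(500, overlap(3.0)))  # noisy ratings: overlap near chance (13/100)
```

The point of the toy model is just this: if a rating format is reliable, two independent panels using it can’t help but fund largely overlapping sets–so cross-panel consistency would directly tell you whether the 2-page format is measuring something or mostly rolling dice.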

Alternatively, and perhaps more pragmatically, NSF could just wait a few years to see how the projects funded under the pilot program turn out (and I’m guessing this is part of their plan). I.e., do the researchers who do well under the 2-page format end up producing science as good as (or better than) the researchers who do well under the current system? This sounds like a reasonable approach in principle, but the major problem is that we’re only talking about a total of ~25 funded proposals (across two different review panels), so it’s unclear that there will be enough data to draw any firm conclusions. Certainly many scientists (including me) are likely to feel a bit uneasy at the thought that NSF might end up making major decisions about how to allocate billions of dollars on the basis of two dozen grants.

Anyway, skepticism aside, this isn’t really meant as a criticism of NSF so much as an acknowledgment of the fact that the problem in question is a really, really difficult one. The task of continually evaluating and improving the grant review process is not one anyone should want to take on lightly. If time and money were no object, every proposed change (like dramatically shortened proposals) would be extensively tested on a large scale and directly compared to the current approach before being implemented. Unfortunately, flying thousands of scientists to Washington D.C. is a very expensive business (to say nothing of all the surrounding costs), and I imagine that testing out a substantively different kind of review process on a large scale could easily run into the tens of millions of dollars. In a sense, the funding agencies can’t really win. On the one hand, if they only ever pilot new approaches on a small scale, they never get enough empirical data to confidently back major changes in policy. On the other hand, if they pilot new approaches on a large scale and those approaches end up failing to improve on the current system (as is the fate of most innovative new ideas), the funding agencies get hammered by politicians and scientists alike for wasting taxpayer money in an already-harsh funding climate.

I don’t know what the solution is (or if there is one), but if nothing else, I do think it’s a good thing that NSF and NIH continue to actively tinker with their various processes. After all, if there’s anything most researchers can agree on, it’s that the current system is very far from perfect.

A very classy reply from Karl Friston

After writing my last post critiquing Karl Friston’s commentary in NeuroImage, I emailed him the link, figuring he might want the opportunity to respond, and also to make sure he knew my commentary wasn’t intended as a personal attack (I have enormous respect for his seminal contributions to the field of neuroimaging). Here’s his very classy reply (posted with permission):

Many thanks for your kind e-mail and link to your blog. I thought your review and deconstruction of the issues were excellent and I concur with the points that you make.

You are absolutely right that I ignored the use of high (corrected) thresholds when controlling for multiple comparisons – and was focusing on the simple case of a single test. I also agree that, ideally, one would report confidence intervals on effect sizes – indeed the original version of my article concluded with this recommendation (now the last line of appendix 1). I remember – at the inception of SPM – discussing with Andrew Holmes the reporting of confidence intervals using statistical maps – however, the closest we ever got was posterior probability maps (PPM), many years later.

My agenda was probably a bit simpler than you might have supposed – it was to point out that significant p-values from small sample studies are valid and will – on average – detect effects whose sizes are bigger than the equivalent effects with larger sample sizes. I did not mean to imply that large studies are useless – although I do believe that unqualified reports of significant p-values from large sample sizes should be treated with caution. Although my agenda was fairly simple, the issues raised may well require more serious consideration – of the sort that you have offered. I submitted the article as a ‘comments and controversy’, anticipating that it would elicit a thoughtful response of the sort in your blog. If you have not done so already; you could prepare your blog for peer-reviewed submission – perhaps as a response to the ‘comments and controversy’ at NeuroImage?

I will not respond to your blog directly; largely because I have never blogged before and prefer to restrict myself to peer-reviewed formats. However, please feel free to use this e-mail in any way you see fit.

With very best wishes,

Karl

PS: although you may have difficulty believing it – all the critiques I caricatured I have actually seen in one form or another – even the retinotopic mapping critique!

Seeing as my optimistic thought when I sent Friston the link was “I hope he doesn’t eat me alive” (not because he has that kind of reputation, but because, frankly, if someone obnoxiously sent me a link to an abrasive article criticizing my work at length, I might not be very happy either), I was very happy to read that. I wrote back:

Thanks very much for your gracious reply–especially since the tone of my commentary was probably a bit abrasive. If I’m being honest with myself, I’m pretty sure I’d have a hard time setting my ego aside long enough to respond this constructively if someone criticized me like this (no matter how I felt about the substantive issues), so it’s very much appreciated.

I won’t take up any of the substantive issues here, since it sounds like we’re in reasonable agreement on most of the issues. As far as submitting a formal response to NeuroImage, I’d normally be happy to do that, but I’m currently boycotting Elsevier journals as part of the Cost of Knowledge campaign, and feel pretty strongly about that, so I won’t be submitting anything to NeuroImage for the foreseeable future. This isn’t meant as an indictment of the NeuroImage editorial board or staff in any way; it’s strictly out of frustration at Elsevier’s policies and recent actions.

Also, while I like the comments and controversies format at NeuroImage a lot, there’s no question that the process is considerably slower than what online communication affords. The reality is that by the time my comment ever came out (probably in a much abridged form), much of the community will have moved on and lost interest in the issue, and I’ve found in the past that the kind of interactive and rapid engagement it’s possible to get online is very hard to approximate in a print forum. But I can completely understand your hesitation to respond this way; it could quickly become unmanageable. For what it’s worth, I don’t really think blogs are the right medium for this kind of thing in the long term anyway, but until we get publisher-independent evaluation platforms that centralize the debate in one place (which I’m hopeful will happen relatively soon), I think they play a useful role.

Anyway, whatever your opinion of the original commentary and/or my post, I think Friston deserves a lot of credit for his response, which, I’ll just reiterate again, is much more civil and tactful than mine would probably have been in his situation. I can’t think of many cases either in print or online when someone has responded so constructively to criticism.

One other thing I forgot to mention in my reply to Friston, but is worth bringing up here: I think SPM confidence interval maps would be a great idea! It would be fantastic if fMRI analysis packages by default produced 3 effect size maps for every analysis–respectively giving the observed, lower bound, and upper bound estimates of effect size at every voxel. This would naturally discourage researchers from making excessively strong claims (since one imagines almost everyone would at least glance at the lower-bound map) while providing reviewers a very easy way to frame concerns about power and sample size (“can the authors please present the confidence interval maps in the appendix?”). Anyone want to write an SPM plug-in to do this?
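To sketch what that three-map output might look like (purely illustrative–a made-up d map rather than real fMRI output, using the standard large-sample approximation to the standard error of a one-sample Cohen’s d):

```r
# Hypothetical per-voxel effect size (Cohen's d) estimates from a
# one-sample design with n subjects, stored as a 10 x 10 x 10 array
set.seed(1)
n <- 20
d.map <- array(rnorm(1000, mean = 0.5, sd = 0.3), dim = c(10, 10, 10))

# Approximate SE of d, then the three maps: observed, lower, upper
se.map    <- sqrt(1 / n + d.map^2 / (2 * n))
lower.map <- d.map - 1.96 * se.map
upper.map <- d.map + 1.96 * se.map

# The "glance at the lower-bound map" heuristic: what proportion of
# voxels have a 95% CI that excludes zero?
mean(lower.map > 0)
```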

Sixteen is not magic: Comment on Friston (2012)

UPDATE: I’ve posted a very classy email response from Friston here.

In a “comments and controversies” piece published in NeuroImage last week, Karl Friston describes “Ten ironic rules for non-statistical reviewers”. As the title suggests, the piece is presented ironically; Friston frames it as a series of guidelines reviewers can follow in order to ensure successful rejection of any neuroimaging paper. But of course, Friston’s real goal is to convince you that the practices described in the commentary are bad ones, and that reviewers should stop picking on papers for such things as having too little power, not cross-validating results, and not being important enough to warrant publication.

Friston’s piece is, simultaneously, an entertaining satire of some lamentable reviewer practices, and—in my view, at least—a frustratingly misplaced commentary on the relationship between sample size, effect size, and inference in neuroimaging. While it’s easy to laugh at some of the examples Friston gives, many of the positions Friston presents and then skewers aren’t just humorous portrayals of common criticisms; they’re simply bad caricatures of comments that I suspect only a small fraction of reviewers ever make. Moreover, the cures Friston proposes—most notably, the recommendation that sample sizes on the order of 16 to 32 are just fine for neuroimaging studies—are, I’ll argue, much worse than the diseases he diagnoses.

Before taking up the objectionable parts of Friston’s commentary, I’ll just touch on the parts I don’t think are particularly problematic. Of the ten rules Friston discusses, seven seem palatable, if not always helpful:

  • Rule 6 seems reasonable; there does seem to be excessive concern about the violation of assumptions of standard parametric tests. It’s not that this type of thing isn’t worth worrying about at some point, just that there are usually much more egregious things to worry about, and it’s been demonstrated that the most common parametric tests are (relatively) insensitive to violations of normality under realistic conditions.
  • Rule 10 is also on point; given that we know the reliability of peer review is very low, it’s problematic when reviewers make the subjective assertion that a paper just isn’t important enough to be published in such-and-such journal, even as they accept that it’s technically sound. Subjective judgments about importance and innovation should be left to the community to decide. That’s the philosophy espoused by open-access venues like PLoS ONE and Frontiers, and I think it’s a good one.
  • Rules 7 and 9—criticizing a lack of validation or a failure to run certain procedures—aren’t wrong, but seem to me much too broad to support blanket pronouncements. Surely much of the time when reviewers highlight missing procedures, or complain about a lack of validation, there are perfectly good reasons for doing so. I don’t imagine Friston is really suggesting that reviewers should stop asking authors for more information or for additional controls when they think it’s appropriate, so it’s not clear what the point of including this here is. The example Friston gives in Rule 9 (of requesting retinotopic mapping in an olfactory study), while humorous, is so absurd as to be worthless as an indictment of actual reviewer practices. In fact, I suspect it’s so absurd precisely because anything less extreme Friston could have come up with would have caused readers to think, “but wait, that could actually be a reasonable concern…”
  • Rules 1, 2, and 3 seem reasonable as far as they go; it’s just common sense to avoid overconfidence, arguments from emotion, and tardiness. Still, I’m not sure what’s really accomplished by pointing this out; I doubt there are very many reviewers who will read Friston’s commentary and say “you know what, I’m an overconfident, emotional jerk, and I’m always late with my reviews–I never realized this before.” I suspect the people who fit that description—and for all I know, I may be one of them—will be nodding and chuckling along with everyone else.

This leaves Rules 4, 5, and 8, which, conveniently, all focus on a set of interrelated issues surrounding low power, effect size estimation, and sample size. Because Friston’s treatment of these issues strikes me as dangerously wrong, and liable to send a very bad message to the neuroimaging community, I’ve laid out some of these issues in considerably more detail than you might be interested in. If you just want the direct rebuttal, skip to the “Reprising the rules” section below; otherwise the next two sections sketch Friston’s argument for using small sample sizes in fMRI studies, and then describe some of the things wrong with it.

Friston’s argument

Friston’s argument is based on three central claims:

  1. Classical inference (i.e., the null hypothesis testing framework) suffers from a critical flaw, which is that the null is always false: no effects (at least in psychology) are ever truly zero. Collect enough data and you will always end up rejecting the null hypothesis with probability 1.
  2. Researchers care more about large effects than about small ones. In particular, there is some size of effect that any given researcher will call ‘trivial’, below which that researcher is uninterested in the effect.
  3. If the null hypothesis is always false, and if some effects are not worth caring about in practical terms, then researchers who collect very large samples will invariably end up identifying many effects that are statistically significant but completely uninteresting.

I think it would be hard to dispute any of these claims. The first one is the source of persistent statistical criticism of the null hypothesis testing framework, and the second one is self-evidently true (if you doubt it, ask yourself whether you would really care to continue your research if you knew with 100% confidence that all of your effects would never be any larger than one one-thousandth of a standard deviation). The third one follows directly from the first two.

Where Friston’s commentary starts to depart from conventional wisdom is in the implications he thinks these premises have for the sample sizes researchers should use in neuroimaging studies. Specifically, he argues that since large samples will invariably end up identifying trivial effects, whereas small samples will generally only have power to detect large effects, it’s actually in neuroimaging researchers’ best interest not to collect a lot of data. In other words, Friston turns what most commentators have long considered a weakness of fMRI studies—their small sample size—into a virtue.

Here’s how he characterizes an imaginary reviewer’s misguided concern about low power:

Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Friston suggests that the appropriate response from a clever author would be something like the following:

Response: We would like to thank the reviewer for his or her comments on sample size; however, his or her conclusions are statistically misplaced. This is because a significant result (properly controlled for false positives), based on a small sample indicates the treatment effect is actually larger than the equivalent result with a large sample. In short, not only is our result statistically valid. It is quantitatively more significant than the same result with a larger number of subjects.

This is supported by an extensive appendix (written non-ironically), where Friston presents a series of nice sensitivity and classification analyses intended to give the reader an intuitive sense of what different standardized effect sizes mean, and what the implications are for the detection of statistically significant effects using a classical inference (i.e., hypothesis testing) approach. The centerpiece of the appendix is a loss-function analysis where Friston pits the benefit of successfully detecting a large effect (which he defines as a Cohen’s d of 1, i.e., an effect of one standard deviation) against the cost of rejecting the null when the effect is actually trivial (defined as a d of 0.125 or less). Friston notes that the loss function is minimized (i.e., the difference between the hit rate for large effects and the detection rate for trivial effects is maximized) when n = 16, which is where the number he repeatedly quotes as a reasonable sample size for fMRI studies comes from. (Actually, as I discuss in my Appendix I below, I think Friston’s power calculations are off, and the right number, even given his assumptions, is more like 22. But the point is, it’s a small number either way.)
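To make the disagreement concrete, here's one straightforward reading of that loss function, assuming a two-tailed one-sample t-test at p < .05 (Friston's exact formulation may differ in its details):

```python
import numpy as np
from scipy import stats

def power_one_sample(d, n, alpha=0.05):
    # two-tailed one-sample t-test power via the noncentral t distribution
    df, ncp = n - 1, d * np.sqrt(n)
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

# hit rate for 'large' effects (d = 1) minus the rate at which
# 'trivial' effects (d = 0.125) sneak through, across candidate n
ns = np.arange(8, 61)
loss = [power_one_sample(1.0, n) - power_one_sample(0.125, n) for n in ns]
best_n = int(ns[int(np.argmax(loss))])
```

Under these assumptions the optimum comes out somewhat above 16, which is consistent with my recalculation in Appendix I below: the precise number is sensitive to exactly how the test and thresholds are specified.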

It’s important to note that Friston is not shy about asserting his conclusion that small samples are just fine for neuroimaging studies—especially in the Appendices, which are not intended to be ironic. He makes claims like the following:

The first appendix presents an analysis of effect size in classical inference that suggests the optimum sample size for a study is between 16 and 32 subjects. Crucially, this analysis suggests significant results from small samples should be taken more seriously than the equivalent results in oversized studies.

And:

In short, if we wanted to optimise the sensitivity to large effects but not expose ourselves to trivial effects, sixteen subjects would be the optimum number.

And:

In short, if you cannot demonstrate a significant effect with sixteen subjects, it is probably not worth demonstrating.

These are very strong claims delivered with minimal qualification, and given Friston’s influence, could potentially lead many reviewers to discount their own prior concerns about small sample size and low power—which would be disastrous for the field. So I think it’s important to explain exactly why Friston is wrong and why his recommendations regarding sample size shouldn’t be taken seriously.

What’s wrong with the argument

Broadly speaking, there are three problems with Friston’s argument. The first one is that Friston presents the absolute best-case scenario as if it were typical. Specifically, the recommendation that a sample of 16–32 subjects is generally adequate for fMRI studies assumes that fMRI researchers are conducting single-sample t-tests at an uncorrected threshold of p < .05; that they only care about effects on the order of 1 sd in size; and that any effect smaller than d = .125 is trivially small and is to be avoided. If all of this were true, an n of 16 (or rather, 22—see Appendix I below) might be reasonable. But it doesn’t really matter, because if you make even slightly less optimistic assumptions, you end up in a very different place. For example, for a two-sample t-test at p < .001 (a very common scenario in group difference studies), the optimal sample size, according to Friston’s own loss-function analysis, turns out to be 87 per group, or 174 subjects in total.

I discuss the problems with the loss-function analysis in much more detail in Appendix I below; the main point here is that even if you take Friston’s argument at face value, his own numbers put the lie to the notion that a sample size of 16–32 is sufficient for the majority of cases. It flatly isn’t. There’s nothing magic about 16, and it’s very bad advice to suggest that authors should routinely shoot for sample sizes this small when conducting their studies given that Friston’s own analysis would seem to demand a much larger sample size the vast majority of the time.

What about uncertainty?

The second problem is that Friston’s argument entirely ignores the role of uncertainty in drawing inferences about effect sizes. The notion that an effect from a small study is likely to be bigger than one from a larger study is strictly true in the sense that, for any fixed p value, the observed effect size necessarily varies inversely with sample size. But it’s not very helpful, because while the point estimate of a statistically significant effect obtained from a small study will tend to be larger, the uncertainty around that estimate is also greater—and with sample sizes in the neighborhood of 16–20, it will typically be so large as to be nearly worthless. For example, a correlation of r = .75 sounds huge, right? But when that correlation is detected at a threshold of p < .001 in a sample of 16 subjects, the corresponding 99.9% confidence interval is .06–.95—a range so wide as to be almost completely uninformative.
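The r = .75 example is easy to reproduce with a few lines of Python, using the standard Fisher z-transform approximation (the helper name here is my own):

```python
import numpy as np
from scipy import stats

def r_confidence_interval(r, n, conf=0.999):
    # Fisher z-transform confidence interval for a Pearson correlation
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    zcrit = stats.norm.ppf(1 - (1 - conf) / 2)
    return np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)

lo, hi = r_confidence_interval(0.75, 16)  # roughly (.06, .95), as above
```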

Fortunately, what Friston argues small samples can do for us indirectly—namely, establish that effect sizes are big enough to care about—can be done much more directly, simply by looking at the uncertainty associated with our estimates. That’s exactly what confidence intervals are for. If our goal is to ensure that we only end up talking about results big enough to care about, it’s surely better to answer the question “how big is the effect?” by saying, “d = 1.1, with a 95% confidence interval of 0.2 – 2.1” than by saying “well it’s statistically significant at p < .001 in a sample of 16 subjects, so it’s probably pretty big”. In fact, if you take the latter approach, you’ll be wrong quite often, for the simple reason that p values will generally be closer to the statistical threshold with small samples than with big ones. Remember that, by definition, the point at which one is allowed to reject the null hypothesis is also the point at which the relevant confidence interval borders on zero. So it doesn’t really matter whether your sample is small or large; if you only just barely managed to reject the null hypothesis, you cannot possibly be in a good position to conclude that the effect is likely to be a big one.

As far as I can tell, Friston completely ignores the role of uncertainty in his commentary. For example, he gives the following example, which is supposed to convince you that you don’t really need large samples:

Imagine we compared the intelligence quotient (IQ) between the pupils of two schools. When comparing two groups of 800 pupils, we found mean IQs of 107.1 and 108.2, with a difference of 1.1. Given that the standard deviation of IQ is 15, this would be a trivial effect size … In short, although the differential IQ may be extremely significant, it is scientifically uninteresting … Now imagine that your research assistant had the bright idea of comparing the IQ of students who had and had not recently changed schools. On selecting 16 students who had changed schools within the past five years and 16 matched pupils who had not, she found an IQ difference of 11.6, where this medium effect size just reached significance. This example highlights the difference between an uninformed overpowered hypothesis test that gives very significant, but uninformative results and a more mechanistically grounded hypothesis that can only be significant with a meaningful effect size.

But the example highlights no such thing. One is not entitled to conclude, in the latter case, that the true effect must be medium-sized just because it came from a small sample. If the effect only just reached significance, the confidence interval by definition just barely excludes zero, and we can’t say anything meaningful about the size of the effect, but only about its sign (i.e., that it was in the expected direction)—which is (in most cases) not nearly as useful.
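Plugging the numbers from Friston's own IQ example into a standard two-sample t-test (assuming the within-group SD matches the population SD of 15) makes the point concrete:

```python
import numpy as np
from scipy import stats

# Friston's hypothetical: an 11.6-point IQ difference between two groups
# of 16, with a within-group SD of 15 (all figures taken from the example)
n, diff, sd = 16, 11.6, 15.0
se = sd * np.sqrt(2.0 / n)               # SE of the mean difference
df = 2 * n - 2
t = diff / se                            # 'just reaches' significance
p = 2 * stats.t.sf(t, df)
tcrit = stats.t.ppf(0.975, df)
lo, hi = diff - tcrit * se, diff + tcrit * se
```

The difference is indeed significant, but the 95% confidence interval spans everything from under one IQ point to over twenty: the data are equally consistent with a trivial effect and an enormous one.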

In fact, we will generally be in a much worse position with a small sample than a large one, because with a large sample we at least stand a chance of being able to distinguish small effects from large ones. Recall that Friston argues against collecting very large samples for the very reason that they are likely to produce a wealth of statistically-significant-but-trivially-small effects. Well, maybe so, but so what? Why would it be a bad thing to detect trivial effects, so long as we were also in an excellent position to know that those effects were trivial? Nothing about the hypothesis-testing framework commits us to treating all of our statistically significant results as though they’re equally important. If we have a very large sample, and some of our effects have confidence intervals from 0.02 to 0.15 while others have CIs from 0.42 to 0.52, we would be wise to focus most of our attention on the latter rather than the former. At the very least this seems like a more reasonable approach than deliberately collecting samples so small that they will rarely be able to tell us anything meaningful about the size of our effects.

What about the prior?

The third, and arguably biggest, problem with Friston’s argument is that it completely ignores the prior—i.e., the expected distribution of effect sizes across the brain. Friston’s commentary assumes a uniform prior everywhere; for the analysis to go through, one has to believe that trivial effects and very large effects are equally likely to occur. But this is patently absurd; while that might be true in select situations, by and large, we should expect small effects to be much more common than large ones. In a previous commentary (on the Vul et al “voodoo correlations” paper), I discussed several reasons for this; rather than go into detail here, I’ll just summarize them:

  • It’s frankly just not plausible to suppose that effects are really as big as they would have to be in order to support adequately powered analyses with small samples. For example, a correlational analysis with 20 subjects at p < .001 would require a population effect size of r = .77 to have 80% power. If you think it’s plausible that focal activation in a single brain region can explain 60% of the variance in a complex trait like fluid intelligence or extraversion, I have some property under a bridge I’d like you to come by and look at.
  • The low-hanging fruit get picked off first. Back when fMRI was in its infancy in the mid-1990s, people could indeed publish findings based on samples of 4 or 5 subjects. I’m not knocking those studies; they taught us a huge amount about brain function. In fact, it’s precisely because they taught us so much about the brain that researchers can no longer stick 5 people in a scanner and report that doing a working memory task robustly activates the frontal cortex. Nowadays, identifying an interesting effect is more difficult—and if that effect were really enormous, odds are someone would have found it years ago. But this shouldn’t surprise us; neuroimaging is now a relatively mature discipline, and effects on the order of 1 sd or more are extremely rare in most mature fields (for a nice review, see Meyer et al. (2001)).
  • fMRI studies with very large samples invariably seem to report much smaller effects than fMRI studies with small samples. This can only mean one of two things: (a) large studies are done much more poorly than small studies (implausible—if anything, the opposite should be true); or (b) the true effects are actually quite small in both small and large fMRI studies, but they’re inflated by selection bias in small studies, whereas large studies give an accurate estimate of their magnitude (very plausible).
  • Individual differences or between-group analyses, which have much less power than within-subject analyses, tend to report much more sparing activations. Again, this is consistent with the true population effects being on the small side.
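The r = .77 figure in the first bullet above is easy to verify by simulation. This is only a sketch, with a function name of my own invention:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def correlation_power(rho, n, alpha=0.001, n_sims=5000):
    # Monte Carlo power estimate for a Pearson correlation test
    cov = [[1.0, rho], [rho, 1.0]]
    hits = 0
    for _ in range(n_sims):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        if stats.pearsonr(x, y)[1] < alpha:
            hits += 1
    return hits / n_sims

power = correlation_power(0.77, 20)  # close to the 80% quoted above
```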

To be clear, I’m not saying there are never any large effects in fMRI studies. Under the right circumstances, there certainly will be. What I’m saying is that, in the absence of very good reasons to suppose that a particular experimental manipulation is going to produce a large effect, our default assumption should be that the vast majority of (interesting) experimental contrasts are going to produce diffuse and relatively weak effects.

Note that Friston’s assertion that “if one finds a significant effect with a small sample size, it is likely to have been caused by a large effect size” depends entirely on the prior effect size distribution. If the brain maps we look at are actually dominated by truly small effects, then it’s simply not true that a statistically significant effect obtained from a small sample is likely to have been caused by a large effect size. We can see this easily by thinking of a situation in which an experiment has a weak but very diffuse effect on brain activity. Suppose that the entire brain showed ‘trivial’ effects of d = 0.125 in the population, and that there were actually no large effects at all. A one-sample t-test at p < .001 has less than 1% power to detect this effect, so you might suppose, as Friston does, that we could discount the possibility that a significant effect would have come from a trivial effect size. And yet, because a whole-brain analysis typically involves tens of thousands of tests, there’s a very good chance such an analysis will end up identifying statistically significant effects somewhere in the brain. Unfortunately, because the only way to identify a trivial effect with a small sample is to capitalize on chance (Friston discusses this point in his Appendix II, and additional treatments can be found in Ioannidis (2008), or in my 2009 commentary), that tiny effect won’t look tiny when we examine it; it will in all likelihood look enormous.
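A quick simulation makes this winner's curse vivid. Assuming (simplistically, but if anything conservatively) independent voxels, a uniform true effect of d = 0.125 everywhere, and an uncorrected one-sample t-test at p < .001:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 50,000 independent 'voxels', every one with a true effect of d = 0.125,
# scanned in n = 16 subjects and tested at an uncorrected p < .001
n_vox, n_sub, true_d = 50_000, 16, 0.125
data = rng.normal(true_d, 1.0, size=(n_sub, n_vox))
t = data.mean(0) / (data.std(0, ddof=1) / np.sqrt(n_sub))
p = 2 * stats.t.sf(np.abs(t), n_sub - 1)
sig = p < 0.001
observed_d = np.abs(data.mean(0) / data.std(0, ddof=1))[sig]
```

A handful of voxels reach significance, and every one of them sports an observed effect size above d = 1—roughly eight times the true value—purely as a consequence of the selection threshold.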

Since they say a picture is worth a thousand words, here’s one (from an unpublished paper in progress):

The top panel shows you a hypothetical distribution of effects (Pearson’s r) in a 2-dimensional ‘brain’ in the population. Note that there aren’t any astronomically strong effects (though the white circles indicate correlations of .5 or greater, which are certainly very large). The bottom panel shows what happens when you draw random samples of various sizes from the population and use different correction thresholds/approaches. You can see that the conclusion you’d draw if you followed Friston’s advice—i.e., that any effect you observe with n = 20 must be pretty robust to survive correction—is wrong; the isolated region that survives correction at FDR = .05, while ‘real’ in a trivial sense, is not in fact very strong in the true map—it just happens to be grossly inflated by sampling error. This is to be expected; when power is very low but the number of tests you’re performing is very large, the odds are good that you’ll end up identifying some real effect somewhere in the brain–and the estimated effect size within that region will be grossly distorted because of the selection process.

Encouraging people to use small samples is a sure way to ensure that researchers continue to publish highly biased findings that lead other researchers down garden paths trying unsuccessfully to replicate ‘huge’ effects. It may make for an interesting, more publishable story (who wouldn’t rather talk about the single cluster that supports human intelligence than about the complex, highly distributed pattern of relatively weak effects?), but it’s bad science. It’s exactly the same problem geneticists confronted ten or fifteen years ago when the first candidate gene and genome-wide association studies (GWAS) seemed to reveal remarkably strong effects of single genetic variants that subsequently failed to replicate. And it’s the same reason geneticists now run association studies with 10,000+ subjects and not 300.

Unfortunately, the costs of fMRI scanning haven’t come down the same way the costs of genotyping have, so there’s tremendous resistance at present to the idea that we really do need to routinely acquire much larger samples if we want to get a clear picture of how big effects really are. Be that as it may, we shouldn’t indulge in wishful thinking just because of logistical constraints. The fact that it’s difficult to get good estimates doesn’t mean we should pretend our bad estimates are actually good ones.

What’s right with the argument

Having criticized much of Friston’s commentary, I should note that there’s one part I like a lot, and that’s the section on protected inference in Appendix I. The point Friston makes here is that you can still use a standard hypothesis testing approach fruitfully—i.e., without falling prey to the problem of classical inference—so long as you explicitly protect against the possibility of identifying trivial effects. Friston’s treatment is mathematical, but all he’s really saying here is that it makes sense to use non-zero ranges instead of true null hypotheses. I’ve advocated the same approach before (e.g., here), as I’m sure many other people have. The point is simple: if you think an effect of, say, 1/8th of a standard deviation is too small to care about, then you should define a ‘pseudonull’ hypothesis of d = -.125 to .125 instead of a null of exactly zero.

Once you do that, any time you reject the null, you’re now entitled to conclude with reasonable certainty that your effects are in fact non-trivial in size. So I completely agree with Friston when he observes in the conclusion to Appendix I that:

…the adage ‘you can never have enough data’ is also true, provided one takes care to protect against inference on trivial effect sizes – for example using protected inference as described above.

Of course, the reason I agree with it is precisely because it directly contradicts Friston’s dominant recommendation to use small samples. In fact, since rejecting non-zero values is more difficult than rejecting a null of zero, when you actually perform power calculations based on protected inference, it becomes immediately apparent just how inadequate samples on the order of 16–32 subjects will be most of the time (e.g., rejecting a null of zero when detecting an effect of d = 0.5 with 80% power using a one-sample t-test at p < .05 requires 33 subjects, but if you want to reject a ‘trivial’ effect size of d <= |.125|, that n is now upwards of 50).
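The arithmetic in that parenthetical can be checked with the noncentral t distribution. Note that the shifted-null calculation below is only a crude stand-in for proper interval-null protected inference, but it gives the right order of magnitude:

```python
import numpy as np
from scipy import stats

def power_one_sample(d, n, alpha=0.05):
    # two-tailed one-sample t-test power via the noncentral t distribution
    df, ncp = n - 1, d * np.sqrt(n)
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

def required_n(d, power=0.8, alpha=0.05):
    # smallest n reaching the target power (simple integer search)
    n = 4
    while power_one_sample(d, n, alpha) < power:
        n += 1
    return n

n_zero_null = required_n(0.5)          # null of exactly zero: ~33-34
n_protected = required_n(0.5 - 0.125)  # crude shifted-null approximation
```

The protected version lands well past 50 subjects, confirming just how far above the 16–32 range realistic requirements sit.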

Reprising the rules

With the above considerations in mind, we can now turn back to Friston’s rules 4, 5, and 8, and see why his admonitions to reviewers are uncharitable at best and insensible at worst. First, Rule 4 (the under-sampled study). Here’s the kind of comment Friston (ironically) argues reviewers should avoid:

 Reviewer: Unfortunately, this paper cannot be accepted due to the small number of subjects. The significant results reported by the authors are unsafe because the small sample size renders their design insufficiently powered. It may be appropriate to reconsider this work if the authors recruit more subjects.

Perhaps many reviewers make exactly this argument; I haven’t been an editor, so I don’t know (though I can say that I’ve read many reviews of papers I’ve co-reviewed and have never actually seen this particular variant). But even if we give Friston the benefit of the doubt and accept that one shouldn’t question the validity of a finding on the basis of small samples (i.e., we accept that p values mean the same thing in large and small samples), that doesn’t mean the more general critique from low power is itself a bad one. To the contrary, a much better form of the same criticism–and one that I’ve raised frequently myself in my own reviews–is the following:

 Reviewer: the authors draw some very strong conclusions in their Discussion about the implications of their main finding. But their finding issues from a sample of only 16 subjects, and the confidence interval around the effect is consequently very large, and nearly includes zero. In other words, the authors’ findings are entirely consistent with the effect they report actually being very small–quite possibly too small to care about. The authors should either weaken their assertions considerably, or provide additional evidence for the importance of the effect.

Or another closely related one, which I’ve also raised frequently:

 Reviewer: the authors tout their results as evidence that region R is ‘selectively’ activated by task T. However, this claim is based entirely on the fact that region R was the only part of the brain to survive correction for multiple comparisons. Given that the sample size in question is very small, and power to detect all but the very largest effects is consequently very low, the authors are in no position to conclude that the absence of significant effects elsewhere in the brain suggests selectivity in region R. With this small a sample, the authors’ data are entirely consistent with the possibility that many other brain regions are just as strongly activated by task T, but failed to attain significance due to sampling error. The authors should either avoid making any claim that the activity they observed is selective, or provide direct statistical support for their assertion of selectivity.

Neither of these criticisms can be defused by suggesting that effect sizes from smaller samples are likely to be larger than effect sizes from large studies. And it would be disastrous for the field of neuroimaging if Friston’s commentary succeeded in convincing reviewers to stop criticizing studies on the basis of low power. If anything, we collectively need to focus far greater attention on issues surrounding statistical power.

Next, Rule 5 (the over-sampled study):

Reviewer: I would like to commend the authors for studying such a large number of subjects; however, I suspect they have not heard of the fallacy of classical inference. Put simply, when a study is overpowered (with too many subjects), even the smallest treatment effect will appear significant. In this case, although I am sure the population effects reported by the authors are significant; they are probably trivial in quantitative terms. It would have been much more compelling had the authors been able to show a significant effect without resorting to large sample sizes. However, this was not the case and I cannot recommend publication.

I’ve already addressed this above; the problem with this line of reasoning is that nothing says you have to care equally about every statistically significant effect you detect. If you ever run into a reviewer who insists that your sample is overpowered and has consequently produced too many statistically significant effects, you can simply respond like this:

 Response: we appreciate the reviewer’s concern that our sample is potentially overpowered. However, this strikes us as a limitation of classical inference rather than a problem with our study. To the contrary, the benefit of having a large sample is that we are able to focus on effect sizes rather than on rejecting a null hypothesis that we would argue is meaningless to begin with. To this end, we now display a second, more conservative, brain activation map alongside our original one that raises the statistical threshold to the point where the confidence intervals around all surviving voxels exclude effects smaller than d = .125. The reviewer can now rest assured that our results protect against trivial effects. We would also note that this stronger inference would not have been possible if our study had had a much smaller sample.

There is rarely if ever a good reason to criticize authors for having a large sample after it’s already collected. You can always raise the statistical threshold to protect against trivial effects if you need to; what you can’t easily do is magic more data into existence in order to shrink your confidence intervals.
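The threshold-raising maneuver described in the imagined response can be sketched numerically. This uses a normal approximation to the standard error of d, and the function name is hypothetical:

```python
import numpy as np

def min_t_excluding_trivial(n, d_trivial=0.125, z=1.96):
    # smallest one-sample t value whose approximate 95% CI lower bound
    # on Cohen's d still exceeds the 'trivial' cutoff
    for t in np.arange(0.0, 30.0, 0.01):
        d = t / np.sqrt(n)
        se = np.sqrt(1.0 / n + d ** 2 / (2.0 * n))
        if d - z * se > d_trivial:
            return t
    return None

t_small = min_t_excluding_trivial(16)    # demands a very large observed d
t_large = min_t_excluding_trivial(200)   # a far more modest d suffices
```

With n = 16, the observed effect must exceed roughly d = 0.67 before its lower bound clears d = .125; with n = 200, an observed d around 0.27 suffices. This is exactly why a large sample makes the protected claim easier to support, not harder.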

Lastly, Rule 8 (exploiting ‘superstitious’ thinking about effect sizes):

 Reviewer: It appears that the authors are unaware of the dangers of voodoo correlations and double dipping. For example, they report effect sizes based upon data (regions of interest) previously identified as significant in their whole brain analysis. This is not valid and represents a pernicious form of double dipping (biased sampling or non-independence problem). I would urge the authors to read Vul et al. (2009) and Kriegeskorte et al. (2009) and present unbiased estimates of their effect size using independent data or some form of cross validation.

Friston’s recommended response is to point out that concerns about double-dipping are misplaced, because the authors are typically not making any claims that the reported effect size is an accurate representation of the population value, but only following standard best-practice guidelines to include effect size measures alongside p values. This would be a fair recommendation if it were true that reviewers frequently object to the mere act of reporting effect sizes based on the specter of double-dipping; but I simply don’t think this is an accurate characterization. In my experience, the impetus for bringing up double-dipping is almost always one of two things: (a) authors getting overly excited about the magnitude of the effects they have obtained, or (b) authors conducting non-independent tests and treating them as though they were independent (e.g., when identifying an ROI based on a comparison of conditions A and B, and then reporting a comparison of A and C without considering the bias inherent in this second test). Both of these concerns are valid and important, and it’s a very good thing that reviewers bring them up.

The right way to determine sample size

If we can’t rely on blanket recommendations to guide our choice of sample size, then what? Simple: perform a power calculation. There’s no mystery to this; both brief and extended treatises on statistical power are all over the place, and power calculators for most standard statistical tests are available online as well as in most off-line statistical packages (e.g., I use the pwr package for R). For more complicated statistical tests for which analytical solutions aren’t readily available (e.g., fancy interactions involving multiple within- and between-subject variables), you can get reasonably good power estimates through simulation.

Of course, there’s no guarantee you’ll like the answers you get. Actually, in most cases, if you’re honest about the numbers you plug in, you probably won’t like the answer you get. But that’s life; nature doesn’t care about making things convenient for us. If it turns out that it takes 80 subjects to have adequate power to detect the effects we care about and expect, we can (a) suck it up and go for n = 80, (b) decide not to run the study, or (c) accept that logistical constraints mean our study will have less power than we’d like (which implies that any results we obtain will offer only a fractional view of what’s really going on). What we don’t get to do is look the other way and pretend that it’s just fine to go with 16 subjects simply because the last time we did that, we got this amazingly strong, highly selective activation that successfully made it into a good journal. That’s the same logic that repeatedly produced unreplicable candidate gene findings in the 1990s, and, if it continues to go unchecked in fMRI research, risks turning the field into a laughing stock among other scientific disciplines.

Conclusion

The point of all this is not to convince you that it’s impossible to do good fMRI research with just 16 subjects, or that reviewers don’t sometimes say silly things. There are many questions that can be answered with 16 or even fewer subjects, and reviewers most certainly do say silly things (I sometimes cringe when re-reading my own older reviews). The point is that blanket pronouncements, particularly when made ironically and with minimal qualification, are not helpful in advancing the field, and can be very damaging. It simply isn’t true that there’s some magic sample size range like 16 to 32 that researchers can bank on reflexively. If there’s any generalization that we can allow ourselves, it’s probably that, under reasonable assumptions, Friston’s recommendations are much too conservative. Typical effect sizes and analysis procedures will generally require much larger samples than neuroimaging researchers are used to collecting. But again, there’s no substitute for careful case-by-case consideration.

In the natural course of things, there will be cases where n = 4 is enough to detect an effect, and others where the effort is questionable even with 100 subjects; unfortunately, we won’t know which situation we’re in unless we take the time to think carefully and dispassionately about what we’re doing. It would be nice to believe otherwise; certainly, it would make life easier for the neuroimaging community in the short term. But since the point of doing science is to discover what’s true about the world, and not to publish an endless series of findings that sound exciting but don’t replicate, I think we have an obligation to both ourselves and to the taxpayers that fund our research to take the exercise more seriously.


Appendix I: Evaluating Friston’s loss-function analysis

In this appendix I review a number of weaknesses in Friston’s loss-function analysis, and show that under realistic assumptions, the recommendation to use sample sizes of 16–32 subjects is far too optimistic.

First, the numbers don’t seem to be right. I say this with a good deal of hesitation, because I have very poor mathematical skills, and I’m sure Friston is much smarter than I am. That said, I’ve tried several different power packages in R and finally resorted to empirically estimating power with simulated draws, and all approaches converge on numbers quite different from Friston’s. Even the sensitivity plots seem off by a good deal (for instance, Friston’s Figure 3 suggests around 30% sensitivity with n = 80 and d = 0.125, whereas all the sources I’ve consulted produce a value around 20%). In my analysis, the loss function is minimized at n = 22 rather than n = 16. I suspect the problem is with Friston’s approximation, but I’m open to the possibility that I’ve done something very wrong, and confirmations or disconfirmations are welcome in the comments below. In what follows, I’ll report the numbers I get rather than Friston’s (mine are somewhat more pessimistic, but the overarching point doesn’t change either way).

Second, there’s the statistical threshold. Friston’s analysis assumes that all of our tests are conducted without correction for multiple comparisons (i.e., at p < .05), but this clearly doesn’t apply to the vast majority of neuroimaging studies, which are either conducting massive univariate (whole-brain) analyses, or testing at least a few different ROIs or networks. As soon as you lower the threshold, the optimal sample size returned by the loss-function analysis increases dramatically. If the threshold is a still-relatively-liberal (for whole-brain analysis) p < .001, the loss function is now minimized at 48 subjects–hardly a welcome conclusion, and a far cry from 16 subjects. Since this is probably still the modal fMRI threshold, one could argue Friston should have been trumpeting a sample size of 48 all along—not exactly a ‘small’ sample size given the associated costs.

Third, the n = 16 (or 22) figure only holds for the simplest of within-subject tests (e.g., a one-sample t-test)–again, a best-case scenario (though certainly a common one). It doesn’t apply to many other kinds of tests that are the primary focus of a huge proportion of neuroimaging studies–for instance, two-sample t-tests, or interactions between multiple within-subject factors. In fact, if you apply the same analysis to a two-sample t-test (or equivalently, a correlation test), the optimal sample size turns out to be 82 (41 per group) at a threshold of p < .05, and a whopping 174 (87 per group) at a threshold of p < .001. In other words, if we were to follow Friston’s own guidelines, the typical fMRI researcher who aims to conduct a (liberal) whole-brain individual differences analysis should be collecting 174 subjects a pop. For other kinds of tests (e.g., 3-way interactions), even larger samples might be required.

Fourth, the claim that only large effects–i.e., those that can be readily detected with a sample size of 16–are worth worrying about is likely to annoy and perhaps offend any number of researchers who have perfectly good reasons for caring about effects much smaller than half a standard deviation. A cursory look at most literatures suggests that effects of 1 sd are not the norm; they’re actually highly unusual in mature fields. For perspective, the standardized difference in height between genders is about 1.5 sd; the validity of job interviews for predicting success is about .4 sd; and the effect of gender on risk-taking (men take more risks) is about .2 sd—what Friston would call a very small effect (for other examples, see Meyer et al., 2001). Against this backdrop, suggesting that only effects greater than 1 sd (about the strength of the association between height and weight in adults) are of interest would seem to preclude many, and perhaps most, questions that researchers currently use fMRI to address. Imaging genetics studies are immediately out of the picture; so too, in all likelihood, are cognitive training studies, most investigations of individual differences, and pretty much any experimental contrast that claims to very carefully isolate a relatively subtle cognitive difference. Put simply, if the field were to take Friston’s analysis seriously, the majority of its practitioners would have to pack up their bags and go home. Entire domains of inquiry would shutter overnight.

To be fair, Friston briefly considers the possibility that small sample sizes could be important. But he doesn’t seem to take it very seriously:

Can true but trivial effect sizes ever be interesting? It could be that a very small effect size may have important implications for understanding the mechanisms behind a treatment effect – and that one should maximise sensitivity by using large numbers of subjects. The argument against this is that reporting a significant but trivial effect size is equivalent to saying that one can be fairly confident the treatment effect exists but its contribution to the outcome measure is trivial in relation to other unknown effects…

The problem with the latter argument is that the real world is a complicated place, and most interesting phenomena have many causes. A priori, it is reasonable to expect that the vast majority of effects will be small. We probably shouldn’t expect any single genetic variant to account for more than a small fraction of the variation in brain activity, but that doesn’t mean we should give up entirely on imaging genetics. And of course, it’s worth remembering that, in the context of fMRI studies, when Friston talks about ‘very small effect sizes,’ that’s a bit misleading; even medium-sized effects that Friston presumably allows are interesting could be almost impossible to detect at the sample sizes he recommends. For example, a one-sample t-test with n = 16 subjects detects an effect of d = 0.5 only 46% or 5% of the time at p < .05 and p < .001, respectively. Applying Friston’s own loss function analysis to detection of d = 0.5 returns an optimal sample size of n = 63 at p < .05 and n = 139 at p < .001—a message not entirely consistent with the recommendations elsewhere in his commentary.
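The 46% and 5% figures are easy to check by simulation; the sketch below estimates them with a pure-Python Monte-Carlo one-sample t-test, using two-tailed critical t values for df = 15 taken from standard tables (2.131 for p < .05, 4.073 for p < .001):

```python
import random

def sim_power(n, d, t_crit, n_sims=20000, seed=0):
    """Monte-Carlo power of a two-tailed one-sample t-test on N(d, 1) data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        xs = [rng.gauss(d, 1) for _ in range(n)]
        m = sum(xs) / n
        var = sum((x - m) ** 2 for x in xs) / (n - 1)
        if abs(m / (var / n) ** 0.5) > t_crit:
            hits += 1
    return hits / n_sims

# n = 16, d = 0.5, two-tailed critical t for df = 15:
p05 = sim_power(16, 0.5, 2.131)   # p < .05 threshold; roughly 0.46
p001 = sim_power(16, 0.5, 4.073)  # p < .001 threshold; roughly 0.05
print(p05, p001)
```

In other words, at the stricter (and, for whole-brain analyses, more realistic) threshold, a medium effect goes undetected about 95% of the time with 16 subjects.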

Friston, K. (2012). Ten ironic rules for non-statistical reviewers. NeuroImage. DOI: 10.1016/j.neuroimage.2012.04.018