You could be forgiven for thinking that academic psychologists have all suddenly turned into professional whistleblowers. Everywhere you look, interesting new papers are cropping up purporting to describe this or that common-yet-shady methodological practice, and telling us what we can collectively do to solve the problem and improve the quality of the published literature. In just the last year or so, Uri Simonsohn introduced new techniques for detecting fraud, and used those tools to identify at least 3 cases of high-profile, unabashed data forgery. Simmons and colleagues reported simulations demonstrating that standard exploitation of researcher degrees of freedom in analysis can produce extremely high rates of false positive findings. Pashler and colleagues developed a “PsychFileDrawer” repository for tracking replication attempts. Several researchers raised trenchant questions about the veracity and/or magnitude of many high-profile psychological findings such as John Bargh’s famous social priming effects. Wicherts and colleagues showed that authors of psychology articles who are less willing to share their data upon request are more likely to make basic statistical errors in their papers. And so on and so forth. The flood shows no signs of abating; just last week, the APS journal Perspectives on Psychological Science announced that it’s introducing a new “Registered Replication Report” section that will commit to publishing pre-registered high-quality replication attempts, irrespective of their outcome.
Personally, I think these are all very welcome developments for psychological science. They’re solid indications that we psychologists are going to be able to police ourselves successfully in the face of some pretty serious problems, and they bode well for the long-term health of our discipline. My sense is that the majority of other researchers–perhaps the vast majority–share this sentiment. Still, as with any zeitgeist shift, there are always naysayers. In discussing these various developments and initiatives with other people, I’ve found myself arguing, with somewhat surprising frequency, with people who for various reasons think it’s not such a good thing that Uri Simonsohn is trying to catch fraudsters, or that social priming findings are being questioned, or that the consequences of flexible analyses are being exposed. Since many of the arguments I’ve come across tend to recur, I thought I’d summarize the most common ones here–along with the rebuttals I usually offer for why, with one possible exception, the arguments for giving a pass to sloppy-but-common methodological practices are not very compelling.
“But everyone does it, so how bad can it be?”
We typically assume that long-standing conventions must exist for some good reason, so when someone raises doubts about some widespread practice, it’s quite natural to question the person raising the doubts rather than the practice itself. Could it really, truly be (we say) that there’s something deeply strange and misguided about using p values? Is it really possible that the reporting practices converged on by thousands of researchers in tens of thousands of neuroimaging articles might leave something to be desired? Could failing to correct for the many researcher degrees of freedom associated with most datasets really inflate the false positive rate so dramatically?
The answer to all these questions, of course, is yes–or at least, we should allow that it could be yes. It is, in principle, entirely possible for an entire scientific field to regularly do things in a way that isn’t very good. There are domains where appeals to convention or consensus make perfect sense, because there are few good reasons to do things a certain way except inasmuch as other people do them the same way. If everyone else in your country drives on the right side of the road, you may want to consider driving on the right side of the road too. But science is not one of those domains. In science, there is no intrinsic benefit to doing things just for the sake of convention. In fact, almost by definition, major scientific advances are ones that tend to buck convention and suggest things that other researchers may not have considered possible or likely.
In the context of common methodological practice, it’s no defense at all to say but everyone does it this way, because there are usually relatively objective standards by which we can gauge the quality of our methods, and it’s readily apparent that there are many cases where the consensus approach leaves something to be desired. For instance, you can’t really justify failing to correct for multiple comparisons when you report a single test that’s just barely significant at p < .05 on the grounds that nobody else corrects for multiple comparisons in your field. That may be a valid explanation for why your paper successfully got published (i.e., reviewers didn’t want to hold your feet to the fire for something they themselves are guilty of in their own work), but it’s not a valid defense of the actual science. If you run a t-test on randomly generated data 20 times, you will, on average, get a significant result, p < .05, once. It does no one any good to argue that because the convention in a field is to allow uncorrected multiple testing–or to ignore statistical power, or to report only p values and not effect sizes, or to omit mention of conditions that didn’t ‘work’, and so on–it’s okay to ignore the issue. There’s a perfectly reasonable question as to whether it’s a smart career move to start imposing methodological rigor on your work unilaterally (see below), but there’s no question that the mere presence of consensus or convention surrounding a methodological practice does not make that practice okay from a scientific standpoint.
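To make the arithmetic concrete, here’s a minimal simulation sketch (Python with numpy and scipy–the tooling and the particular sample sizes are my choice for illustration, nothing more) that runs batches of 20 t-tests on pure noise and tallies how often something comes out ‘significant’, with and without a Bonferroni correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs = 5_000   # simulated "papers"
n_tests = 20     # independent two-sample t-tests per paper, all on pure noise
alpha = 0.05

hit_counts = np.zeros(n_runs)
any_hit = any_hit_bonferroni = 0

for i in range(n_runs):
    pvals = np.array([
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ])
    hit_counts[i] = (pvals < alpha).sum()
    any_hit += (pvals < alpha).any()                      # uncorrected
    any_hit_bonferroni += (pvals < alpha / n_tests).any() # Bonferroni-corrected

print(f"mean 'significant' results per 20 null tests: {hit_counts.mean():.2f}")   # ~1.0
print(f"runs with at least one hit, uncorrected:      {any_hit / n_runs:.2f}")    # ~0.64
print(f"runs with at least one hit, Bonferroni:       {any_hit_bonferroni / n_runs:.2f}")  # ~0.05
```

With no true effects anywhere, roughly two-thirds of these simulated ‘studies’ still turn up at least one result at p < .05; correcting for the 20 tests brings that back down to about 5%.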
“But psychology would break if we could only report results that were truly predicted a priori!”
This is a defense that has some plausibility at first blush. It’s certainly true that if you force researchers to correct for multiple comparisons properly, and report the many analyses they actually conducted–and not just those that “worked”–a lot of stuff that used to get through the filter will now get caught in the net. So, by definition, it would be harder to detect unexpected effects in one’s data–even when those unexpected effects are, in some sense, ‘real’. But the important thing to keep in mind is that raising the bar for what constitutes a believable finding doesn’t actually prevent researchers from discovering unexpected new effects; all it means is that it becomes harder to report post-hoc results as pre-hoc results. It’s not at all clear why forcing researchers to put in more effort validating their own unexpected finding is a bad thing.
In fact, forcing researchers to go the extra mile in this way would have one exceedingly important benefit for the field as a whole: it would shift the onus of determining whether an unexpected result is plausible enough to warrant pursuing away from the community as a whole, and towards the individual researcher who discovered the result in the first place. As it stands right now, if I discover an unexpected result (p < .05!) that I can make up a compelling story for, there’s a reasonable chance I might be able to get that single result into a short paper in, say, Psychological Science. And reap all the benefits that attend getting a paper into a “high-impact” journal. So in practice there’s very little penalty to publishing questionable results, even if I myself am not entirely (or even mostly) convinced that those results are reliable. This state of affairs is, to put it mildly, not A Good Thing.
In contrast, if you as an editor or reviewer start insisting that I run another study that directly tests and replicates my unexpected finding before you’re willing to publish my result, I now actually have something at stake. Because it takes time and money to run new studies, I’m probably not going to bother to follow up on my unexpected finding unless I really believe it. Which is exactly as it should be: I’m the guy who discovered the effect, and I know about all the corners I have or haven’t cut in order to produce it; so if anyone should make the decision about whether to spend more taxpayer money chasing the result, it should be me. You, as the reviewer, are not in a great position to know how plausible the effect truly is, because you have no idea how many different types of analyses I attempted before I got something to ‘work’, or how many failed studies I ran that I didn’t tell you about. Given the huge asymmetry in information, it seems perfectly reasonable for reviewers to say, You think you have a really cool and unexpected effect that you found a compelling story for? Great; go and directly replicate it yourself and then we’ll talk.
“But mistakes happen, and people could get falsely accused!”
Some people don’t like the idea of a guy like Simonsohn running around and busting people’s data fabrication operations for the simple reason that they worry that the kind of approach Simonsohn used to detect fraud is just not that well-tested, and that if we’re not careful, innocent people could get swept up in the net. I think this concern stems from fundamentally good intentions, but once again, I think it’s also misguided.
For one thing, it’s important to note that, despite all the press, Simonsohn hasn’t actually done anything qualitatively different from what other whistleblowers or skeptics have done in the past. He may have suggested new techniques that improve the efficiency with which cheating can be detected, but it’s not as though he invented the ability to report or investigate other researchers for suspected misconduct. Researchers suspicious of other researchers’ findings have always used qualitatively similar arguments to raise concerns. They’ve said things like, hey, look, this is a pattern of data that just couldn’t arise by chance, or, the numbers are too similar across different conditions.
More to the point, perhaps, no one is seriously suggesting that independent observers shouldn’t be allowed to raise their concerns about possible misconduct with journal editors, professional organizations, and universities. There really isn’t any viable alternative. Naysayers who worry that innocent people might end up ensnared by false accusations presumably aren’t suggesting that we do away with all of the existing mechanisms for ensuring accountability; but since the role of people like Simonsohn is only to raise suspicion and provide evidence (and not to do the actual investigating or firing), it’s clear that there’s no way to regulate this type of behavior even if we wanted to (which I would argue we don’t). If I wanted to spend the rest of my life scanning the statistical minutiae of psychology articles for evidence of misconduct and reporting it to the appropriate authorities (and I can assure you that I most certainly don’t), there would be nothing anyone could do to stop me, nor should there be. Remember that accusing someone of misconduct is something anyone can do, but establishing that misconduct has actually occurred is a serious task that requires careful internal investigation. No one–certainly not Simonsohn–is suggesting that a routine statistical test should be all it takes to end someone’s career. In fact, Simonsohn himself has noted that he identified a 4th case of likely fraud that he dutifully reported to the appropriate authorities only to be met with complete silence. Given all the incentives universities and journals have to look the other way when accusations of fraud are made, I suspect we should be much more concerned about the false negative rate than the false positive rate when it comes to fraud.
“But it hurts the public’s perception of our field!”
Sometimes people argue that even if the field does have some serious methodological problems, we still shouldn’t discuss them publicly, because doing so is likely to instill a somewhat negative view of psychological research in the public at large. The unspoken implication being that, if the public starts to lose confidence in psychology, fewer students will enroll in psychology courses, fewer faculty positions will be created to teach students, and grant funding to psychologists will decrease. So, by airing our dirty laundry in public, we’re only hurting ourselves. I had an email exchange with a well-known researcher to exactly this effect a few years back in the aftermath of the Vul et al “voodoo correlations” paper–a paper I commented on to the effect that the problem was even worse than suggested. The argument my correspondent raised was, in effect, that we (i.e., neuroimaging researchers) are all at the mercy of agencies like NIH to keep us employed, and if it starts to look like we’re clowning around, the unemployment rate for people with PhDs in cognitive neuroscience might start to rise precipitously.
While I obviously wouldn’t want anyone to lose their job or their funding solely because of a change in public perception, I can’t say I’m very sympathetic to this kind of argument. The problem is that it places short-term preservation of the status quo above both the long-term health of the field and the public’s interest. For one thing, I think you have to be quite optimistic to believe that some of the questionable methodological practices that are relatively widespread in psychology (data snooping, selective reporting, etc.) are going to sort themselves out naturally if we just look the other way and let nature run its course. The obvious reason for skepticism in this regard is that many of the same criticisms have been around for decades, and it’s not clear that anything much has improved. Maybe the best example of this is Sedlmeier and Gigerenzer’s 1989 paper entitled “Do studies of statistical power have an effect on the power of studies?”, in which the authors convincingly showed that despite three decades of work by luminaries like Jacob Cohen advocating power analyses, statistical power had not risen appreciably in psychology studies. The presence of such unwelcome demonstrations suggests that sweeping our problems under the rug in the hopes that someone (the mice?) will unobtrusively take care of them for us is wishful thinking.
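To put rough numbers on what ‘underpowered’ means in practice, here’s a small simulation sketch (the per-group n of 20 and the true effect size of d = 0.3 are values I’ve picked purely for illustration, not figures from the Sedlmeier and Gigerenzer paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n_per_group, d, n_sims=10_000, alpha=0.05):
    """Estimate the power of a two-sample t-test by brute-force simulation."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        b = rng.normal(loc=d, scale=1.0, size=n_per_group)  # true effect of size d
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# A small-sample design vs. roughly what it takes to reach ~80% power for d = 0.3
print(simulated_power(n_per_group=20, d=0.3))    # comes out around 0.15
print(simulated_power(n_per_group=175, d=0.3))   # comes out around 0.80
```

In other words, a 20-subjects-per-group study of a smallish effect detects it maybe one time in six or seven; getting to the conventional 80% mark takes nearly an order of magnitude more data.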
In any case, even if problems did tend to solve themselves when hidden away from the prying eyes of the media and public, the bigger problem with what we might call the “saving face” defense is that it is, fundamentally, an abuse of taxpayers’ trust. As with so many other things, Richard Feynman summed up the issue eloquently in his famous “Cargo Cult Science” commencement address:
For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of this work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing–and if they don’t want to support you under those circumstances, then that’s their decision.
The fact of the matter is that our livelihoods as researchers depend directly on the goodwill of the public. And the taxpayers are not funding our research so that we can “discover” interesting-sounding but ultimately unreplicable effects. They’re funding our research so that we can learn more about the human mind and hopefully be able to fix it when it breaks. If a large part of the profession is routinely employing practices that are at odds with those goals, it’s not clear why taxpayers should be footing the bill. From this perspective, it might actually be a good thing for the field to revise its standards, even if (in the worst-case scenario) that causes a short-term contraction in employment.
“But unreliable effects will just fail to replicate, so what’s the big deal?”
This is a surprisingly common defense of sloppy methodology, maybe the single most common one. It’s also an enormous cop-out, since it pre-empts the need to think seriously about what you’re doing in the short term. The idea is that, since no single study is definitive, and a consensus about the reality or magnitude of most effects usually doesn’t develop until many studies have been conducted, it’s reasonable to impose a fairly low bar on initial reports and then wait and see what happens in subsequent replication efforts.
I think this is a nice ideal, but things just don’t seem to work out that way in practice. For one thing, there doesn’t seem to be much of a penalty for publishing high-profile results that later fail to replicate. The reason, I suspect, is that we’re inclined to give researchers the benefit of the doubt: surely (we say to ourselves), Jane Doe did her best, and we like Jane, so why should we question the work she produces? If we’re really so skeptical about her findings, shouldn’t we go replicate them ourselves, or wait for someone else to do it?
While this seems like an agreeable and fair-minded attitude, it isn’t actually a terribly good way to look at things. Granted, if you really did put in your best effort–dotted all your i’s and crossed all your t’s–and still ended up reporting a false result, we shouldn’t punish you for it. I don’t think anyone is seriously suggesting that researchers who inadvertently publish false findings should be ostracized or shunned. On the other hand, it’s not clear why we should continue to celebrate scientists who ‘discover’ interesting effects that later turn out not to replicate. If someone builds a career on the discovery of one or more seemingly important findings, and those findings later turn out to be wrong, the appropriate attitude is to update our beliefs about the merit of that person’s work. As it stands, we rarely seem to do this.
In any case, the bigger problem with appeals to replication is that the delay between initial publication of an exciting finding and subsequent consensus disconfirmation can be very long, and often spans entire careers. Waiting decades for history to prove an influential idea wrong is a very bad idea if the available alternative is to nip the idea in the bud by requiring stronger evidence up front.
There are many notable examples of this in the literature. A well-publicized recent one is John Bargh’s work on the motor effects of priming people with elderly stereotypes–namely, that priming people with words related to old age makes them walk away from the experiment more slowly. Bargh’s original paper was published in 1996, and according to Google Scholar, has now been cited over 2,000 times. It has undoubtedly been hugely influential in directing many psychologists’ research programs in certain directions (in many cases, in directions that are equally counterintuitive and also now seem open to question). And yet it’s taken over 15 years for a consensus to develop that the original effect is at the very least much smaller in magnitude than originally reported, and potentially so small as to be, for all intents and purposes, “not real”. I don’t know who reviewed Bargh’s paper back in 1996, but I suspect that if they ever considered the seemingly implausible size of the effect being reported, they might well have thought to themselves, well, I’m not sure I believe it, but that’s okay–time will tell. Time did tell, of course; but time is kind of lazy, so it took fifteen years for it to tell. In an alternate universe, a reviewer might have said, well, this is a striking finding, but the effect seems implausibly large; I would like you to try to directly replicate it in your lab with a much larger sample first. I recognize that this is onerous and annoying, but my primary responsibility is to ensure that only reliable findings get into the literature, and inconveniencing you seems like a small price to pay. Plus, if the effect is really what you say it is, people will be all the more likely to believe you later on.
Or take the actor-observer asymmetry, which appears in just about every introductory psychology textbook written in the last 20–30 years. It states that people are relatively more likely to attribute their own behavior to situational factors, and relatively more likely to attribute other agents’ behaviors to those agents’ dispositions. When I slip and fall, it’s because the floor was wet; when you slip and fall, it’s because you’re dumb and clumsy. This putative asymmetry was introduced and discussed at length in a book by Jones and Nisbett in 1971, and hundreds of studies have investigated it at this point. And yet a 2006 meta-analysis by Malle suggested that the cumulative evidence for the actor-observer asymmetry is actually very weak. There are some specific circumstances under which you might see something like the postulated effect, but what is quite clear is that it’s nowhere near a strong enough effect to justify being routinely invoked by psychologists and even laypeople to explain individual episodes of behavior. Unfortunately, at this point it’s almost impossible to dislodge the actor-observer asymmetry from the psyche of most researchers–a reality underscored by the fact that the Jones and Nisbett book has been cited nearly 3,000 times, whereas the 2006 meta-analysis has been cited only 96 times (a very low rate for an important and well-executed meta-analysis published in Psychological Bulletin).
The fact that it can take many years–whether 15 or 45–for a literature to build up to the point where we’re even in a position to suggest with any confidence that an initially exciting finding could be wrong means that we should be very hesitant to appeal to long-term replication as an arbiter of truth. Replication may be the gold standard in the very long term, but in the short and medium term, appealing to replication is a huge cop-out. If you can see problems with an analysis right now that cast aspersions on a study’s results, it’s an abdication of responsibility to downplay your concerns and wait for someone else to come along and spend a lot more time and money trying to replicate the study. You should point out now why you have concerns. If the authors can address them, the results will look all the better for it. And if the authors can’t address your concerns, well, then, you’ve just done science a service. If it helps, don’t think of it as a matter of saying mean things about someone else’s work, or of asserting your own ego; think of it as potentially preventing a lot of very smart people from wasting a lot of time chasing down garden paths–and also saving a lot of taxpayer money. Remember that our job as scientists is not to make other scientists’ lives easy in the hopes they’ll repay the favor when we submit our own papers; it’s to establish and apply standards that produce convergence on the truth in the shortest amount of time possible.
“But it would hurt my career to be meticulously honest about everything I do!”
Unlike the other considerations listed above, I think the concern that being honest carries a price when it comes to doing research has a good deal of merit to it. Given the aforementioned delay between initial publication and later disconfirmation of findings (which even in the best case is usually longer than the delay between obtaining a tenure-track position and coming up for tenure), researchers have many incentives to emphasize expediency and good story-telling over accuracy, and it would be disingenuous to suggest otherwise. No malevolence or outright fraud is implied here, mind you; the point is just that if you keep second-guessing and double-checking your analyses, or insist on routinely collecting more data than other researchers might think is necessary, you will very often find that results that could have made a bit of a splash given less rigor are actually not particularly interesting upon careful cross-examination. Which means that researchers who have, shall we say, less of a natural inclination to second-guess, double-check, and cross-examine their own work will, to some degree, be more likely to publish results that make a bit of a splash (it would be nice to believe that pre-publication peer review filters out sloppy work, but empirically, it just ain’t so). So this is a classic tragedy of the commons: what’s good for a given individual, career-wise, is clearly bad for the community as a whole.
I wish I had a good solution to this problem, but I don’t think there are any quick fixes. The long-term solution, as many people have observed, is to restructure the incentives governing scientific research in such a way that individual and communal benefits are directly aligned. Unfortunately, that’s easier said than done. I’ve written a lot both in papers (1, 2, 3) and on this blog (see posts linked here) about various ways we might achieve this kind of realignment, but what’s clear is that it will be a long and difficult process. For the foreseeable future, it will continue to be an understandable though highly lamentable defense to say that the cost of maintaining a career in science is that one sometimes has to play the game the same way everyone else plays the game, even if it’s clear that the rules everyone plays by are detrimental to the communal good.
Anyway, this may all sound a bit depressing, but I really don’t think it should be taken as such. Personally I’m actually very optimistic about the prospects for large-scale changes in the way we produce and evaluate science within the next few years. I do think we’re going to collectively figure out how to do science in a way that directly rewards people for employing research practices that are maximally beneficial to the scientific community as a whole. But I also think that for this kind of change to take place, we first need to accept that many of the defenses we routinely give for using iffy methodological practices are just not all that compelling.
Great post. But I’m a little more optimistic about the future –
1) I think reform will happen within the foreseeable future. Already there’s lots of debate and some concrete moves towards pre-registration & replication, these are small steps so far but they’re gathering momentum. At some point it will reach The Tipping Point and take off exponentially. My guess is that this will be when a major journal announces that they’re going to require pre-registration & pre-peer review and will no longer accept manuscripts submitted the old fashioned way… which I think will happen within about 5 years…
2) Individually, one can unilaterally adopt the new approach to publishing (e.g. like this: http://blogs.discovermagazine.com/neuroskeptic/2013/02/03/unilaterally-raising-the-scientific-standard/#.UUBNhDdjp9U) & this could have concrete advantages.
e.g. if you were planning an experiment that you suspect would have controversial results & might struggle to be published, you could publish the protocol ahead of time to prove that your results weren’t the result of bad practice.
P.S. A good example I think comes from clinical trials, where the movement to get them all pre-registered took a while to get going but once they got enough of the big players on board, it all happened very quickly (although they’re still trying to mop up pockets of resistance to publishing all of the raw data.)
I was going to call myself a naysayer but I suppose I mostly agree with you here. I am long term optimistic. However I think the ability of some of these ideas to fix the field is exaggerated by some very enthusiastic people. I tend to be of the “we’ll see” variety.
You have to get to the root of things to make a substantial change. If there’s any group of people that could do that collectively it is the editors of journals. If the editors demand more conservative stats, replications or anything, people will do it. If they demand sexy, counter-intuitive results, then that is what will be submitted.
There’s a lot here I agree with, but here’s one thing where I strongly disagree regarding a priori hypotheses:
“In contrast, if you as an editor or reviewer start insisting that I run another study that directly tests and replicates my unexpected finding before you’re willing to publish my result, I now actually have something at stake.”
Insisting on a priori hypotheses is one thing. Requiring only the publication of a priori hypotheses that were true is another. I think it is perfectly acceptable to publish a statement that X was not an a priori hypothesis, but it was found anyway. This allows the readers to properly understand the context of the discovery and decide whether or not it is worth trying to replicate. If the authors think it’s worth exploring further, they can run a second study and potentially get a second publication (ideally they’d publish successful & failed replication attempts). Publishing every incidental finding as a solid fact is problematic, but keeping all incidental findings out of the published literature seems wasteful.
Wow. This is an excellent summary of the last year or two’s worth of criticism and anxiety over methods in psychology. That said, I think there’s more to it than just defending the status quo? I myself am all aboard the open source/access/methods/technology will save us all train, but I still have some misgivings about the zeal with which many are pushing for radical reform. For instance, I read and enjoy Neuroskeptic’s posts, but neuroskeptic would have us pre-register our own mothers if they could!
I may sound old and conservative by comparison, but maybe the change should be, if not slow, then sluggish. Technology enables us to make massive changes to the current publishing paradigm overnight and I think many of the proposed changes (open data, open access, open code, even pre-registration!) are good, but it is the overnight part that worries me. I think the changes that should happen, should happen in lockstep with our students and young scientists being re-trained. A lot of bad methods are entrenched, not out of maliciousness but because that is what our mentors teach. Many were raised with instructions like Bem’s article on how to write a scientific paper, which champions much of what the field is now being told is bad methods.

At any rate, I agree that the truth is not optional, but the way we get to a better science will say a lot about us. We can do it by pitchfork and twitter lynch mobs as we currently are trending towards, or our societies and journals and, most importantly, our mentors, can get their act together and begin a systematic campaign to teach all psychologists that the status quo is, to various degrees, broken. I think this last bit is starting to happen. The recent issue in Perspectives is a good start, and some societies have had symposia on various related topics, but the pace needs to be accelerated. Right now, it is largely preaching to the converted and a few interested bystanders, but for the majority of psychologists, who I would posit do not follow blogs/twitter and Ed Yong 😉 these issues aren’t yet fully appreciated.

In the past two years I’ve been repeatedly surprised by how many graduate students I’ve spoken to have not heard of Simonsohn, or experimenter degrees of freedom, or the various replication initiatives. Of those who have, I’ve seen two reactions. In some cases they see it as news; interesting, but not something that has any bearing on their own discipline or methods. The “it’s not my status quo” group. And another reaction which is something akin to resentment. Bitterness that a bunch of tenured folk (whether they are or not) who have probably got to where they are by benefiting from the bad methods they now decry (whether they did or not) are now telling them that the way they are being taught to write papers and analyze data is broken. This last reaction is misguided, but I think it speaks to the fact that there is going to be some inertia to improving the field that stems from the perceived unfairness of upping the bar for those who’ve just entered as more senior folk (and methods folk who may not be experimentalists themselves) are perceived to be pulling up the statistical ladder behind them.
So I don’t know. None of this speaks directly to your point that the truth is not optional. Rather it speaks to a fear I have that the field is going to improve not by better training, but by public lynching. Such has happened with social priming. Rather than begin the discussion on replication of social priming effects in a manner befitting a scientific debate, it instantly descended into allusions to Clever Hans and, for those who followed it on twitter, a rather worrisome level of personal vitriol directed at Bargh.
One last thought about replications: Isn’t it necessarily the case, at least in so far as direct replications are concerned, that you will have much more power pooling all the data than replicating it and running the same test twice? In which case replication isn’t quite the panacea it’s made out to be, and instead (as you have argued before) appropriate sample sizes for adequate statistical power should be given more credit. More credit even than direct replications.
While statistical shenanigans are a big problem in neuroimaging, there are other big problems which are just as big (if not bigger), notably systematic error and the wishful-thinking-based corrections to that error that are commonly applied in neuroimaging. And then there are the situations in which the systematic errors are not corrected or appreciated because the researcher simply does not understand the technology she is using or because, worse yet, she is wilfully blind to problems with the technology.
” If you run a t-test on randomly generated data 20 times, you will, on average, get a significant result, p < .05, once."
Typically 40 times for the two-tailed test. 38/40 will be within the CI, 1/40 will be outside the CI _but in the wrong direction and also dismissed_ and another 1/40 will be a false positive.
Neuroskeptic, I said it was an optimistic post, dammit! :p
Personally I’m kind of skeptical that preregistration will work on a large scale, though I think it’s absolutely critical for large, expensive studies. I guess my own money is on post-publication evaluation platforms as the thing that really produces rapid and lasting improvement in quality. But if preregistration is what turns out to work, I’ll gladly support (and engage in) it.
bsci, I completely agree with that. I’m not saying no one should publish anything that wasn’t predicted a priori, I’m saying that the degree of evidence one expects should be proportional to the strength of the claim. If you want to publish an unexpected incidental finding without making too much of it, great; but if you have a finding that seems highly implausible a priori, I think reviewers should feel comfortable asking for more data in cases where the sample is small and/or the methods seem clearly exploratory.
Psychoapocalypse, I’m not sure that’s really a fair characterization of the social priming saga. The way I recall it (and others can jump in) is that the Doyen et al paper in PLoS ONE got a fair bit of press, but the tone was pretty civil. Then Bargh responded to the paper with 2 highly unprofessional and extremely negative blog posts that, by pretty much everyone’s account, were uncalled for and exemplary of how not to respond to criticism. That was when the backlash against Bargh (and, by extension, social priming to some extent) started. But I still think the overall tone is pretty constructive in that most people are arguing that these effects have been substantially overstated, and not that there’s no such thing as social priming and that we’ve all been hoodwinked.
Regarding pooling vs. independent samples, you’re right that pooling is more powerful. The problem is that pooling isn’t really a kosher thing to do if you’ve already collected and analyzed some data, since you’re already capitalizing on the fact you have a known effect. In other words, if you told me prior to conducting your experiment that you were planning to collect data from 50 subjects, and then try to replicate the same effect in another 50, I would tell you to just lump the two groups into one study of 100 subjects. But if you tell me you’ve already run 50 subjects and obtained a massive effect, now the situation is quite different because the decision to acquire more data is already conditional on having obtained that effect, so it’s no longer independent.
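A toy simulation makes the point (this is just an illustrative sketch–a one-sample t-test on pure noise, with 50 subjects collected up front and 50 more added only when the first batch comes out significant; the design and numbers are mine, chosen for simplicity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_sims = 0.05, 20_000

initial_hits = pooled_hits = 0
for _ in range(n_sims):
    first = rng.normal(size=50)                    # null data: the true effect is zero
    if stats.ttest_1samp(first, 0.0).pvalue < alpha:
        initial_hits += 1
        second = rng.normal(size=50)               # 50 more subjects, collected only
        pooled = np.concatenate([first, second])   # because the first batch 'worked'
        if stats.ttest_1samp(pooled, 0.0).pvalue < alpha:
            pooled_hits += 1

print(initial_hits / n_sims)       # ~0.05, as the nominal alpha promises
print(pooled_hits / initial_hits)  # on the order of 0.3: the pooled 'study of 100'
                                   # stays significant far more often than 5%, because
                                   # adding data was conditional on the initial hit
```

The pooled analysis looks like a study of 100 subjects, but because the second half was only ever collected when the first half ‘worked’, it inherits the fluke rather than testing it.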
Anton, that was a general point about capitalizing on chance, and wasn’t specific to replication attempts (i.e., no commitment to a particular sign was implied), but your point stands in the case of a replication attempt.
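For anyone who wants to check the directional arithmetic, here’s a quick sketch (again purely illustrative–two-tailed one-sample t-tests on null data, arbitrarily treating a positive effect as the ‘predicted’ direction):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims = 40_000

right_direction = wrong_direction = 0
for _ in range(n_sims):
    res = stats.ttest_1samp(rng.normal(size=30), 0.0)   # null data
    if res.pvalue < 0.05:
        if res.statistic > 0:
            right_direction += 1    # 'significant' and in the predicted direction
        else:
            wrong_direction += 1    # 'significant' but the wrong way round

print(right_direction / n_sims)   # ~0.025, i.e. about 1 in 40
print(wrong_direction / n_sims)   # ~0.025; together they make up the 1 in 20
```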
I agree with the sentiment that the degree of evidence should be proportional to the strength of the claim, but I don’t like the implication that we become slaves to an arbitrary p-value. As long as journals only publish p < .05, we will see ad hoc explanations presented as a priori, and failure to correctly adjust for multiple comparisons.
Rather than advocating for stricter p-value enforcement, we should be asking scientists to provide confidence intervals or estimates of effect size.
Rich, if you read this blog regularly you’ll know that I’m the last person who’d endorse slavish adherence to p-values. The point I’m making is, if anything, the opposite: the fact that a paper reports an interesting effect at p < .05 is *not* a sufficient reason to accept the paper, and reviewers should often feel free to ask for more data if the effect seems wildly implausible. The point holds whether you're taking p values, effect sizes, or confidence intervals as your metric of plausibility, and I'm with you in thinking we should privilege the latter.
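For what it’s worth, reporting an effect size and interval alongside the p value is cheap to do. Here’s a minimal sketch with made-up two-group data (Cohen’s d with a pooled SD, plus a simple percentile-bootstrap CI on the mean difference–the data and numbers are fabricated for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Two made-up groups standing in for real data (n = 40 each, true d = 0.4)
a = rng.normal(loc=0.0, scale=1.0, size=40)
b = rng.normal(loc=0.4, scale=1.0, size=40)

# Cohen's d using a pooled standard deviation (equal group sizes)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd

# 95% percentile-bootstrap CI on the raw mean difference
boot_diffs = [
    rng.choice(b, size=b.size, replace=True).mean()
    - rng.choice(a, size=a.size, replace=True).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot_diffs, [2.5, 97.5])

print(f"p = {stats.ttest_ind(a, b).pvalue:.3f}")
print(f"Cohen's d = {d:.2f}, 95% CI on mean difference = [{lo:.2f}, {hi:.2f}]")
```

A reviewer looking at the interval and the standardized effect gets a much better sense of how plausible and how precise a claim is than a bare p < .05 ever conveys.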