the ‘decline effect’ doesn’t work that way

Over the last four or five years, there’s been a growing awareness in the scientific community that science is an imperfect process. Not that everyone used to think science was a crystal ball with a direct line to the universe or anything, but there does seem to be a growing recognition that scientists are human beings with human flaws, and are susceptible to common biases that can make it more difficult to fully trust any single finding reported in the literature. For instance, scientists like interesting results more than boring results; we’d rather keep our jobs than lose them; and we have a tendency to see what we want to see, even when it’s only sort-of-kind-of there, and sometimes not there at all. All of these things conspire to produce systematic biases in the kinds of findings that get reported.

The single biggest contributor to the zeitgeist shift is undoubtedly John Ioannidis (recently profiled in an excellent Atlantic article), whose work I can’t say enough good things about (though I’ve tried). But lots of other people have had a hand in popularizing the same or similar ideas–many of which actually go back several decades. I’ve written a bit about these issues myself in a number of papers (1, 2, 3) and blog posts (1, 2, 3, 4, 5), so I’m partial to such concerns. Still, important as the role of the various selection and publication biases is in charting the course of science, virtually all of the discussions of these issues have had a relatively limited audience. Even Ioannidis’ work, influential as it’s been, has probably been read by no more than a few thousand scientists.

Last week, the debate hit the mainstream when the New Yorker (circulation: ~ 1 million) published an article by Jonah Lehrer suggesting–or at least strongly raising the possibility–that something might be wrong with the scientific method. The full article is behind a paywall, but I can helpfully tell you that some people seem to have un-paywalled it against the New Yorker’s wishes, so if you search for it online, you will find it.

The crux of Lehrer’s argument is that many, and perhaps most, scientific findings fall prey to something called the “decline effect”: initial positive reports of relatively large effects are subsequently followed by gradually decreasing effect sizes, in some cases culminating in a complete absence of an effect in the largest, most recent studies. Lehrer gives a number of colorful anecdotes illustrating this process, and ends on a decidedly skeptical (and frankly, terribly misleading) note:

The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that’s often not the case. Just because an idea is true doesn’t mean it can be proved. And just because an idea can be proved doesn’t mean it’s true. When the experiments are done, we still have to choose what to believe.

While Lehrer’s article received pretty positive reviews from many non-scientist bloggers (many of whom, dismayingly, seemed to think the take-home message was that since scientists always change their minds, we shouldn’t trust anything they say), science bloggers were generally not very happy with it. Within days, angry mobs of Scientopians and Nature Networkers started murdering unicorns; by the end of the week, the New Yorker offices were reduced to rubble, and the scientists and statisticians who’d given Lehrer quotes were all rumored to be in hiding.

Okay, none of that happened. I’m just trying to keep things interesting. Anyway, because I’ve been characteristically slow on the uptake, by the time I got around to writing this post you’re now reading, about eight hundred and sixty thousand bloggers had already weighed in on Lehrer’s article. That’s good, because it means I can just direct you to other people’s blogs instead of having to do any thinking myself. So here you go: good posts by Games With Words (whose post tipped me off to the article), Jerry Coyne, Steven Novella, Charlie Petit, and Andrew Gelman, among many others.

Since I’ve blogged about these issues before, and agree with most of what’s been said elsewhere, I’ll only make one point about the article. Which is that about half of the examples Lehrer talks about don’t actually seem to me to qualify as instances of the decline effect–at least as Lehrer defines it. The best example of this comes when Lehrer discusses Jonathan Schooler’s attempt to demonstrate the existence of the decline effect by running a series of ESP experiments:

In 2004, Schooler embarked on an ironic imitation of Rhine’s research: he tried to replicate this failure to replicate. In homage to Rhine’s interests, he decided to test for a parapsychological phenomenon known as precognition. The experiment itself was straightforward: he flashed a set of images to a subject and asked him or her to identify each one. Most of the time, the response was negative–the images were displayed too quickly to register. Then Schooler randomly selected half of the images to be shown again. What he wanted to know was whether the images that got a second showing were more likely to have been identified the first time around. Could subsequent exposure have somehow influenced the initial results? Could the effect become the cause?

The craziness of the hypothesis was the point: Schooler knows that precognition lacks a scientific explanation. But he wasn’t testing extrasensory powers; he was testing the decline effect. “At first, the data looked amazing, just as we’d expected,” Schooler says. “I couldn’t believe the amount of precognition we were finding. But then, as we kept on running subjects, the effect size”–a standard statistical measure–“kept on getting smaller and smaller.” The scientists eventually tested more than two thousand undergraduates. “In the end, our results looked just like Rhine’s,” Schooler said. “We found this strong paranormal effect, but it disappeared on us.”

This is a pretty bad way to describe what’s going on, because it makes it sound like it’s a general principle of data collection that effects systematically get smaller. It isn’t. The variance around the point estimate of effect size certainly gets smaller as samples get larger, but the likelihood of an effect increasing is just as high as the likelihood of it decreasing. The absolutely critical point Lehrer left out is that you would only get the decline effect to show up if you intervened in the data collection or reporting process based on the results you were getting. Instead, most of Lehrer’s article presents the decline effect as if it’s some sort of mystery, rather than the well-understood process that it is. It’s as though Lehrer believes that scientific data has the magical property of telling you less about the world the more of it you have. Which isn’t true, of course; the problem isn’t that science is malfunctioning, it’s that scientists are still (kind of!) human, and are susceptible to typical human biases. The unfortunate net effect is that Lehrer’s article, while tremendously entertaining, achieves exactly the opposite of what good science journalism should do: it sows confusion about the scientific process and makes it easier for people to dismiss the results of good scientific work, instead of helping people develop a critical appreciation for the amazing power science has to tell us about the world.
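It’s easy to see this with a quick simulation. The sketch below is a toy illustration, not anything from Lehrer’s or Schooler’s actual data: the true effect size, the sample sizes, and the 0.5 cutoff for an “exciting” initial result are all numbers I invented for the example. Run honestly, follow-up estimates cluster around the truth; condition on a large initial estimate, and the follow-ups “decline”:

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2   # true standardized effect size (invented for illustration)

def study(n, effect=TRUE_EFFECT):
    """Observed effect: the mean of n noisy observations around the truth."""
    return statistics.fmean(random.gauss(effect, 1.0) for _ in range(n))

# 1) Honest replication: initial and follow-up estimates both straddle the
#    truth, and a follow-up is as likely to come out larger as smaller.
initial = [study(20) for _ in range(5000)]
followup = [study(200) for _ in range(5000)]
print(statistics.fmean(initial), statistics.fmean(followup))  # both near 0.2

# 2) Selective reporting: only "exciting" initial results (estimate > 0.5)
#    get published. Their follow-ups regress toward the true effect,
#    producing an apparent "decline effect".
published = [(e, study(200)) for e in initial if e > 0.5]
first = statistics.fmean(e for e, _ in published)
second = statistics.fmean(f for _, f in published)
print(first, second)  # first well above 0.2; second back near 0.2
```

The “decline” in the second case is nothing but selection plus regression to the mean: nothing about the subjects or the universe changed between the first study and the second.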

the parable of zoltan and his twelve sheep, or why a little skepticism goes a long way

What follows is a fictional piece about sheep and statistics. I wrote it about two years ago, intending it to serve as a preface to an article on the dangers of inadvertent data fudging. But then I decided that no journal editor in his or her right mind would accept an article that started out talking about thinking sheep. And anyway, the rest of the article wasn’t very good. So instead, I post this parable here for your ovine amusement. There’s a moral to the story, but I’m too lazy to write about it at the moment.

A shepherd named Zoltan lived in a small village in the foothills of the Carpathian Mountains. He tended to a flock of twelve sheep: Soffia, Krystyna, Anastasia, Orsolya, Marianna, Zigana, Julinka, Rozalia, Zsa Zsa, Franciska, Erzsebet, and Agi. Zoltan was a keen observer of animal nature, and would often point out the idiosyncrasies of his sheep’s behavior to other shepherds whenever they got together.

“Anastasia and Orsolya are BFFs. Whatever one does, the other one does too. If Anastasia starts licking her face, Orsolya will too; if Orsolya starts bleating, Anastasia will start harmonizing along with her.”

“Julinka has a limp in her left leg that makes her ornery. She doesn’t want your pity, only your delicious clovers.”

“Agi is stubborn but logical. You know that old saying, spare the rod and spoil the sheep? Well, it doesn’t work for Agi. You need calculus and rhetoric with Agi.”

Zoltan’s colleagues were so impressed by these insights that they began to encourage him to record his observations for posterity.

“Just think, Zoltan,” young Gergely once confided. “If something bad happened to you, the world would lose all of your knowledge. You should write a book about sheep and give it to the rest of us. I hear you only need to know six or seven related things to publish a book.”

On such occasions, Zoltan would hem and haw solemnly, mumbling that he didn’t know enough to write a book, and that anyway, nothing he said was really very important. It was false modesty of course; in reality, he was deeply flattered, and very much concerned that his vast body of sheep knowledge would disappear along with him one day. So one day, Zoltan packed up his knapsack, asked Gergely to look after his sheep for the day, and went off to consult with the wise old woman who lived in the next village.

The old woman listened to Zoltan’s story with a good deal of interest, nodding sagely at all the right moments. When Zoltan was done, the old woman mulled her thoughts over for a while.

“If you want to be taken seriously, you must publish your findings in a peer-reviewed journal,” she said finally.

“What’s Pier Evew?” asked Zoltan.

“One moment,” said the old woman, disappearing into her bedroom. She returned clutching a dusty magazine. “Here,” she said, handing the magazine to Zoltan. “This is peer review.”

That night, after his sheep had gone to bed, Zoltan stayed up late poring over Vol. IV, Issue 5 of Domesticated Animal Behavior Quarterly. Since he couldn’t understand the figures in the magazine, he read it purely for the articles. By the time he put the magazine down and leaned over to turn off the light, the first glimmerings of an empirical research program had begun to dance around in his head. Just like fireflies, he thought. No, wait, those really were fireflies. He swatted them away.

“I like this… science,” he mumbled to himself as he fell asleep.

In the morning, Zoltan went down to the local library to find a book or two about science. He checked out a volume entitled Principia Scientifica Buccolica—a masterful derivation from first principles of all of the most common research methods, with special applications to animal behavior. By lunchtime, Zoltan had covered t-tests, and by bedtime, he had mastered Mordenkainen’s correction for inestimable herds.

In the morning, Zoltan made his first real scientific decision.

“Today I’ll collect some pilot data,” he thought to himself, “and tomorrow I’ll apply for an R01.”

His first set of studies tested the provocative hypothesis that sheep communicate with one another by moving their ears back and forth in Morse code. Study 1 tested the idea observationally. Zoltan and two other raters (his younger cousins), both blind to the hypothesis, studied sheep in pairs, coding one sheep’s ear movements and the other sheep’s behavioral responses. Studies 2 through 4 manipulated the sheep’s behavior experimentally. In Study 2, Zoltan taped the sheep’s ears to their head; in Study 3, he covered their eyes with opaque goggles so that they couldn’t see each other’s ears moving. In Study 4, he split the twelve sheep into three groups of four in order to determine whether smaller groups might promote increased sociability.

That night, Zoltan mined the data. “It’s a lot like minding sheep,” Zoltan explained to his cousin Griga the next day. “You need to always be vigilant, so that a significant result doesn’t get away from you.”

Zoltan had been vigilant, and the first four studies produced a number of significant results. In Study 1, Zoltan found that sheep appeared to coordinate ear twitches: if one sheep twitched an ear several times in a row, it was a safe bet that other sheep would start to do the same shortly thereafter (p < .01). There was, however, no coordination of licking, headbutting, stamping, or bleating behaviors, no matter how you sliced and diced it. “It’s a highly selective effect,” Zoltan concluded happily. After all, when you thought about it, it made sense. If you were going to pick just one channel for sheep to communicate through, ear twitching was surely a good one. One could make a very good evolutionary argument that more obvious methods of communication (e.g., bleating loudly) would have been detected by humans long ago, and that would be no good at all for the sheep.

Studies 2 and 3 further supported Zoltan’s story. Study 2 demonstrated that when you taped sheep’s ears to their heads, they ceased to communicate entirely. You could put Rozalia and Erzsebet in adjacent enclosures and show Rozalia the Jack of Spades for three or four minutes at a time, and when you went to test Erzsebet, she still wouldn’t know the Jack of Spades from the Three of Diamonds. It was as if the sheep were blind! Except they weren’t blind, they were dumb. Zoltan knew; he had made them that way by taping their ears to their heads.

In Study 3, Zoltan found that when the sheep’s eyes were covered, they no longer coordinated ear twitching. Instead, they now coordinated their bleating—but only if you excluded bleats that were produced when the sheep’s heads were oriented downwards. “Fantastic,” he thought. “When you cover their eyes, they can’t see each other’s ears any more. So they use a vocal channel. This, again, makes good adaptive sense: communication is too important to eliminate entirely just because your eyes happen to be covered. Much better to incur a small risk of being detected and make yourself known in other, less subtle, ways.”

But the real clincher was Study 4, which confirmed that ear twitching occurred at a higher rate in smaller groups than larger groups, and was particularly common in dyads of well-adjusted sheep (like Anastasia and Orsolya, and definitely not like Zsa Zsa and Marianna).

“Sheep are like everyday people,” Zoltan told his sister on the phone. “They won’t say anything to your face in public, but get them one-on-one, and they won’t stop gossiping about each other.”

It was a compelling story, Zoltan conceded to himself. The only problem was the F test. The difference in twitch rates as a function of group size wasn’t quite statistically significant. Instead, it hovered around p = .07, which the textbooks told Zoltan meant that he was almost right. Almost right was the same thing as potentially wrong, which wasn’t good enough. So the next morning, Zoltan asked Gergely to lend him four sheep so he could increase his sample size.

“Absolutely not,” said Gergely. “I don’t want your sheep filling my sheep’s heads with all of your crazy new ideas.”

“Look,” said Zoltan. “If you lend me four sheep, I’ll let you drive my Cadillac down to the village on weekends after I get famous.”

“Deal,” said Gergely.

So Zoltan borrowed the sheep. But it turned out that four sheep weren’t quite enough; after adding Gergely’s sheep to the sample, the effect only went from p = .07 to p = .06. So Zoltan cut a deal with his other neighbor, Yuri: four of Yuri’s sheep for two days, in return for three days with Zoltan’s new Lexus (once he bought it). That did the trick. Once Zoltan repeated the experiment with Yuri’s sheep, the p-value for Study 4 now came to .046, which the textbooks assured Zoltan meant he was going to be famous.

Data in hand, Zoltan spent the next two weeks writing up his very first journal article. He titled it “Baa baa baa, or not: Sheep communicate via non-verbal channels”–a decidedly modest title for the first empirical work to demonstrate that sheep are capable of sophisticated propositional thought. The article was published to widespread media attention and scientific acclaim, and Zoltan went on to have a productive few years in animal behavioral research, studying topics as interesting and varied as giraffe calisthenics and displays of affection in the common leech.

Much later, it turned out that no one was able to directly replicate his original findings with sheep (though some other researchers did manage to come up with conceptual replications). But that didn’t really matter to Zoltan, because by then he’d decided science was too demanding a career anyway; it was way more fun to lie under trees counting his sheep. Counting sheep, and occasionally, on Saturdays, driving down to the village in his new Lexus, just to impress all the young cowgirls.
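Okay, fine, one small piece of the moral, for the statistically inclined. Zoltan’s trick of adding sheep until p dipped below .05 is known as optional stopping, and a toy simulation shows what it does to the false-positive rate when the true effect is exactly zero. Everything here is invented for illustration (the sample sizes, the number of peeks, and a fixed cutoff of 2.0 standing in for the exact t critical value):

```python
import math
import random
import statistics

random.seed(2)

def t_stat(xs):
    """One-sample t statistic against a population mean of zero."""
    return statistics.fmean(xs) / (statistics.stdev(xs) / math.sqrt(len(xs)))

def fixed_n_study(n=60):
    """Collect all n observations up front, then test exactly once."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    return abs(t_stat(xs)) > 2.0            # roughly p < .05, two-tailed

def peeking_study(start_n=20, max_n=60, step=4):
    """Test after every few new 'sheep'; stop the moment p < .05."""
    xs = [random.gauss(0, 1) for _ in range(start_n)]
    while True:
        if abs(t_stat(xs)) > 2.0:
            return True                     # "significant" -> publish
        if len(xs) >= max_n:
            return False                    # give up, stay unpublished
        xs += [random.gauss(0, 1) for _ in range(step)]

SIMS = 4000
fixed = sum(fixed_n_study() for _ in range(SIMS)) / SIMS
peeking = sum(peeking_study() for _ in range(SIMS)) / SIMS
print(fixed, peeking)   # fixed stays near .05; peeking lands well above it
```

Peeking after every few borrowed sheep inflates the nominal 5% error rate well above .05, which is how a flock with nothing to say ends up delivering p = .046.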