[Editorial note: The people and events described here are fictional. But the paper in question is quite real.]
“Dearly Beloved,” The Graduate Student began. “We are gathered here to–”
“Again?” Samantha interrupted. “Again with the Dearly Beloved speech? Can’t we just start a meeting like a normal journal club for once? We’re discussing papers here, not holding a funeral.”
“We will discuss papers,” said The Graduate Student indignantly. “In good time. But first, we have to follow the rules of Great Minds Journal Club. There’s a protocol, you know.”
Samantha was about to point out that she didn’t know, because The Graduate Student was the sole author of the alleged rules, and the alleged rules had a habit of changing every week. But she was interrupted by the sound of the double doors at the back of the room swinging violently inwards.
“Sorry I’m late,” said Jin, strolling into the room, one hand holding what looked like a large bucket of coffee with a lid on top. “What are we reading today?”
“Nothing,” said Lionel. “The reading has already happened. What we’re doing now is discussing the paper that everyone’s already read.”
“Right, right,” said Jin. “What I meant to ask was: what paper that we’ve all already read are we discussing today?”
“Statistically controlling for confounding constructs is harder than you think,” said The Graduate Student.
“I doubt it,” said Jin. “I think almost everything is intolerably difficult.”
“No, that’s the title of the paper,” Lionel chimed in. “Statistically controlling for confounding constructs is harder than you think. By Westfall and Yarkoni. In PLOS ONE. It’s what we picked to read for this week. Remember? Are you on the mailing list? Do you even work here?”
“Do I work here… Hah. Funny man. Remember, Lionel… I’ll be on your tenure committee in the Fall.”
“Why don’t we get started,” said The Graduate Student, eager to prevent a full-out sarcastathon. “I guess we can do our standard thing where Samantha and I describe the basic ideas and findings, talk about how great the paper is, and suggest some possible extensions… and then Jin and Lionel tear it to shreds.”
“Sounds good,” said Jin and Lionel in concert.
“The basic problem the authors highlight is pretty simple,” said Samantha. “It’s easy to illustrate with an example. Say you want to know if eating more bacon is associated with a higher incidence of colorectal cancer–like that paper that came out a while ago suggested. In theory, you could just ask people how often they eat bacon and how often they get cancer, and then correlate the two. But suppose you find a positive correlation–what can you conclude?”
“Not much,” said Pablo–apparently in a talkative mood. It was the first thing he’d said to anyone all day–and it was only 3 pm.
“Right. It’s correlational data,” Samantha continued. “Nothing is being experimentally manipulated here, so we have no idea if the bacon-cancer correlation reflects the effect of bacon itself, or if there’s some other confounding variable that explains the association away.”
“Like, people who exercise less tend to eat more bacon, and exercise also prevents cancer,” The Graduate Student offered.
“Or it could be a general dietary thing, and have nothing to do with bacon per se,” said Jin. “People who eat a lot of bacon also have all kinds of other terrible dietary habits, and it’s really the gestalt of all the bad effects that causes cancer, not any one thing in particular.”
“Or maybe,” suggested Pablo, “a sneaky parasite unknown to science invades the brain and the gut. It makes you want to eat bacon all the time. Because bacon is its intermediate host. And then it also gives you cancer. Just to spite you.”
“Right, it could be any of those things,” Samantha said. “Except for maybe that last one. The point is, there are many potential confounds. If we want to establish that there’s a ‘real’ association between bacon and cancer, we need to somehow remove the effect of other variables that could be correlated with both bacon-eating and cancer-having. The traditional way to do this is to statistically ‘control for’ or ‘hold constant’ the effects of confounding variables. The idea is that you add the confounding variables to your regression equation, so that you’re essentially asking: what would the relationship between bacon and cancer look like if we could eliminate the confounding influence of things like exercise, diet, alcohol, and brain-and-gut-eating parasites? It’s a very common move, and the logic of statistical control is used to justify a huge number of claims all over the social and biological sciences.”
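[Editorial note: for readers who want the mechanics, here is a minimal sketch of the kind of statistical control Samantha describes. The variable names, effect sizes, and the assumption that exercise is the sole confound are all invented for illustration.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Invented data-generating process: exercise confounds the bacon-cancer
# association, and bacon has NO direct effect on cancer here.
exercise = rng.normal(size=n)
bacon = -0.5 * exercise + rng.normal(size=n)        # less exercise -> more bacon
cancer_risk = -0.6 * exercise + rng.normal(size=n)  # exercise is protective

def ols_coefs(y, *predictors):
    """OLS coefficients (intercept first), via least squares."""
    X = np.column_stack([np.ones_like(y), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols_coefs(cancer_risk, bacon)[1])            # ~0.24: spurious association
print(ols_coefs(cancer_risk, bacon, exercise)[1])  # ~0.00: confound "controlled for"
```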
“I just published a paper showing that brain activation in frontoparietal regions predicts people’s economic preferences even after controlling for self-reported product preferences,” said Jin. “Please tell me you’re not going to shit all over my paper. Is that where this is going?”
“It is,” said Lionel gleefully. “That’s exactly where this is going.”
“It’s true,” Samantha said apologetically. “But if it’s any consolation, we’re also going to shit on Lionel’s finding that implicit prejudice is associated with voting behavior after controlling for explicit attitudes.”
“That’s actually pretty consoling,” said Jin, smiling at Lionel.
“So anyway, statistical control is pervasive,” Samantha went on. “But there’s a problem: statistical control–at least the way people typically do it–is a measurement-level technique. Meaning, when you control for the rate of alcohol use in a regression of cancer on bacon, you’re not really controlling for alcohol use. What you’re actually controlling for is just one particular operationalization of alcohol use–which probably doesn’t cover the entire construct, and is also usually measured with some error.”
“Could you maybe give an example?” asked Pablo. He was the youngest in the group, being only a second-year graduate student. (The Graduate Student, by contrast, had been in the club for so long that his real name had long ago been forgotten by the other members of the GMJC.)
“Sure,” said The Graduate Student. “Suppose your survey includes an item like ‘how often do you consume alcoholic beverages’, and the response options include things like never, less than once a month, I’m never not consuming alcoholic beverages, and so on. Now, people are not that great at remembering exactly how often they have a drink–especially the ones who tend to have a lot of drinks. On top of that, there’s a stigma against drinking a lot, so there’s probably going to be some degree of systematic underreporting. All of this conspires to give you a measure that’s less than perfectly reliable–meaning, it won’t give you the same values that you would get if you could actually track people for an extended period of time and accurately measure exactly how much ethanol they consume, by volume. In many, many cases, measured covariates of this kind are pretty mediocre.”
“I see,” said Pablo. “That makes sense. So why is that a problem?”
“Because you can’t control for that which you aren’t measuring,” Samantha said. “Meaning, if your alleged measure of alcohol consumption–or any other variable you care about–isn’t measuring the thing you care about with perfect accuracy, then you can’t remove its influence on other things. It’s easiest to see this if you think about the limiting case where your measurements are completely unreliable. Say you think you’re measuring weekly hours of exercise, but actually your disgruntled research assistant secretly switched out the true exercise measure for randomly generated values. When you then control for the alleged ‘exercise’ variable in your model, how much of the true influence of exercise are you removing?”
“None,” said Pablo.
“Right. Your alleged measure of exercise doesn’t actually reflect anything about exercise, so you’re accomplishing nothing by controlling for it. The same exact point holds–to varying degrees–when your measure is somewhat reliable, but not perfect. Which is to say, pretty much always.”
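[Editorial note: a small simulation of the limiting case above, using the same invented data-generating process as the earlier note. Controlling for the true exercise variable removes the spurious bacon coefficient; controlling for the disgruntled assistant’s pure-noise version removes nothing; and a half-reliable measure removes only part of it.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

exercise = rng.normal(size=n)
bacon = -0.5 * exercise + rng.normal(size=n)
cancer = -0.6 * exercise + rng.normal(size=n)    # again, no direct bacon effect

# Three versions of the covariate we might "control for":
true_measure = exercise                          # perfectly reliable
noise_measure = rng.normal(size=n)               # the disgruntled assistant's version
noisy_measure = exercise + rng.normal(size=n)    # reliability = 0.5

def bacon_coef(covariate):
    """Coefficient on bacon after controlling for the given covariate."""
    X = np.column_stack([np.ones(n), bacon, covariate])
    return np.linalg.lstsq(X, cancer, rcond=None)[0][1]

print(bacon_coef(true_measure))   # ~0.00: confounding fully removed
print(bacon_coef(noise_measure))  # ~0.24: nothing removed at all
print(bacon_coef(noisy_measure))  # ~0.13: only partially removed
```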
“You could also think about the same general issue in terms of construct validity,” The Graduate Student chimed in. “What you’re typically trying to do by controlling for something is account for a latent construct or concept you care about–not a specific measure. For example, the latent construct of a ‘healthy diet’ could be measured in many ways. You could ask people how much broccoli they eat, how much sugar or trans fat they consume, how often they eat until they can’t move, and so on. If you surveyed people with a lot of different items like this and then extracted the latent variance common to all of them, you might get a component that could be interpreted as something like ‘healthy diet’. But if you only use one or two items, they’re going to be very noisy indicators of the construct you care about. Which means you’re not really controlling for how healthy people’s diet is in your model relating bacon to cancer. At best, you’re controlling for, say, self-reported number of vegetables eaten. But there’s a very powerful temptation for authors to forget that caveat, and to instead think that their measurement-level conclusions automatically apply at the construct level. The result is that you end up with a huge number of papers saying things like ‘we show that fish oil promotes heart health even after controlling for a range of dietary and lifestyle factors’–when in fact the measurement-level variables they’ve controlled for can’t help but capture only a tiny fraction of all of the dietary and lifestyle factors that could potentially confound the association you care about.”
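[Editorial note: a quick illustration of the construct-level point, using made-up items. Each item is a weak, noisy indicator of the same latent ‘healthy diet’ construct; a composite of several items tracks the construct much better than any single item does.]

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_items = 50_000, 8

healthy_diet = rng.normal(size=n)        # the latent construct itself
# Eight noisy self-report items, each loading weakly on the construct
items = 0.5 * healthy_diet[:, None] + rng.normal(size=(n, n_items))

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(corr(items[:, 0], healthy_diet))         # single item:      ~0.45
print(corr(items.mean(axis=1), healthy_diet))  # 8-item composite: ~0.82
```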
“I see,” said Pablo. “But this seems like a pretty basic point, doesn’t it?”
“Yes,” said Lionel. “It’s a problem as old as time itself. It might even be older than Jin.”
Jin smiled at Lionel and tipped her coffee cup-slash-bucket towards him slightly in salute.
“In fairness to the authors,” said The Graduate Student, “they do acknowledge that essentially the same problem has been discussed in many literatures over the past few decades. And they cite some pretty old papers. Oldest one is from… 1965. Kahneman, 1965.”
An uncharacteristic silence fell over the room.
“That Kahneman?” Jin finally probed.
“The one and only.”
“Fucking Kahneman,” said Lionel. “That guy could really stand to leave a thing or two for the rest of us to discover.”
“So, wait,” said Jin, evidently coming around to Lionel’s point of view. “These guys cite a 50-year-old paper that makes essentially the same argument, and still have the temerity to publish this thing?”
“Yes,” said Samantha and The Graduate Student in unison.
“But to be fair, their presentation is very clear,” Samantha said. “They lay out the problem really nicely–which is more than you can say for many of the older papers. Plus there’s some neat stuff in here that hasn’t been done before, as far as I know.”
“Like what?” asked Lionel.
“There’s a nice framework for analytically computing error rates for any set of simple or partial correlations between two predictors and a DV. And, to save you the trouble of having to write your own code, there’s a Shiny web app.”
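[Editorial note: the flavor of that analytic framework can be sketched in a few lines, though this is not the authors’ actual code. Assume the construct-level bacon-cancer association runs entirely through one confound, and that the confound’s measure has reliability ρ; the standard attenuation and partial-correlation formulas then give the spurious partial correlation that survives ‘control’.]

```python
import numpy as np

def observed_partial(r_xt, r_yt, reliability):
    """Partial correlation of x and y controlling for a noisy measure z
    of the confound t, when the true x-y association is entirely due to t
    (so r_xy = r_xt * r_yt at the construct level)."""
    r_xy = r_xt * r_yt
    r_xz = r_xt * np.sqrt(reliability)   # correlations with z attenuate by sqrt(rho)
    r_yz = r_yt * np.sqrt(reliability)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# A perfectly reliable confound measure zeroes out the partial correlation;
# at reliability 0.7, a spurious "incremental" association survives control.
print(observed_partial(0.5, 0.6, 1.0))   # 0.0
print(observed_partial(0.5, 0.6, 0.7))   # ~0.11
```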
“In my day, you couldn’t just write a web app and publish it as a paper,” Jin grumbled. “Shiny or otherwise.”
“That’s because in your day, the internet didn’t exist,” Lionel helpfully offered.
“No internet?” The Graduate Student shrieked in horror. “How old are you, Jin?”
“Old enough to become very wise,” said Jin. “Very, very wise… and very corpulent with federal grant money. Money that I could, theoretically, use to fund–or not fund–a graduate student of my choosing next semester. At my complete discretion, of course.” She shot The Graduate Student a pointed look.
“There’s more,” Samantha went on. “They give some nice examples that draw on real data. Then they show how you can solve the problem with SEM–although admittedly that stuff all builds directly on textbook SEM work as well. And then at the end they go on to do some power calculations based on SEM instead of the standard multiple regression approach. I think that’s new. And the results are… not pretty.”
“How so?” asked Lionel.
“Well. Westfall and Yarkoni suggest that for fairly typical parameter regimes, researchers who want to make incremental validity claims at the latent-variable level–using SEM rather than multiple regression–might be looking at a bare minimum of several hundred participants, and often many thousands, in order to adequately power the desired inference.”
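[Editorial note: a back-of-envelope version of the power problem. This is not the authors’ SEM machinery; it uses a method-of-moments correction with known reliability, plus a rough normal approximation for power. The effect sizes, the reliability of 0.7, and the sample sizes are invented.]

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def corrected_beta_x(n, beta_x=0.1, beta_t=0.5, reliability=0.7, reps=2000):
    """Monte Carlo draws of the latent-level incremental effect estimate,
    using a known-reliability (method-of-moments) measurement correction."""
    est = np.empty(reps)
    for i in range(reps):
        t = rng.normal(size=n)                    # latent confound
        x = 0.4 * t + rng.normal(size=n)          # predictor of interest
        y = beta_x * x + beta_t * t + rng.normal(size=n)
        z = t + rng.normal(scale=np.sqrt(1 / reliability - 1), size=n)
        S = np.cov(np.column_stack([x, z, y]), rowvar=False)
        # Normal equations for regressing y on (x, latent t), substituting
        # Var(t) = reliability * Var(z), Cov(x,t) = Cov(x,z), Cov(t,y) = Cov(z,y)
        A = np.array([[S[0, 0], S[0, 1]],
                      [S[0, 1], reliability * S[1, 1]]])
        b = np.array([S[0, 2], S[1, 2]])
        est[i] = np.linalg.solve(A, b)[0]         # corrected estimate of beta_x
    return est

for n in (300, 2000, 10_000):
    draws = corrected_beta_x(n)
    se = draws.std()
    power = norm.sf(1.96 - draws.mean() / se)     # crude power approximation
    print(f"n={n:>6}: approximate power {power:.2f}")
```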
“Ouchie,” said Jin.
“What happens if there’s more than one potential confound?” asked Lionel. “Do they handle the more general multiple regression case, or only two predictors?”
“No, only two predictors,” said The Graduate Student. “Not sure why. Maybe they were worried they were already breaking enough bad news for one day.”
“Could be,” said Lionel. “You have to figure that in an SEM, when unreliability in the predictors is present, the uncertainty is only going to compound as you pile on more covariates–because it’s going to become increasingly unclear how the model should attribute any common variance that the predictor of interest shares with both the DV and at least one other covariate. So whatever power estimates they come up with in the paper for the single-covariate case are probably upper bounds on the ability to detect incremental contributions in the presence of multiple covariates. If you have a lot of covariates–like the epidemiology or nutrition types usually do–and at least some of your covariates are fairly unreliable, things could get ugly really quickly. Who knows what kind of sample sizes you’d need in order to make incremental validity claims about small effects in epi studies where people start controlling for the sun, moon, and stars. Hundreds of thousands? Millions? I have no idea.”
“Jesus,” said The Graduate Student. “That would make it almost impossible to isolate incremental contributions in large observational datasets.”
“Correct,” said Lionel.
“The thing I don’t get,” said Samantha, “is that the epidemiologists clearly already know about this problem. Or at least, some of them do. They’ve written dozens of papers about ‘residual confounding’, which is another name for the same problem Westfall and Yarkoni discuss. And yet there are literally thousands of large-sample, observational papers published in prestigious epidemiology, nutrition, or political science journals that never even mention this problem. If it’s such a big deal, why does almost nobody actually take any steps to address it?”
“Ah…” said Jin. “As the senior member of our group, I can probably answer that question best for you. You see, it turns out it’s quite difficult to publish a paper titled After an extensive series of SEM analyses of a massive observational dataset that cost the taxpayer three million dollars to assemble, we still have no idea if bacon causes cancer. Nobody wants to read that paper. You know what paper people do want to read? The one called Look at me, I eat so much bacon I’m guaranteed to get cancer according to the new results in this paper–but I don’t even care, because bacon is so delicious. That’s the paper people will read, and publish, and fund. So that’s the paper many scientists are going to write.”
A second uncharacteristic silence fell over the room.
“Bit of a downer today, aren’t you,” Lionel finally said. “I guess you’re playing the role of me? I mean, that’s cool. It’s a good look for you.”
“Yes,” Jin agreed. “I’m playing you. Or at least, a smarter, more eloquent, and better-dressed version of you.”
“Why don’t we move on,” Samantha interjected before Lionel could re-arm and respond. “Now that we’ve laid out the basic argument, should we try to work through the details and see what we find?”
“Yes,” said Lionel and Jin in unison–and proceeded to tear the paper to shreds.
[Comment from one of the paper’s authors:] Kahneman 1965 is the oldest paper we cite that discusses the root issue, but I later found an even older paper from about 30 years prior to that:
Stouffer, S. A. (1936). Evaluating the effect of inadequately measured variables in partial correlation analysis. Journal of the American Statistical Association, 31(194), 348–360.
My sense is that, despite the long history, it remains the case that not a lot of people outside of epidemiology, biostatistics, and psychometrics are aware of these issues. So one of the main functions of our paper is to yet again attempt to raise awareness. But for those who are at least vaguely aware of the issues already, I view the novel contributions of this paper as being basically three things.
First, while methodologists might be familiar with the idea that this is a theoretical problem that can exist in many contexts, few seem to appreciate the scope and magnitude of the problem in practice. What we clearly show is that, in many entirely realistic research situations, the problem can be pretty damn bad, and it tends to get worse (not better) as sample sizes grow. [A numeric sketch of this point appears after the third item below.]
Second, we point out that there are slightly more subtle forms of the basic incremental validity argument that people haven’t recognized (at least in publication) as suffering from the same general problem. To take the implicit/explicit attitude example from social psychology, we might seek to show that implicit and explicit political attitudes both significantly predict voting intentions, even after controlling for each other. While it is perhaps of some interest that we can predict voting intentions, the more theoretically interesting point (to a social psychologist) is that this would seem to indicate that implicit and explicit attitudes must, in fact, be separable psychological constructs, and not simply two ways of measuring the same thing. But this statistical argument is subject to all the same problems as the more classic incremental validity argument.
Third, we show that while one can perform a more correct statistical analysis that does a good job of controlling the Type 1 error rates, the trade-off is that the Type 2 error rates for this corrected analysis can be extremely high. Basically, if your study exists in a bad part of the parameter space (where reliability of the predictors is not great and measured confounds have strong effects), then it is just inherently difficult to make a good statistical case for the incremental validity of predictor constructs.
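[Editorial note: a short calculation of the point in the first item above, that the error gets worse as samples grow. It reuses the spurious partial correlation of roughly 0.11 from the earlier sketch (latent incremental effect exactly zero, confound reliability 0.7) and applies the usual normal-approximation test for a partial correlation. At a nominal 5% level, the false positive rate climbs toward 1 as n increases.]

```python
import numpy as np
from scipy.stats import norm

rp = 0.115   # spurious partial correlation from the earlier sketch (true latent effect: zero)

for n in (100, 500, 2000, 10_000):
    t_stat = rp * np.sqrt(n - 3) / np.sqrt(1 - rp**2)         # test statistic for a partial r
    # Probability that a nominal 5% two-sided test rejects, by normal approximation
    fpr = norm.sf(1.96 - t_stat) + norm.cdf(-1.96 - t_stat)
    print(f"n={n:>6}: false positive rate ~ {fpr:.2f}")
```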
[From the comments:] I wish every paper came with an accompanying Dialogue and an interactive webapp!