The Great Minds Journal Club discusses Westfall & Yarkoni (2016)

[Editorial note: The people and events described here are fictional. But the paper in question is quite real.]

“Dearly Beloved,” The Graduate Student began. “We are gathered here to–”

“Again?” Samantha interrupted. “Again with the Dearly Beloved speech? Can’t we just start a meeting like a normal journal club for once? We’re discussing papers here, not holding a funeral.”

“We will discuss papers,” said The Graduate Student indignantly. “In good time. But first, we have to follow the rules of Great Minds Journal Club. There’s a protocol, you know.”

Samantha was about to point out that she didn’t know, because The Graduate Student was the sole author of the alleged rules, and the alleged rules had a habit of changing every week. But she was interrupted by the sound of the double doors at the back of the room swinging violently inwards.

“Sorry I’m late,” said Jin, strolling into the room, one hand holding what looked like a large bucket of coffee with a lid on top. “What are we reading today?”

“Nothing,” said Lionel. “The reading has already happened. What we’re doing now is discussing the paper that everyone’s already read.”

“Right, right,” said Jin. “What I meant to ask was: what paper that we’ve all already read are we discussing today?”

“Statistically controlling for confounding constructs is harder than you think,” said The Graduate Student.

“I doubt it,” said Jin. “I think almost everything is intolerably difficult.”

“No, that’s the title of the paper,” Lionel chimed in. “Statistically controlling for confounding constructs is harder than you think. By Westfall and Yarkoni. In PLOS ONE. It’s what we picked to read for this week. Remember? Are you on the mailing list? Do you even work here?”

“Do I work here… Hah. Funny man. Remember, Lionel… I’ll be on your tenure committee in the Fall.”

“Why don’t we get started,” said The Graduate Student, eager to prevent a full-out sarcastathon. “I guess we can do our standard thing where Samantha and I describe the basic ideas and findings, talk about how great the paper is, and suggest some possible extensions… and then Jin and Lionel tear it to shreds.”

“Sounds good,” said Jin and Lionel in concert.

“The basic problem the authors highlight is pretty simple,” said Samantha. “It’s easy to illustrate with an example. Say you want to know if eating more bacon is associated with a higher incidence of colorectal cancer–like that paper that came out a while ago suggested. In theory, you could just ask people how often they eat bacon and how often they get cancer, and then correlate the two. But suppose you find a positive correlation–what can you conclude?”

“Not much,” said Pablo–apparently in a talkative mood. It was the first thing he’d said to anyone all day–and it was only 3 pm.

“Right. It’s correlational data,” Samantha continued. “Nothing is being experimentally manipulated here, so we have no idea if the bacon-cancer correlation reflects the effect of bacon itself, or if there’s some other confounding variable that explains the association away.”

“Like, people who exercise less tend to eat more bacon, and exercise also prevents cancer,” The Graduate Student offered.

“Or it could be a general dietary thing, and have nothing to do with bacon per se,” said Jin. “People who eat a lot of bacon also have all kinds of other terrible dietary habits, and it’s really the gestalt of all the bad effects that causes cancer, not any one thing in particular.”

“Or maybe,” suggested Pablo, “a sneaky parasite unknown to science invades the brain and the gut. It makes you want to eat bacon all the time. Because bacon is its intermediate host. And then it also gives you cancer. Just to spite you.”

“Right, it could be any of those things.” Samantha said. “Except for maybe that last one. The point is, there are many potential confounds. If we want to establish that there’s a ‘real’ association between bacon and cancer, we need to somehow remove the effect of other variables that could be correlated with both bacon-eating and cancer-having. The traditional way to do this is to statistical “control for” or “hold constant” the effects of confounding variables. The idea is that you adjust the variables in your regression equation so that you’re essentially asking what would the relationship between bacon and cancer look like if we could eliminate the confounding influence of things like exercise, diet, alcohol, and brain-and-gut-eating parasites? It’s a very common move, and the logic of statistical control is used to justify a huge number of claims all over the social and biological sciences.”

“I just published a paper showing that brain activation in frontoparietal regions predicts people’s economic preferences even after controlling for self-reported product preferences,” said Jin. “Please tell me you’re not going to shit all over my paper. Is that where this is going?”

“It is,” said Lionel gleefully. “That’s exactly where this is going.”

“It’s true,” Samantha said apologetically. “But if it’s any consolation, we’re also going to shit on Lionel’s finding that implicit prejudice is associated with voting behavior after controlling for explicit attitudes.”

“That’s actually pretty consoling,” said Jin, smiling at Lionel.

“So anyway, statistical control is pervasive,” Samantha went on. “But there’s a problem: statistical control–at least the way people typically do it–is a measurement-level technique. Meaning, when you control for the rate of alcohol use in a regression of cancer on bacon, you’re not really controlling for alcohol use. What you’re actually controlling for is just one particular operationalization of alcohol use–which probably doesn’t cover the entire construct, and is also usually measured with some error.”

“Could you maybe give an example,” asked Pablo. He was the youngest in the group, being only a second-year graduate student. (The Graduate Student, by contrast, had been in the club for so long that his real name had long ago been forgotten by the other members of the GMJC.)

“Sure,” said The Graduate Student. “Suppose your survey includes an item like ‘how often do you consume alcoholic beverages’, and the response options include things like never, less than once a month, I’m never not consuming alcoholic beverages, and so on. Now, people are not that great at remembering exactly how often they have a drink–especially the ones who tend to have a lot of drinks. On top of that, there’s a stigma against drinking a lot, so there’s probably going to be some degree of systematic underreporting. All of this contrives to give you a measure that’s less than perfectly reliable–meaning, it won’t give you the same values that you would get if you could actually track people for an extended period of time and accurately measure exactly how much ethanol they consume, by volume. In many, many cases, measured covariates of this kind are pretty mediocre.”

“I see,” said Pablo. “That makes sense. So why is that a problem?”

“Because you can’t control for that which you aren’t measuring,” Samantha said. “Meaning, if your alleged measure of alcohol consumption–or any other variable you care about–isn’t measuring the thing you care about with perfect accuracy, then you can’t remove its influence on other things. It’s easiest to see this if you think about the limiting case where your measurements are completely unreliable. Say you think you’re measuring weekly hours of exercise, but actually your disgruntled research assistant secretly switched out the true exercise measure for randomly generated values. When you then control for the alleged ‘exercise’ variable in your model, how much of the true influence of exercise are you removing?”

“None,” said Pablo.

“Right. Your alleged measure of exercise doesn’t actually reflect anything about exercise, so you’re accomplishing nothing by controlling for it. The same exact point holds–to varying degrees–when your measure is somewhat reliable, but not perfect. Which is to say, pretty much always.”

“You could also think about the same general issue in terms of construct validity,” The Graduate Student chimed in. “What you’re typically trying to do by controlling for something is account for a latent construct or concept you care about–not a specific measure. For example, the latent construct of a “healthy diet” could be measured in many ways. You could ask people how much broccoli they eat, how much sugar or transfat they consume, how often they eat until they can’t move, and so on. If you surveyed people with a lot of different items like this, and then extracted the latent variance common to all of them, then you might get a component that could be interpreted as something like ‘healthy diet’. But if you only use one or two items, they’re going to be very noisy indicators of the construct you care about. Which means you’re not really controlling for how healthy people’s diet is in your model relating bacon to cancer. At best, you’re controlling for, say, self-reported number of vegetables eaten. But there’s a very powerful temptation for authors to forget that caveat, and to instead think that their measurement-level conclusions automatically apply at the construct level. The result is that you end up with a huge number of papers saying things like ‘we show that fish oil promotes heart health even after controlling for a range of dietary and lifestyle factors’. When in fact the measurement-level variables they’ve controlled for can’t help but capture only a tiny fraction of all of the dietary and lifestyle factors that could potentially confound the association you care about.”

“I see,” said Pablo. “But this seems like a pretty basic point, doesn’t it?”

“Yes,” said Lionel. “It’s a problem as old as time itself. It might even be older than Jin.”

Jin smiled at Lionel and tipped her coffee cup-slash-bucket towards him slightly in salute.

“In fairness to the authors,” said The Graduate Student, “they do acknowledge that essentially the same problem has been discussed in many literatures over the past few decades. And they cite some pretty old papers. Oldest one is from… 1965. Kahneman, 1965.”

An uncharacteristic silence fell over the room.

That Kahneman?” Jin finally probed.

“The one and only.”

“Fucking Kahneman,” said Lionel. “That guy could really stand to leave a thing or two for the rest of us to discover.”

“So, wait,” said Jin, evidently coming around to Lionel’s point of view. “These guys cite a 50-year old paper that makes essentially the same argument, and still have the temerity to publish this thing?”

“Yes,” said Samantha and The Graduate Student in unison.

“But to be fair, their presentation is very clear,” Samantha said. “They lay out the problem really nicely–which is more than you can say for many of the older papers. Plus there’s some neat stuff in here that hasn’t been done before, as far as I know.”

“Like what?” asked Lionel.

“There’s a nice framework for analytically computing error rates for any set of simple or partial correlations between two predictors and a DV. And, to save you the trouble of having to write your own code, there’s a Shiny web app.”

“In my day, you couldn’t just write a web app and publish it as a paper,” Jin grumbled. “Shiny or otherwise.”

“That’s because in your day, the internet didn’t exist,” Lionel helpfully offered.

“No internet?” the Graduate Student shrieked in horror. “How old are you, Jin?”

“Old enough to become very wise,” said Jin. “Very, very wise… and very corpulent with federal grant money. Money that I could, theoretically, use to fund–or not fund–a graduate student of my choosing next semester. At my complete discretion, of course.” She shot The Graduate Student a pointed look.

“There’s more,” Samantha went on. “They give some nice examples that draw on real data. Then they show how you can solve the problem with SEM–although admittedly that stuff all builds directly on textbook SEM work as well. And then at the end they go on to do some power calculations based on SEM instead of the standard multiple regression approach. I think that’s new. And the results are… not pretty.”

“How so,” asked Lionel.

“Well. Westfall and Yarkoni suggest that for fairly typical parameter regimes, researchers who want to make incremental validity claims at the latent-variable level–using SEM rather than multiple regression–might be looking at a bare minimum of several hundred participants, and often many thousands, in order to adequately power the desired inference.”

“Ouchie,” said Jin.

“What happens if there’s more than one potential confound?” asked Lionel. “Do they handle the more general multiple regression case, or only two predictors?”

“No, only two predictors,” said The Graduate Student. “Not sure why. Maybe they were worried they were already breaking enough bad news for one day.”

“Could be,” said Lionel. “You have to figure that in an SEM, when unreliability in the predictors is present, the uncertainty is only going to compound as you pile on more covariates–because it’s going to become increasingly unclear how the model should attribute any common variance that the predictor of interest shares with both the DV and at least one other covariate. So whatever power estimates they come up with in the paper for the single-covariate case are probably upper bounds on the ability to detect incremental contributions in the presence of multiple covariates. If you have a lot of covariates–like the epidemiology or nutrition types usually do–and at least some of your covariates are fairly unreliable, things could get ugly really quickly. Who knows what kind of sample sizes you’d need in order to make incremental validity claims about small effects in epi studies where people start controlling for the sun, moon, and stars. Hundreds of thousands? Millions? I have no idea.”

“Jesus,” said The Graduate Student. “That would make it almost impossible to isolate incremental contributions in large observational datasets.”

“Correct,” said Lionel.

“The thing I don’t get,” said Samantha, “is that the epidemiologists clearly already know about this problem. Or at least, some of them do. They’ve written dozens of papers about ‘residual confounding’, which is another name for the same problem Westfall and Yarkoni discuss. And yet there are literally thousands of large-sample, observational papers published in prestigious epidemiology, nutrition, or political science journals that never even mention this problem. If it’s such a big deal, why does almost nobody actually take any steps to address it?”

“Ah…” said Jin. “As the senior member of our group, I can probably answer that question best for you. You see, it turns out it’s quite difficult to publish a paper titled After an extensive series of SEM analyses of a massive observational dataset that cost the taxpayer three million dollars to assemble, we still have no idea if bacon causes cancer. Nobody wants to read that paper. You know what paper people do want to read? The one called Look at me, I eat so much bacon I’m guaranteed to get cancer according to the new results in this paper–but I don’t even care, because bacon is so delicious. That’s the paper people will read, and publish, and fund. So that’s the paper many scientists are going to write.”

A second uncharacteristic silence fell over the room.

“Bit of a downer today, aren’t you,” Lionel finally said. “I guess you’re playing the role of me? I mean, that’s cool. It’s a good look for you.”

“Yes,” Jin agreed. “I’m playing you. Or at least, a smarter, more eloquent, and better-dressed version of you.”

“Why don’t we move on,” Samantha interjected before Lionel could re-arm and respond. “Now that we’ve laid out the basic argument, should we try to work through the details and see what we find?”

“Yes,” said Lionel and Jin in unison–and proceeded to tear the paper to shreds.

Internal consistency is overrated, or How I learned to stop worrying and love shorter measures, Part I

[This is the first of a two-part series motivating and introducing precis, a Python package for automated abbreviation of psychometric measures. In part I, I motivate the search for shorter measures by arguing that internal consistency is highly overrated. In part II, I describe some software that makes it relatively easy to act on this newly-acquired disregard by gleefully sacrificing internal consistency at the altar of automated abbreviation. If you’re interested in this general topic but would prefer a slightly less ridiculous more academic treatment, read this paper with Hedwig Eisenbarth and Scott Lilienfeld, or take a look at look at the demo IPython notebook.]

Developing a new questionnaire measure is a tricky business. There are multiple objectives one needs to satisfy simultaneously. Two important ones are:

  • The measure should be reliable. Validity is bounded by reliability; a highly unreliable measure cannot support valid inferences, and is largely useless as a research instrument.
  • The measure should be as short as is practically possible. Time is money, and nobody wants to sit around filling out a 300-item measure if a 60-item version will do.

Unfortunately, these two objectives are in tension with one another to some degree. Random error averages out as one adds more measurements, so in practice, one of the easiest ways to increase the reliability of a measure is to simply add more items. From a reliability standpoint, it’s often better to have many shitty indicators of a latent construct than a few moderately reliable ones*. For example, Cronbach’s alpha–an index of the internal consistency of a measure–is higher for a 20-item measure with a mean inter-item correlation of 0.1 than for a 5-item measure with a mean inter-item correlation of 0.3.

Because it’s so easy to increase reliability just by adding items, reporting a certain level of internal consistency is now practically a requirement in order for a measure to be taken seriously. There’s a reasonably widespread view that an adequate level of reliability is somewhere around .8, and that anything below around .6 is just unacceptable. Perhaps as a consequence of this convention, researchers developing new questionnaires will typically include as many items as it takes to hit a “good” level of internal consistency. In practice, relatively few measures use fewer than 8 to 10 items to score each scale (though there are certainly exceptions, e.g., the Ten Item Personality Inventory). Not surprisingly, one practical implication of this policy is that researchers are usually unable to administer more than a handful of questionnaires to participants, because nobody has time to sit around filling out a dozen 100+ item questionnaires.

While understandable from one perspective, the insistence on attaining a certain level of internal consistency is also problematic. It’s easy to forget that while reliability may be necessary for validity, high internal consistency is not. One can have an extremely reliable measure that possesses little or no internal consistency. This is trivial to demonstrate by way of thought experiment. As I wrote in this post a few years ago:

Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable.

In fact, we can push this line of thought even further, and say that the perfect measure—in the sense of maximizing both reliability and brevity—should actually have an internal consistency of exactly zero. A value any higher than zero would imply the presence of redundancy between items, which in turn would suggest that we could (at least in theory, though typically not in practice) get rid of one or more items without reducing the amount of variance captured by the measure as a whole.

To use a spatial analogy, suppose we think of each of our measure’s items as a circle in a 2-dimensional space:

circles! we haz them.

Here, our goal is to cover the maximum amount of territory using the smallest number of circles (analogous to capturing as much variance in participant responses as possible using the fewest number of items). By this light, the solution in the above figure is kind of crummy, because it fails to cover much of the space despite having 20 circles to work with. The obvious problem is that there’s a lot of redundancy between the circles—many of them overlap in space. A more sensible arrangement, assuming we insisted on keeping all 20 circles, would look like this:

oOooo

In this case we get complete coverage of the target space just by realigning the circles to minimize overlap.

Alternatively, we could opt to cover more or less the same territory as the first arrangement, but using many fewer circles (in this case, 10):

abbreviated_layout

It turns out that what goes for our toy example in 2D space also holds for self-report measurement of psychological constructs that exist in much higher dimensions. For example, suppose we’re interested in developing a new measure of Extraversion, broadly construed. We want to make sure our measure covers multiple aspects of Extraversion—including sociability, increased sensitivity to reward, assertiveness, talkativeness, and so on. So we develop a fairly large item pool, and then we iteratively select groups of items that (a) have good face validity as Extraversion measures, (b) predict external criteria we think Extraversion should predict (predictive validity), and (c) tend to to correlate with each other modestly-to-moderately. At some point we end up with a measure that satisfies all of these criteria, and then presumably we can publish our measure and go on to achieve great fame and fortune.

So far, so good—we’ve done everything by the book. But notice something peculiar about the way the book would have us do things: the very fact that we strive to maintain reasonably solid correlations between our items actually makes our measurement approach much less efficient. To return to our spatial analogy, it amounts to insisting that our circles have to have a high degree of overlap, so that we know for sure that we’re actually measuring what we think we’re measuring. And to be fair, we do gain something for our trouble, in the sense that we can look at our little plot above and say, a-yup, we’re definitely covering that part of the space. But we also lose something, in that we waste a lot of items (or circles) trying to cover parts of the space that have already been covered by other items.

Why would we do something so inefficient? Well, the problem is that in the real world—unlike in our simple little 2D world—we don’t usually know ahead of time exactly what territory we need to cover. We probably have a fuzzy idea of our Extraversion construct, and we might have a general sense that, you know, we should include both reward-related and sociability-related items. But it’s not as if there’s a definitive and unambiguous answer to the question “what behaviors are part of the Extraversion construct?”. There’s a good deal of variation in human behavior that could in principle be construed as part of the latent Extraversion construct, but that in practice is likely to be overlooked (or deliberately omitted) by any particular measure of Extraversion. So we have to carefully explore the space. And one reasonable way to determine whether any given item within that space is still measuring Extraversion is to inspect its correlations with other items that we consider to be unambiguous Extraversion items. If an item correlates, say, 0.5 with items like “I love big parties” and “I constantly seek out social interactions”, there’s a reasonable case to be made that it measures at least some aspects of Extraversion. So we might decide to keep it in our measure. Conversely, if an item shows very low correlations with other putative Extraversion items, we might incline to throw it out.

Now, there’s nothing intrinsically wrong with this strategy. But what’s important to realize is that, once we’ve settled on a measure we’re happy with, there’s no longer a good reason to keep all of that redundancy hanging around. It may be useful when we first explore the territory, but as soon as we yell out FIN! and put down our protractors and levels (or whatever it is the kids are using to create new measures these days), it’s now just costing us time and money by making data collection less efficient. We would be better off saying something like, hey, now that we know what we’re trying to measure, let’s see if we can measure it equally well with fewer items. And at that point, we’re in the land of criterion-based measure development, where the primary goal is to predict some target criterion as accurately as possible, foggy notions of internal consistency be damned.

Unfortunately, committing ourselves fully to the noble and just cause of more efficient measurement still leaves open the question of just how we should go about eliminating items from our overly long measures. For that, you’ll have to stay tuned for Part II, wherein I use many flowery words and some concise Python code to try to convince you that this piece of software provides one reasonable way to go about it.

* On a tangential note, this is why traditional pre-publication peer review isn’t very effective, and is in dire need of replacement. Meta-analytic estimates put the inter-reviewer reliability across fields at around .2 to .3, and it’s rare to have more than two or three reviewers on a paper. No psychometrician would recommend evaluating people’s performance in high-stakes situations with just two items that have a ~.3 correlation, yet that’s how we evaluate nearly all of the scientific literature!

The reviewer’s dilemma, or why you shouldn’t get too meta when you’re supposed to be writing a review that’s already overdue

When I review papers for journals, I often find myself facing something of a tension between two competing motives. On the one hand, I’d like to evaluate each manuscript as an independent contribution to the scientific literature–i.e., without having to worry about how the manuscript stacks up against other potential manuscripts I could be reading. The rationale being that the plausibility of the findings reported in a manuscript shouldn’t really depend on what else is being published in the same journal, or in the field as a whole: if there are methodological problems that threaten the conclusions, they shouldn’t become magically more or less problematic just because some other manuscript has (or doesn’t have) gaping holes. Reviewing should simply be a matter of documenting one’s major concerns and suggestions and sending them back to the Editor for infallible judgment.

The trouble with this idea is that if you’re of a fairly critical bent, you probably don’t believe the majority of the findings reported in the manuscripts sent to you to review. Empirically, this actually appears to be the right attitude to hold, because as a good deal of careful work by biostatisticians like John Ioannidis shows, most published research findings are false, and most true associations are inflated. So, in some ideal world, where the job of a reviewer is simply to assess the likelihood that the findings reported in a paper provide an accurate representation of reality, and/or to identify ways of bringing those findings closer in line with reality, skepticism is the appropriate default attitude. Meaning, if you keep the question “why don’t I believe these results?” firmly in mind as you read through a paper and write your review, you probably aren’t going to go wrong all that often.

The problem is that, for better or worse, one’s job as a reviewer isn’t really–or at least, solely–to evaluate the plausibility of other people’s findings. In large part, it’s to evaluate the plausibility of reported findings in relation to the other stuff that routinely gets published in the same journal. For instance, if you regularly reviewing papers for a very low-tier journal, the editor is probably not going to be very thrilled to hear you say “well, Ms. Editor, none of the last 15 papers you’ve sent me are very good, so you should probably just shut down the journal.” So a tension arises between writing a comprehensive review that accurately captures what the reviewer really thinks about the results–which is often (at least in my case) something along the lines of “pffft, there’s no fucking way this is true”–and writing a review that weighs the merits of the reviewed manuscript relative to the other candidates for publication in the same journal.

To illustrate, suppose I review a paper and decide that, in my estimation, there’s only a 20% chance the key results reported in the paper would successfully replicate (for the sake of argument, we’ll pretend I’m capable of this level of precision). Should I recommend outright rejection? Maybe, since 1 in 5 odds of long-term replication don’t seem very good. But then again, what if 20% is actually better than average? What if I think the average article I’m sent to review only has a 10% chance of holding up over time? In that case, if I recommend rejection of the 20% article, and the editor follows my recommendation, most of the time I’ll actually be contributing to the journal publishing poorer quality articles than if I’d recommended accepting the manuscript, even if I’m pretty sure the findings reported in the manuscript are false.

Lest this sound like I’m needlessly overanalyzing the review process instead of buckling down and writing my own overdue reviews (okay, you’re right, now stop being a jerk), consider what happens when you scale the problem up. When journal editors send reviewers manuscripts to look over, the question they really want an answer to is, “how good is this paper compared to everything else that crosses my desk?” But most reviewers naturally incline to answer a somewhat different–and easier–question, namely, “in the grand scheme of life, the universe, and everything, how good is this paper?” The problem, then, is that if the variance in curmudgeonliness between reviewers exceeds the (reliable) variance within reviewers, then arguably the biggest factor in determining whether or not a given paper gets rejected is simply who happens to review it. Not how much expertise the reviewer has, or even how ‘good’ they are (in the sense that some reviewers are presumably better than others at identifying serious problems and overlooking trivial ones), but simply how critical they are on average. Which is to say, if I’m Reviewer 2 on your manuscript, you’ll probably have a better chance of rejection than if Reviewer 2 is someone who characteristically writes one-paragraph reviews that begin with the words “this is an outstanding and important piece of work…”

Anyway, on some level this is a pretty trivial observation; after all, we all know that the outcome of the peer review process is, to a large extent, tantamount to a roll of the dice. We know that there are cranky reviewers and friendly reviewers, and we often even have a sense of who they are, which is why we often suggest people to include or exclude as reviewers in our cover letters. The practical question though–and the reason for bringing this up here–is this: given that we have this obvious and ubiquitous problem of reviewers having different standards for what’s publishable, and that this undeniably impacts the outcome of peer review, are there any simple steps we could take to improve the reliability of the review process?

The way I’ve personally made peace between my desire to provide the most comprehensive and accurate review I can and the pragmatic need to evaluate each manuscript in relation to other manuscripts is to use the “comments to the Editor” box to provide some additional comments about my review. Usually what I end up doing is writing my review with little or no thought for practical considerations such as “how prestigious is this journal” or “am I a particularly harsh reviewer” or “is this a better or worse paper than most others in this journal”. Instead, I just write my review, and then when I’m done, I use the comments to the editor to say things like “I’m usually a pretty critical reviewer, so don’t take the length of my review as an indication I don’t like the manuscript, because I do,” or, “this may seem like a negative review, but it’s actually more positive than most of my reviews, because I’m a huge jerk.” That way I can appease my conscience by writing the review I want to while still giving the editor some indication as to where I fit in the distribution of reviewers they’re likely to encounter.

I don’t know if this approach makes any difference at all, and maybe editors just routinely ignore this kind of thing; it’s just the best solution I’ve come up with that I can implement all by myself, without asking anyone else to change their behavior. But if we allow ourselves to contemplate alternative approaches that include changes to the review process itself (while still adhering to the standard pre-publication review model, which, like many other people, I’ve argued is fundamentally dysfunctional), then there are many other possibilities.

One idea, for instance, would be to include calibration questions that could be used to estimate (and correct for) individual differences in curmudgeonliness. For instance, in addition to questions about the merit of the manuscript itself, the review form could have a question like “what proportion of articles you review do you estimate end up being rejected?” or “do you consider yourself a more critical or less critical reviewer than most of your peers?”

Another, logistically more difficult, idea would be to develop a centralized database of review outcomes, so that editors could see what proportion of each reviewer’s assignments ultimately end up being rejected (though they couldn’t see the actual content of the reviews). I don’t know if this type of approach would improve matters at all; it’s quite possible that the review process is fundamentally so inefficient and slow that editors just don’t have the time to spend worrying about this kind of thing. But it’s hard to believe that there aren’t some simple calibration steps we could take to bring reviewers into closer alignment with one another–even if we’re confined to working within the standard pre-publication model of peer review. And given the abysmally low reliability of peer review, even small improvements could potentially produce large benefits in the aggregate.

functional MRI and the many varieties of reliability

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself to adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations”” paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actual the ideal situations for each of those two types of studies.

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a “˜reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered “˜excellent’, results between 0.59 and 0.75 be considered “˜good’, results between .40 and .58 be considered “˜fair’, and results lower than 0.40 be considered “˜poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

ResearchBlogging.org
Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences