I hate open science

Now that I’ve got your attention: what I hate—and maybe dislike is a better term than hate—isn’t the open science community, or open science initiatives, or open science practices, or open scientists… it’s the term. I fundamentally dislike the term open science. For the last few years, I’ve deliberately tried to avoid using it. I don’t call myself an open scientist, I don’t advocate publicly for open science (per se), and when people use the term around me, I often make a point of asking them to clarify what they mean.

This isn’t just a personal idiosyncrasy of mine in a chalk-on-chalkboard sense; I think at this point in time there are good reasons to think the continued use of the term is counterproductive, and we should try to avoid it in most contexts. Let me explain.

It’s ambiguous

At SIPS 2019 last week (SIPS is the Society for the Improvement of Psychological Science), I had a brief chat with a British post-undergrad student who was interested in applying to graduate programs in the United States. He asked me what kind of open science community there was at my home institution (the University of Texas at Austin). When I started to reply, I realized that I actually had no idea what question the student was asking me, because I didn’t know his background well enough to provide the appropriate context. What exactly did he mean by “open science”? The term is now used so widely, and in so many different ways, that the student could plausibly have been asking me about any of the following things, either alone or in combination:

  • Reproducibility. Do people [at UT-Austin] value the ability to reproduce, computationally and/or experimentally, the scientific methods used to produce a given result? More concretely, do they conduct their analyses programmatically, rather than using GUIs? Do they practice formal version control? Are there opportunities to learn these kinds of computational skills?
  • Accessibility. Do people believe in making their scientific data, materials, results, papers, etc. publicly, freely, and easily available? Do they work hard to ensure that other scientists, funders, and the taxpaying public can easily get access to what scientists produce?
  • Incentive alignment. Are there people actively working to align individual incentives and communal incentives, so that what benefits an individual scientist also benefits the community at large? Do they pursue local policies meant to promote some of the other practices one might call part of “open science”?
  • Openness of opinion. Do people feel comfortable openly critiquing one another? Is there a culture of discussing (possibly trenchant) problems openly, without defensiveness? Do people take discussion on social media and post-publication review forums seriously?
  • Diversity. Do people value and encourage the participation in science of people from a wide variety of ethnicities, genders, skills, personalities, socioeconomic strata, etc.? Do they make efforts to welcome others into science, invest effort and resources to help them succeed, and accommodate their needs?
  • Metascience and informatics. Are people thinking about the nature of science itself, and reflecting on what it takes to promote a healthy and productive scientific enterprise? Are they developing systematic tools or procedures for better understanding the scientific process, or the work in specific scientific domains?

This is not meant to be a comprehensive list; I have no doubt there are other items one could add (e.g., transparency, collaborativeness, etc.). The point is that open science is, at this point, a very big tent. It contains people who harbor a lot of different values and engage in many different activities. While some of these values and activities may tend to co-occur within people who call themselves open scientists, many don’t. There is, for instance, no particular reason why someone interested in popularizing reproducible science methods should also be very interested in promoting diversity in science. I’m not saying there aren’t people who want to do both (of course there are); empirically, there might even be a modest positive correlation—I don’t know. But they clearly don’t have to go together, and plenty of people are far more invested in one than in the other.

Further, as in any other enterprise, if you monomaniacally push a single value hard enough, then at a certain point, tensions will arise even between values that would ordinarily co-exist peacefully if each were given only partial priority. For example, if you think that doing reproducible science well requires a non-negotiable commitment to doing all your analyses programmatically, and maintaining all your code under public version control, then you’re implicitly condoning a certain reduction in diversity within science, because you insist on having only people with a certain set of skills take part in science, and people from some backgrounds are more likely than others (at least at present) to have those skills. Conversely, if diversity in science is the thing you value most, then you need to accept that you’re effectively downgrading the importance of many of the other values listed above in the research process, because any skill or ability you might use to select or promote people in science is necessarily going to reduce (in expectation) the role of other dimensions in the selection process.

This would be a fairly banal and inconsequential observation if we lived in a world where everyone who claimed membership in the open science community shared more or less the same values. But we clearly don’t. In highlighting the ambiguity of the term open science, I’m not just saying hey, just so you know, there are a lot of different activities people call open science; I’m saying that, at this point in time, there are a few fairly distinct sub-communities of people that all identify closely with the term open science and use it prominently to describe themselves or their work, but that actually have fairly different value systems and priorities.

Basically, we’re now at the point where, when someone says they’re an open scientist, it’s hard to know what they actually mean.

It wasn’t always this way; I think ten or even five years ago, if you described yourself as an open scientist, people would have identified you primarily with the movement to open up access to scientific resources and promote greater transparency in the research process. This is still roughly the first thing you find on the Wikipedia entry for Open Science:

Open science is the movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of an inquiring society, amateur or professional. Open science is transparent and accessible knowledge that is shared and developed through collaborative networks. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge.

That was a fine definition once upon a time, and it still works well for one part of the open science community. But as a general, context-free definition, I don’t think it flies any more. Open science is now much broader than the above suggests.

It’s bad politics

You might say, okay, but so what if open science is an ambiguous term; why can’t that be resolved by just having people ask for clarification? Well, obviously, to some degree it can. My response to the SIPS student was basically a long and winding one that involved a lot of conditioning on different definitions. That’s inefficient, but hopefully the student still got the information he wanted out of it, and I can live with a bit of inefficiency.

The bigger problem, though, is that at this point in time, open science isn’t just a descriptive label for a set of activities scientists often engage in; for many people, it’s become an identity. And, whatever you think the value of open science is as an extensional label for a fairly heterogeneous set of activities, I think it makes for terrible identity politics.

There are two reasons for this. First, turning open science from a descriptive label into a full-blown identity risks turning off a lot of scientists who are either already engaged in what one might otherwise call “best practices”, or who are very receptive to learning such practices, but are more interested in getting their science done than in discussing the abstract merits of those practices or promoting their use to others. If you walk into a room and say, in the next three hours, I’m going to teach you version control, and there’s a good chance this could really help your research, probably quite a few people will be interested. If, on the other hand, you walk into the room and say, let me tell you how open science is going to revolutionize your research, and then proceed to either mention things that a sophisticated audience already knows, or blitz a naive audience with 20 different practices that you describe as all being part of open science, the reception is probably going to be frostier.

If your goal is to get people to implement good practices in their research—and I think that’s an excellent goal!—then it’s not so clear that much is gained by talking about open science as a movement, philosophy, culture, or even community (though I do think there are some advantages to the latter). It may be more effective to figure out who your audience is, what some of the low-hanging fruit are, and focus on those. Implying that there’s an all-or-none commitment—i.e., one is either an open scientist or not, and to be one, you have to buy into a whole bunch of practices and commitments—is often counterproductive.

The second problem with treating open science as a movement or identity is that the diversity of definitions and values I mentioned above almost inevitably leads to serious rifts within the broad open science community—i.e., between groups of people who would have little or no beef with one another if not for the mere fact that they all happen to identify as open scientists. If you spend any amount of time on social media following people whose biography includes the phrases “open science” or “open scientist”, you’ll probably know what I’m talking about. At a rough estimate, I’d guess that these days maybe 10 – 20% of tweets I see in my feed containing the words “open science” are part of some ongoing argument between people about what open science is, or who is and isn’t an open scientist, or what’s wrong with open science or open scientists—and have nothing to do with substantive practices or applications at all.

I think it’s fair to say that most (though not all) of these arguments are, at root, about deep-seated differences in the kinds of values I mentioned earlier. People care about different things. Some people care deeply about making sure that studies can be accurately reproduced, and only secondarily or tertiarily about the diversity of the people producing those studies. Other people have the opposite priorities. Both groups of people (and there are of course many others) tend to think their particular value system properly captures what open science is (or should be) all about, and that the movement or community is being perverted or destroyed by some other group of people who, while perhaps well-intentioned (and sometimes even this modicum of charity is hard to find), just don’t have their heads screwed on quite straight.

This is not a new or special thing. Any time a large group of people with diverse values and interests find themselves all forced to sit under a single tent for a long period of time, divisions—and consequently, animosity—will eventually arise. If you’re forced to share limited resources or audience attention with a group of people who claim they fill the same role in society that you do, but who you disagree with on some important issues, odds are you’re going to experience conflict at some point.

Now, in some domains, these kinds of conflicts are truly unavoidable: the factors that introduce intra-group competition for resources, prestige, or attention are structural, and resolving them without ruining things for everyone is very difficult. In politics, for example, one’s nominal affiliation with a political party is legitimately kind of a big deal. In the United States, if a splinter group of disgruntled Republican politicians were to leave their party and start a “New Republican” party, they might achieve greater ideological purity and improve their internal social relations, but the new party’s members would also lose nearly all of their influence and power pretty much overnight. The same is, of course, true for disgruntled Democrats. The Nash equilibrium is, presently, for everyone to stay stuck in the same dysfunctional two-party system.

Open science, by contrast, doesn’t really have this problem. Or at least, it doesn’t have to have this problem. There’s an easy way out of the acrimony: people can just decide to deprecate vague, unhelpful terms like “open science” in favor of more informative and less controversial ones. I don’t think anything terrible is going to happen if someone who previously described themselves as an “open scientist” starts avoiding that term and instead opts to self-describe using more specific language. As I noted above, I speak from personal experience here (if you’re the kind of person who’s more swayed by personal anecdotes than by my ironclad, impregnable arguments). Five years ago, my talks and papers were liberally sprinkled with the term “open science”. For the last two or three years, I’ve largely avoided the term—and when I do use it, it’s often to make the same point I’m making here. E.g.,:

For the most part, I think I’ve succeeded in eliminating open science from my discourse in favor of more specific terms like reproducibility, transparency, diversity, etc. Which term I use depends on the context. I haven’t, so far, found myself missing the term “open”, and I don’t think I’ve lost brownie points in any club for not using it more often. I do, on the other hand, feel very confident that (a) I’ve managed to waste fewer people’s time by having to follow up vague initial statements about “open” things with more detailed clarifications, and (b) I get sucked into way fewer pointless Twitter arguments about what open science is really about (though admittedly the number is still not quite zero).

The prescription

So here’s my simple prescription for people who either identify as open scientists, or use the term on a regular basis: Every time you want to use the term open science—in your biography, talk abstracts, papers, tweets, conversation, or whatever else—pause and ask yourself if there’s another term you could substitute that would decrease ambiguity and avoid triggering never-ending terminological arguments. I’m not saying that the answer will always be yes. If you’re confident that the people you’re talking to have the same definition of open science as you, or you really do believe that nobody should ever call themselves an open scientist unless they use git, then godspeed—open science away. But I suspect that for most uses, there won’t be any such problem. In most instances, “open science” can be seamlessly replaced with something like “reproducibility”, “transparency”, “data sharing”, “being welcoming”, and so on. It’s a low-effort move, and the main effect of making the switch is that other people will have a clearer understanding of what you mean, and may be less inclined to argue with you about it.

Postscript

Some folks on Twitter were concerned that this post makes it sound as if I’m passing off prior work and ideas as my own (particularly as relates to the role of diversity in open science). So let me explicitly state here that I don’t think any of the ideas expressed in this post are original to me in any way. I’ve heard most (if not all) expressed many times by many people in many contexts, and this post just represents my effort to distill them into a clear summary of my views.

Internal consistency is overrated, or How I learned to stop worrying and love shorter measures, Part I

[This is the first of a two-part series motivating and introducing precis, a Python package for automated abbreviation of psychometric measures. In part I, I motivate the search for shorter measures by arguing that internal consistency is highly overrated. In part II, I describe some software that makes it relatively easy to act on this newly-acquired disregard by gleefully sacrificing internal consistency at the altar of automated abbreviation. If you’re interested in this general topic but would prefer a less ridiculous, more academic treatment, read this paper with Hedwig Eisenbarth and Scott Lilienfeld, or take a look at the demo IPython notebook.]

Developing a new questionnaire measure is a tricky business. There are multiple objectives one needs to satisfy simultaneously. Two important ones are:

  • The measure should be reliable. Validity is bounded by reliability; a highly unreliable measure cannot support valid inferences, and is largely useless as a research instrument.
  • The measure should be as short as is practically possible. Time is money, and nobody wants to sit around filling out a 300-item measure if a 60-item version will do.

Unfortunately, these two objectives are in tension with one another to some degree. Random error averages out as one adds more measurements, so in practice, one of the easiest ways to increase the reliability of a measure is to simply add more items. From a reliability standpoint, it’s often better to have many shitty indicators of a latent construct than a few moderately reliable ones*. For example, Cronbach’s alpha–an index of the internal consistency of a measure–is higher for a 20-item measure with a mean inter-item correlation of 0.1 than for a 5-item measure with a mean inter-item correlation of 0.3.
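If you want to check that arithmetic yourself, here is a minimal sketch using the standardized form of Cronbach's alpha, alpha = k*r / (1 + (k - 1)*r), where k is the number of items and r is the mean inter-item correlation. The function name is just an illustrative choice, and the formula assumes standardized items (equal item variances):

```python
def standardized_alpha(n_items: int, mean_r: float) -> float:
    """Standardized Cronbach's alpha for n_items items whose mean
    inter-item correlation is mean_r (assumes equal item variances)."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


long_weak = standardized_alpha(20, 0.1)    # 20 items, mean r = 0.1
short_strong = standardized_alpha(5, 0.3)  # 5 items, mean r = 0.3

print(f"20 items, r = 0.1: alpha = {long_weak:.3f}")
print(f" 5 items, r = 0.3: alpha = {short_strong:.3f}")
```

The 20-item measure comes out at about 0.690 and the 5-item measure at about 0.682, so the long measure edges ahead despite its much weaker items.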

Because it’s so easy to increase reliability just by adding items, reporting a certain level of internal consistency is now practically a requirement in order for a measure to be taken seriously. There’s a reasonably widespread view that an adequate level of reliability is somewhere around .8, and that anything below around .6 is just unacceptable. Perhaps as a consequence of this convention, researchers developing new questionnaires will typically include as many items as it takes to hit a “good” level of internal consistency. In practice, relatively few measures use fewer than 8 to 10 items to score each scale (though there are certainly exceptions, e.g., the Ten Item Personality Inventory). Not surprisingly, one practical implication of this policy is that researchers are usually unable to administer more than a handful of questionnaires to participants, because nobody has time to sit around filling out a dozen 100+ item questionnaires.

While understandable from one perspective, the insistence on attaining a certain level of internal consistency is also problematic. It’s easy to forget that while reliability may be necessary for validity, high internal consistency is not. One can have an extremely reliable measure that possesses little or no internal consistency. This is trivial to demonstrate by way of thought experiment. As I wrote in this post a few years ago:

Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable.
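For readers who prefer simulation to thought experiment, here is a rough sketch of the same idea. Everything in it is an assumption made for illustration: the two traits are independent standard normal variables, each item is measured with a small amount of error (SD = 0.1), and "true reliability" is approximated by the test-retest correlation of the summed score across two administrations.

```python
import random
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score lists (one list per item)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))


rng = random.Random(0)
n_people = 5000

# Two independent (uncorrelated) true scores per person.
shoelace = [rng.gauss(0, 1) for _ in range(n_people)]
blue = [rng.gauss(0, 1) for _ in range(n_people)]


def administer(true_scores, noise_sd=0.1):
    """One administration: the true score plus a little measurement error."""
    return [t + rng.gauss(0, noise_sd) for t in true_scores]


# Administer the two-item "scale" on two occasions.
s1, b1 = administer(shoelace), administer(blue)
s2, b2 = administer(shoelace), administer(blue)

total1 = [s + b for s, b in zip(s1, b1)]
total2 = [s + b for s, b in zip(s2, b2)]

alpha = cronbach_alpha([s1, b1])
retest = pearson(total1, total2)
print(f"alpha = {alpha:.2f}, test-retest r = {retest:.2f}")
```

Under these assumptions, alpha should hover near zero while the test-retest correlation of the total score sits close to 1: the scale is internally incoherent but measured very reliably.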

In fact, we can push this line of thought even further, and say that the perfect measure—in the sense of maximizing both reliability and brevity—should actually have an internal consistency of exactly zero. A value any higher than zero would imply the presence of redundancy between items, which in turn would suggest that we could (at least in theory, though typically not in practice) get rid of one or more items without reducing the amount of variance captured by the measure as a whole.

To use a spatial analogy, suppose we think of each of our measure’s items as a circle in a 2-dimensional space:

[Figure: 20 heavily overlapping circles that leave much of the 2D space uncovered]

Here, our goal is to cover the maximum amount of territory using the smallest number of circles (analogous to capturing as much variance in participant responses as possible using the fewest number of items). By this light, the solution in the above figure is kind of crummy, because it fails to cover much of the space despite having 20 circles to work with. The obvious problem is that there’s a lot of redundancy between the circles—many of them overlap in space. A more sensible arrangement, assuming we insisted on keeping all 20 circles, would look like this:

[Figure: the same 20 circles realigned to minimize overlap, covering the whole target space]

In this case we get complete coverage of the target space just by realigning the circles to minimize overlap.

Alternatively, we could opt to cover more or less the same territory as the first arrangement, but using many fewer circles (in this case, 10):

[Figure: abbreviated layout, with 10 circles covering roughly the same territory as the first arrangement]

It turns out that what goes for our toy example in 2D space also holds for self-report measurement of psychological constructs that exist in much higher dimensions. For example, suppose we’re interested in developing a new measure of Extraversion, broadly construed. We want to make sure our measure covers multiple aspects of Extraversion—including sociability, increased sensitivity to reward, assertiveness, talkativeness, and so on. So we develop a fairly large item pool, and then we iteratively select groups of items that (a) have good face validity as Extraversion measures, (b) predict external criteria we think Extraversion should predict (predictive validity), and (c) tend to correlate with each other modestly-to-moderately. At some point we end up with a measure that satisfies all of these criteria, and then presumably we can publish our measure and go on to achieve great fame and fortune.

So far, so good—we’ve done everything by the book. But notice something peculiar about the way the book would have us do things: the very fact that we strive to maintain reasonably solid correlations between our items actually makes our measurement approach much less efficient. To return to our spatial analogy, it amounts to insisting that our circles have to have a high degree of overlap, so that we know for sure that we’re actually measuring what we think we’re measuring. And to be fair, we do gain something for our trouble, in the sense that we can look at our little plot above and say, a-yup, we’re definitely covering that part of the space. But we also lose something, in that we waste a lot of items (or circles) trying to cover parts of the space that have already been covered by other items.

Why would we do something so inefficient? Well, the problem is that in the real world—unlike in our simple little 2D world—we don’t usually know ahead of time exactly what territory we need to cover. We probably have a fuzzy idea of our Extraversion construct, and we might have a general sense that, you know, we should include both reward-related and sociability-related items. But it’s not as if there’s a definitive and unambiguous answer to the question “what behaviors are part of the Extraversion construct?”. There’s a good deal of variation in human behavior that could in principle be construed as part of the latent Extraversion construct, but that in practice is likely to be overlooked (or deliberately omitted) by any particular measure of Extraversion. So we have to carefully explore the space. And one reasonable way to determine whether any given item within that space is still measuring Extraversion is to inspect its correlations with other items that we consider to be unambiguous Extraversion items. If an item correlates, say, 0.5 with items like “I love big parties” and “I constantly seek out social interactions”, there’s a reasonable case to be made that it measures at least some aspects of Extraversion. So we might decide to keep it in our measure. Conversely, if an item shows very low correlations with other putative Extraversion items, we might incline to throw it out.

Now, there’s nothing intrinsically wrong with this strategy. But what’s important to realize is that, once we’ve settled on a measure we’re happy with, there’s no longer a good reason to keep all of that redundancy hanging around. It may be useful when we first explore the territory, but as soon as we yell out FIN! and put down our protractors and levels (or whatever it is the kids are using to create new measures these days), it’s now just costing us time and money by making data collection less efficient. We would be better off saying something like, hey, now that we know what we’re trying to measure, let’s see if we can measure it equally well with fewer items. And at that point, we’re in the land of criterion-based measure development, where the primary goal is to predict some target criterion as accurately as possible, foggy notions of internal consistency be damned.

Unfortunately, committing ourselves fully to the noble and just cause of more efficient measurement still leaves open the question of just how we should go about eliminating items from our overly long measures. For that, you’ll have to stay tuned for Part II, wherein I use many flowery words and some concise Python code to try to convince you that this piece of software provides one reasonable way to go about it.

* On a tangential note, this is why traditional pre-publication peer review isn’t very effective, and is in dire need of replacement. Meta-analytic estimates put the inter-reviewer reliability across fields at around .2 to .3, and it’s rare to have more than two or three reviewers on a paper. No psychometrician would recommend evaluating people’s performance in high-stakes situations with just two items that have a ~.3 correlation, yet that’s how we evaluate nearly all of the scientific literature!
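To put rough numbers on that footnote, here is a small sketch using the Spearman-Brown formula, which gives the reliability of an average of n raters whose ratings intercorrelate at r. It assumes reviewers behave like interchangeable (parallel) raters, which is a simplification, and the function names are illustrative only:

```python
def spearman_brown(n_raters: int, r: float) -> float:
    """Reliability of the mean of n_raters ratings that intercorrelate at r."""
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(r: float, target: float) -> int:
    """Smallest number of raters whose pooled reliability reaches target."""
    n = 1
    while spearman_brown(n, r) < target:
        n += 1
    return n


for r in (0.2, 0.3):
    for n in (2, 3):
        print(f"r = {r}, {n} reviewers: reliability = {spearman_brown(n, r):.2f}")

print("reviewers needed for .8 at r = .3:", raters_needed(0.3, 0.8))
```

With two or three reviewers, the pooled reliability lands somewhere between roughly .33 and .56, and even at the optimistic r = .3 it would take 10 reviewers to reach the conventional .8.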

The reviewer’s dilemma, or why you shouldn’t get too meta when you’re supposed to be writing a review that’s already overdue

When I review papers for journals, I often find myself facing something of a tension between two competing motives. On the one hand, I’d like to evaluate each manuscript as an independent contribution to the scientific literature–i.e., without having to worry about how the manuscript stacks up against other potential manuscripts I could be reading. The rationale being that the plausibility of the findings reported in a manuscript shouldn’t really depend on what else is being published in the same journal, or in the field as a whole: if there are methodological problems that threaten the conclusions, they shouldn’t become magically more or less problematic just because some other manuscript has (or doesn’t have) gaping holes. Reviewing should simply be a matter of documenting one’s major concerns and suggestions and sending them back to the Editor for infallible judgment.

The trouble with this idea is that if you’re of a fairly critical bent, you probably don’t believe the majority of the findings reported in the manuscripts sent to you to review. Empirically, this actually appears to be the right attitude to hold, because as a good deal of careful work by biostatisticians like John Ioannidis shows, most published research findings are false, and most true associations are inflated. So, in some ideal world, where the job of a reviewer is simply to assess the likelihood that the findings reported in a paper provide an accurate representation of reality, and/or to identify ways of bringing those findings closer in line with reality, skepticism is the appropriate default attitude. Meaning, if you keep the question “why don’t I believe these results?” firmly in mind as you read through a paper and write your review, you probably aren’t going to go wrong all that often.

The problem is that, for better or worse, one’s job as a reviewer isn’t really–or at least, solely–to evaluate the plausibility of other people’s findings. In large part, it’s to evaluate the plausibility of reported findings in relation to the other stuff that routinely gets published in the same journal. For instance, if you regularly review papers for a very low-tier journal, the editor is probably not going to be very thrilled to hear you say “well, Ms. Editor, none of the last 15 papers you’ve sent me are very good, so you should probably just shut down the journal.” So a tension arises between writing a comprehensive review that accurately captures what the reviewer really thinks about the results–which is often (at least in my case) something along the lines of “pffft, there’s no fucking way this is true”–and writing a review that weighs the merits of the reviewed manuscript relative to the other candidates for publication in the same journal.

To illustrate, suppose I review a paper and decide that, in my estimation, there’s only a 20% chance the key results reported in the paper would successfully replicate (for the sake of argument, we’ll pretend I’m capable of this level of precision). Should I recommend outright rejection? Maybe, since 1 in 5 odds of long-term replication don’t seem very good. But then again, what if 20% is actually better than average? What if I think the average article I’m sent to review only has a 10% chance of holding up over time? In that case, if I recommend rejection of the 20% article, and the editor follows my recommendation, most of the time I’ll actually be contributing to the journal publishing poorer quality articles than if I’d recommended accepting the manuscript, even if I’m pretty sure the findings reported in the manuscript are false.

Lest this sound like I’m needlessly overanalyzing the review process instead of buckling down and writing my own overdue reviews (okay, you’re right, now stop being a jerk), consider what happens when you scale the problem up. When journal editors send reviewers manuscripts to look over, the question they really want an answer to is, “how good is this paper compared to everything else that crosses my desk?” But most reviewers naturally incline to answer a somewhat different–and easier–question, namely, “in the grand scheme of life, the universe, and everything, how good is this paper?” The problem, then, is that if the variance in curmudgeonliness between reviewers exceeds the (reliable) variance within reviewers, then arguably the biggest factor in determining whether or not a given paper gets rejected is simply who happens to review it. Not how much expertise the reviewer has, or even how ‘good’ they are (in the sense that some reviewers are presumably better than others at identifying serious problems and overlooking trivial ones), but simply how critical they are on average. Which is to say, if I’m Reviewer 2 on your manuscript, you’ll probably have a better chance of rejection than if Reviewer 2 is someone who characteristically writes one-paragraph reviews that begin with the words “this is an outstanding and important piece of work…”
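A toy simulation makes the point concrete. The model and every number in it are assumptions chosen purely for illustration, not estimates from real review data: each review score is the paper's quality, minus the reviewer's characteristic severity, plus some noise, and reviewers are assumed to differ from one another more than papers do.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)
n_reviews = 5000

# Assumed standard deviations (illustrative, not estimated from data).
QUALITY_SD = 1.0   # how much papers really differ in quality
SEVERITY_SD = 1.5  # how much reviewers differ in average harshness
NOISE_SD = 0.5     # everything else

# Each review pairs a randomly drawn paper with a randomly drawn reviewer.
quality = [rng.gauss(0, QUALITY_SD) for _ in range(n_reviews)]
severity = [rng.gauss(0, SEVERITY_SD) for _ in range(n_reviews)]

# Better papers score higher; harsher reviewers score everything lower.
rating = [q - s + rng.gauss(0, NOISE_SD) for q, s in zip(quality, severity)]

r_quality = pearson(rating, quality)
r_leniency = pearson(rating, [-s for s in severity])
print(f"rating vs. paper quality:     r = {r_quality:.2f}")
print(f"rating vs. reviewer leniency: r = {r_leniency:.2f}")
```

Under these assumptions, a review score tracks who wrote the review noticeably better than it tracks the paper being reviewed. Flip the two standard deviations and the pattern reverses, which is the whole point: the conclusion depends on how much reviewers actually differ.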

Anyway, on some level this is a pretty trivial observation; after all, we all know that the outcome of the peer review process is, to a large extent, tantamount to a roll of the dice. We know that there are cranky reviewers and friendly reviewers, and we often even have a sense of who they are, which is why we often suggest people to include or exclude as reviewers in our cover letters. The practical question though–and the reason for bringing this up here–is this: given that we have this obvious and ubiquitous problem of reviewers having different standards for what’s publishable, and that this undeniably impacts the outcome of peer review, are there any simple steps we could take to improve the reliability of the review process?

The way I’ve personally made peace between my desire to provide the most comprehensive and accurate review I can and the pragmatic need to evaluate each manuscript in relation to other manuscripts is to use the “comments to the Editor” box to provide some additional comments about my review. Usually what I end up doing is writing my review with little or no thought for practical considerations such as “how prestigious is this journal” or “am I a particularly harsh reviewer” or “is this a better or worse paper than most others in this journal”. Instead, I just write my review, and then when I’m done, I use the comments to the editor to say things like “I’m usually a pretty critical reviewer, so don’t take the length of my review as an indication I don’t like the manuscript, because I do,” or, “this may seem like a negative review, but it’s actually more positive than most of my reviews, because I’m a huge jerk.” That way I can appease my conscience by writing the review I want to while still giving the editor some indication as to where I fit in the distribution of reviewers they’re likely to encounter.

I don’t know if this approach makes any difference at all, and maybe editors just routinely ignore this kind of thing; it’s just the best solution I’ve come up with that I can implement all by myself, without asking anyone else to change their behavior. But if we allow ourselves to contemplate alternative approaches that include changes to the review process itself (while still adhering to the standard pre-publication review model, which, like many other people, I’ve argued is fundamentally dysfunctional), then there are many other possibilities.

One idea, for instance, would be to include calibration questions that could be used to estimate (and correct for) individual differences in curmudgeonliness. For instance, in addition to questions about the merit of the manuscript itself, the review form could have a question like “what proportion of articles you review do you estimate end up being rejected?” or “do you consider yourself a more critical or less critical reviewer than most of your peers?”
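To make the calibration idea a bit more concrete, here is a minimal sketch of one possible correction. It assumes (and this is purely an assumption, not something any journal I know of does) that reviewers give a numeric rating on a 1–5 scale and that the journal keeps each reviewer's past ratings. The function name and the toy histories are made up for illustration.

```python
from statistics import mean, stdev

def calibrated_score(rating, past_ratings):
    """Express a rating in SD units relative to the reviewer's own history."""
    mu = mean(past_ratings)
    sd = stdev(past_ratings)
    if sd == 0:
        return 0.0  # reviewer always gives the same rating; nothing to calibrate
    return (rating - mu) / sd

# Hypothetical rating histories (1 = reject outright, 5 = accept as is).
harsh_history = [1, 2, 2, 3, 2]
lenient_history = [4, 5, 4, 4, 3]

# The same raw rating of 3 means very different things from these two reviewers.
harsh = calibrated_score(3, harsh_history)
lenient = calibrated_score(3, lenient_history)
print(f"harsh reviewer: {harsh:+.2f} SD, lenient reviewer: {lenient:+.2f} SD")
```

A raw "3" comes out above the harsh reviewer's personal average and below the lenient reviewer's, which is exactly the information an editor is missing when all they see is the raw recommendation.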

Another, logistically more difficult, idea would be to develop a centralized database of review outcomes, so that editors could see what proportion of each reviewer’s assignments ultimately end up being rejected (though they couldn’t see the actual content of the reviews). I don’t know if this type of approach would improve matters at all; it’s quite possible that the review process is fundamentally so inefficient and slow that editors just don’t have the time to spend worrying about this kind of thing. But it’s hard to believe that there aren’t some simple calibration steps we could take to bring reviewers into closer alignment with one another–even if we’re confined to working within the standard pre-publication model of peer review. And given the abysmally low reliability of peer review, even small improvements could potentially produce large benefits in the aggregate.

what aspirin can tell us about the value of antidepressants

There’s a nice post on Science-Based Medicine by Harriet Hall pushing back (kind of) against the increasingly popular idea that antidepressants don’t work. For context, there have been a couple of large recent meta-analyses that used comprehensive FDA data on clinical trials of antidepressants (rather than only published studies, which are biased towards larger, statistically significant, effects) to argue that antidepressants are of little or no use in mildly or moderately depressed people, and achieve a clinically meaningful benefit only in the severely depressed.

Hall points out that whether you think antidepressants have a clinically meaningful benefit or not depends on how you define clinically meaningful (okay, this sounds vacuous, but bear with me). Most meta-analyses of antidepressant efficacy reveal an effect size of somewhere between 0.3 and 0.5 standard deviations. Historically, psychologists consider effect sizes of 0.2, 0.5, and 0.8 standard deviations to be small, medium, and large, respectively. But as Hall points out:

The psychologist who proposed these landmarks [Jacob Cohen] admitted that he had picked them arbitrarily and that they had “no more reliable a basis than my own intuition.” Later, without providing any justification, the UK’s National Institute for Health and Clinical Excellence (NICE) decided to turn the 0.5 landmark (why not the 0.2 or the 0.8 value?) into a one-size-fits-all cut-off for clinical significance.

She goes on to explain why this ultimately leaves the efficacy of antidepressants open to interpretation:

In an editorial published in the British Medical Journal (BMJ), Turner explains with an elegant metaphor: journal articles had sold us a glass of juice advertised to contain 0.41 liters (0.41 being the effect size Turner, et al. derived from the journal articles); but the truth was that the “glass” of efficacy contained only 0.31 liters. Because these amounts were lower than the (arbitrary) 0.5 liter cut-off, NICE standards (and Kirsch) consider the glass to be empty. Turner correctly concludes that the glass is far from full, but it is also far from empty. He also points out that patients’ responses are not all-or-none and that partial responses can be meaningful.

I think this pretty much hits the nail on the head; no one really doubts that antidepressants work at this point; the question is whether they work well enough to justify their side effects and the social and economic costs they impose. I don’t have much to add to Hall’s argument, except that I think she doesn’t sufficiently emphasize how big a role scale plays when trying to evaluate the utility of antidepressants (or any other treatment). At the level of a single individual, a change of one-third of a standard deviation may not seem very big (then again, if you’re currently depressed, it might!). But on a societal scale, even canonically ‘small’ effects can have very large consequences in the aggregate.

The example I’m most fond of here is Robert Rosenthal’s famous illustration of the effects of aspirin on heart attack. The correlation between taking aspirin daily and decreased risk of heart attack is, at best, .03 (I say at best because the estimate is based on a large 1988 study, but my understanding is that more recent studies have moderated even this small effect). In most domains of psychology, a correlation of .03 is so small as to be completely uninteresting. Most psychologists would never seriously contemplate running a study to try to detect an effect of that size. And yet, at a population level, even an r of .03 can have serious implications. Cast in a different light, what this effect means is that 3% of people who would be expected to have a heart attack without aspirin would be saved from that heart attack given a daily aspirin regimen. Needless to say, this isn’t trivial. It amounts to a potentially life-saving intervention for 30 out of every 1,000 people. At a public policy level, you’d be crazy to ignore something like that (which is why, for a long time, many doctors recommended that people take an aspirin a day). And yet, by the standards of experimental psychology, this is a tiny, tiny effect that probably isn’t worth getting out of bed for.
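One way to arrive at the 30-in-1,000 figure is Rosenthal and Rubin's binomial effect size display (BESD), which recasts a correlation r as the difference between two "success" rates of 0.5 + r/2 and 0.5 - r/2. The sketch below just does that arithmetic; treating the BESD as the route to the number above is an assumption on my part, and the function name is made up.

```python
def besd(r):
    """Binomial effect size display: (treatment_rate, control_rate) for a given r."""
    return 0.5 + r / 2, 0.5 - r / 2

treated, control = besd(0.03)

# Difference in success rates, expressed per 1,000 people.
extra_per_1000 = round((treated - control) * 1000)
print(f"{treated:.3f} vs {control:.3f}: about {extra_per_1000} per 1,000")
```

The BESD is a deliberately stylized display (it fixes both margins at 50/50), so it is best read as a way of conveying the size of a correlation rather than as an estimate of absolute risk in any real population.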

The point of course is that when you consider how many people are currently on antidepressants (millions), even small effects–and certainly an effect of one-third of a standard deviation–are going to be compounded many times over. Given that antidepressants demonstrably reduce the risk of suicide (according to Hall, by about 20%), there’s little doubt that tens of thousands of lives have been saved by antidepressants. That doesn’t necessarily justify their routine use, of course, because the side effects and costs also scale up to the societal level (just imagine how many millions of bouts of nausea could be prevented by eliminating antidepressants from the market!). The point is just that, if you think the benefits of antidepressants outweigh their costs even slightly at the level of the average depressed individual, you’re probably committing yourself to thinking that they have a hugely beneficial impact at a societal level–and that holds true irrespective of whether the effects are ‘clinically meaningful’ by conventional standards.

some people are irritable, but everyone likes to visit museums: what personality inventories tell us about how we’re all just like one another

I’ve recently started recruiting participants for online experiments via Mechanical Turk. In the past I’ve always either relied on directory listings (like this one) or targeted specific populations (e.g., bloggers and twitterers) via email solicitation. But recently I’ve started running a very large-sample decision-making study (it’s here, if you care to contribute to the sample), and waiting for participants to trickle in via directories isn’t cutting it. So I’ve started paying people (very) small amounts of money for participation.

One challenge I’ve had to deal with is figuring out how to filter out participants who aren’t really interested in contributing to science, and are strictly in it for the money. 20 or 30 cents is a pittance to most people in the developed world, but as I’ve found out the hard way, gaming MTurk appears to be a thriving business in some developing countries (some of which I’ve unfortunately had to resort to banning entirely). Cheaters aren’t so much of an issue for very quick tasks like providing individual ratings of faces, because (a) the time it takes to give a fake rating isn’t substantially greater than giving one’s actual opinion, and (b) the standards for what counts as accurate performance are clear, so it’s easy to train workers and weed out the bad apples. Unfortunately, my studies generally involve fairly long personality questionnaires combined with other cognitive tasks (e.g., in the current study, you get to repeatedly allocate hypothetical money between yourself and a computer partner, and rate some faces). They often take around half an hour, and involve 20+ questions per screen, so there’s a pretty big incentive for workers who are only in it for the cash to produce random responses and try to increase their effective wage. And the obvious question then is how to detect cheating in the data.

One of the techniques I’ve found works surprisingly well is to simply compare each person’s pattern of responses across items with the mean for the entire sample. In other words, you just compute the correlation between each individual’s item scores and the mean item scores across everyone who’s filled out the same measure. I know that there’s an entire literature on this stuff full of much more sophisticated ways to detect random responding, but I find this crude approach really does quite well (I’ve verified this by comparing it with a bunch of other similar metrics), and has the benefit of being trivial to implement.
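For the curious, here is roughly what that looks like as code. This is a minimal sketch, not the script used for the actual study: the function names, the toy responses, and especially the 0.2 cutoff are all made up for illustration.

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant responder: correlation undefined, treat as 0
    return sxy / (sxx * syy) ** 0.5

def profile_correlations(responses):
    """Correlate each person's item scores with the sample's item means."""
    item_means = [mean(column) for column in zip(*responses)]
    return [pearson(person, item_means) for person in responses]

# Hypothetical toy data: five people, six items, 1-5 Likert responses.
responses = [
    [5, 4, 5, 1, 2, 1],
    [4, 4, 5, 2, 1, 1],
    [5, 5, 4, 1, 2, 2],
    [4, 5, 5, 2, 2, 1],
    [2, 4, 1, 5, 3, 4],  # looks nothing like everyone else
]

CUTOFF = 0.2  # arbitrary illustrative threshold, not an empirically derived one
cors = profile_correlations(responses)
flagged = [i for i, r in enumerate(cors) if r < CUTOFF]
print([round(r, 2) for r in cors], "flagged:", flagged)
```

Note that each person's own responses are included in the item means here. With hundreds of respondents that barely matters, but in a tiny sample like this toy one it nudges every correlation upward a little, so a leave-one-out mean would be the more careful choice.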

Anyway, one of the things that surprised me when I first computed these correlations is just how strong the relationship between the sample mean and most individuals’ responses is. Here’s what the distribution looks like for one particular inventory, the 181-item Analog to Multiple Broadband Inventories (AMBI, which I introduced in this paper, and discuss further here):

This is based on a sample of about 600 internet respondents, which actually turns out to be pretty representative of the broader population, as Sam Gosling, Simine Vazire, and Sanjay Srivastava will tell you (for what it’s worth, I’ve done the exact same analysis on a similar-sized off-line dataset from Lew Goldberg’s Eugene-Springfield Community Sample (check out that URL!) and obtained essentially the same results). In this sample, the median correlation is .48; so, in effect, you can predict roughly a quarter of the variance in a typical participant’s responses without knowing anything at all about them. Human beings, it turns out, have some things in common with one another (who knew?). What you think you’re like is probably not very dissimilar to what I think I’m like. Which is kind of surprising, considering you’re a well-adjusted, friendly human being, and I’m a somewhat eccentric, paranoid kind of guy.

What drives that similarity? Much of it probably has to do with social desirability–i.e., many of the AMBI items (and those on virtually all personality inventories) are evaluatively positive or negative statements that most people are inclined to strongly agree or disagree with. But it seems to be a particular kind of social desirability–one that has to do with openness to new experiences, and particularly intellectual ones. For instance, here are the top 10 most endorsed items (based on mean Likert scores across the entire sample; scores are in parentheses):

  1. like to read (4.62)
  2. like to visit new places (4.39)
  3. was a better than average student when I was in school (4.28)
  4. am a good listener (4.25)
  5. would love to explore strange places (4.22)
  6. am concerned about others (4.2)
  7. am open to new experiences (4.18)
  8. amuse my friends (4.16)
  9. love excitement (4.08)
  10. spend a lot of time reading (4.07)

And conversely, here are the 10 least-endorsed items:

  1. was a slow learner in school (1.52)
  2. don’t think that laws apply to me (1.8)
  3. do not like to visit museums (1.83)
  4. have difficulty imagining things (1.84)
  5. have no special urge to do something original (1.87)
  6. do not like art (1.95)
  7. feel little concern for others (1.97)
  8. don’t try to figure myself out (2.01)
  9. break my promises (2.01)
  10. make enemies (2.06)

You can see a clear evaluative component in both lists: almost everyone believes that they’re concerned about others and thinks that they’re smarter than average. But social desirability and positive illusions aren’t enough to explain these patterns, because there are plenty of other items on the AMBI that have an equally strong evaluative component–for instance, “don’t have much energy”, “cannot imagine lying or cheating”, “see myself as a good leader”, and “am easily annoyed”–yet have mean scores pretty close to the midpoint (in fact, the item ‘am easily annoyed’ is endorsed more highly than 107 of the 181 items!). So it isn’t just that we like to think and say nice things about ourselves; we’re willing to concede that we have some bad traits, but maybe not the ones that have to do with disliking cultural and intellectual experiences. I don’t have much of an idea as to why that might be, but it does introspectively feel to me like there’s more of a stigma about, say, not liking to visit new places or experience new things than admitting that you’re kind of an irritable person. Or maybe it’s just that many of the openness items can be interpreted more broadly than the other evaluative items–e.g., there are lots of different art forms, so almost everyone can endorse a generic “I like art” statement. I don’t really know.

Anyway, there’s nothing the least bit profound about any of this; if anything, it’s just a nice reminder that most of us are not really very good at evaluating where we stand in relation to other people, at least for many traits (for more on that, go read Simine Vazire’s work). The nominal midpoint on most personality scales is usually quite far from the actual median in the general population. This is a pretty big challenge for personality psychology, and if we could figure out how to get people to rank themselves more accurately relative to other people on self-report measures, that would be a pretty huge advance. But it seems quite likely that you just can’t do it, because people simply may not have introspective access to that kind of information.

Fortunately for our ability to measure individual differences in personality, there are plenty of items that do show considerable variance across individuals (actually, in fairness, even items with relatively low variance like the ones above can be highly discriminative if used properly–that’s what item response theory is for). Just for kicks, here are the 10 AMBI items with the largest standard deviations (in parentheses):

  1. disliked math in school (1.56)
  2. wanted to run away from home when I was a child (1.56)
  3. believe in a universal power or god (1.53)
  4. have felt contact with a divine power (1.51)
  5. rarely cry during sad movies (1.46)
  6. am able to fix electrical-wiring problems (1.46)
  7. am devoted to religion (1.44)
  8. shout or scream when I’m angry (1.43)
  9. love large parties (1.42)
  10. felt close to my parents when I was a child (1.42)
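All three lists in this post (most endorsed, least endorsed, most variable) come from the same simple operation: compute a summary statistic per item and sort. A minimal sketch, with made-up toy responses rather than the real AMBI data and a hypothetical helper name:

```python
from statistics import mean, stdev

# Hypothetical toy data: item text -> Likert responses (1-5) from five people.
responses = {
    "like to read": [5, 5, 4, 5, 4],
    "make enemies": [1, 2, 2, 1, 3],
    "love large parties": [1, 5, 2, 5, 3],
}

def rank_items(data, stat):
    """Return (item, value) pairs sorted from highest to lowest on `stat`."""
    scored = [(item, stat(scores)) for item, scores in data.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

most_endorsed = rank_items(responses, mean)    # reverse this for least endorsed
most_variable = rank_items(responses, stdev)   # largest standard deviation first

for item, sd in most_variable:
    print(f"{item} ({sd:.2f})")
```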

So now finally we come to the real moral of this post… that which you’ve read all this long way for. And the moral is this, grasshopper: if you want to successfully pick a fight at a large party, all you need to do is angrily yell at everyone that God told you math sucks.

Too much p = .048? Towards partial automation of scientific evaluation

Distinguishing good science from bad science isn’t an easy thing to do. One big problem is that what constitutes ‘good’ work is, to a large extent, subjective; I might love a paper you hate, or vice versa. Another problem is that science is a cumulative enterprise, and the value of each discovery is, in some sense, determined by how much of an impact that discovery has on subsequent work–something that often only becomes apparent years or even decades after the fact. So, to an uncomfortable extent, evaluating scientific work involves a good deal of guesswork and personal preference, which is probably why scientists tend to fall back on things like citation counts and journal impact factors as tools for assessing the quality of someone’s work. We know it’s not a great way to do things, but it’s not always clear how else we could do better.

Fortunately, there are many aspects of scientific research that don’t depend on subjective preferences or require us to suspend judgment for ten or fifteen years. In particular, methodological aspects of a paper can often be evaluated in a (relatively) objective way, and strengths or weaknesses of particular experimental designs are often readily discernible. For instance, in psychology, pretty much everyone agrees that large samples are generally better than small samples, reliable measures are better than unreliable measures, representative samples are better than WEIRD ones, and so on. The trouble when it comes to evaluating the methodological quality of most work isn’t so much that there’s rampant disagreement between reviewers (though it does happen), it’s that research articles are complicated products, and the odds of any individual reviewer having the expertise, motivation, and attention span to catch every major methodological concern in a paper are exceedingly small. Since only two or three people typically review a paper pre-publication, it’s not surprising that in many cases, whether or not a paper makes it through the review process depends as much on who happened to review it as on the paper itself.

A nice example of this is the Bem paper on ESP I discussed here a few weeks ago. I think most people would agree that things like data peeking, lumping and splitting studies, and post-hoc hypothesis testing–all of which are apparent in Bem’s paper–are generally not good research practices. And no doubt many potential reviewers would have noted these and other problems with Bem’s paper had they been asked to review it. But as it happens, the actual reviewers didn’t note those problems (or at least, not enough of them), so the paper was accepted for publication.

I’m not saying this to criticize Bem’s reviewers, who I’m sure all had a million other things to do besides pore over the minutiae of a paper on ESP (and for all we know, they could have already caught many other problems with the paper that were subsequently addressed before publication). The problem is a much more general one: the pre-publication peer review process in psychology, and many other areas of science, is pretty inefficient and unreliable, in the sense that it draws on the intense efforts of a very few, semi-randomly selected, individuals, as opposed to relying on a much broader evaluation by the community of researchers at large.

In the long term, the best solution to this problem may be to fundamentally rethink the way we evaluate scientific papers–e.g., by designing new platforms for post-publication review of papers (e.g., see this post for more on efforts towards that end). I think that’s far and away the most important thing the scientific community could do to improve the quality of scientific assessment, and I hope we ultimately will collectively move towards alternative models of review that look a lot more like the collaborative filtering systems found on, say, reddit or Stack Overflow than like peer review as we now know it. But that’s a process that’s likely to take a long time, and I don’t profess to have much of an idea as to how one would go about kickstarting it.

What I want to focus on here is something much less ambitious, but potentially still useful–namely, the possibility of automating the assessment of at least some aspects of research methodology. As I alluded to above, many of the factors that help us determine how believable a particular scientific finding is are readily quantifiable. In fact, in many cases, they’re already quantified for us. Sample sizes, p values, effect sizes, coefficient alphas… all of these things are, in one sense or another, indices of the quality of a paper (however indirect), and are easy to capture and code. And many other things we care about can be captured with only slightly more work. For instance, if we want to know whether the authors of a paper corrected for multiple comparisons, we could search for strings like “multiple comparisons”, “uncorrected”, “Bonferroni”, and “FDR”, and probably come away with a pretty decent idea of what the authors did or didn’t do to correct for multiple comparisons. It might require a small dose of technical wizardry to do this kind of thing in a sensible and reasonably accurate way, but it’s clearly feasible–at least for some types of variables.
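The string-search part, at least, takes very little wizardry. Here is a minimal sketch using the four search terms mentioned above; the function name and the sample sentence are invented for illustration, and a real tool would obviously need to handle negation, context, and full-text extraction from PDFs.

```python
import re

CORRECTION_TERMS = ["multiple comparisons", "uncorrected", "Bonferroni", "FDR"]

def count_terms(text, terms=CORRECTION_TERMS):
    """Count case-insensitive, whole-word occurrences of each term in `text`."""
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Made-up methods-section snippet.
methods = (
    "All tests were Bonferroni-corrected for multiple comparisons. "
    "Uncorrected results appear in Table S2."
)
counts = count_terms(methods)
print(counts)
```

The word-boundary anchors keep a term like “FDR” from matching inside a longer token, but they do nothing about meaning: “we did not correct for multiple comparisons” would count exactly the same as the opposite statement.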

Once we extracted a bunch of data about the distribution of p values and sample sizes from many different papers, we could then start to do some interesting (and potentially useful) things, like generating automated metrics of research quality. For instance:

  • In multi-study articles, the variance in sample size across studies could tell us something useful about the likelihood that data peeking is going on (for an explanation as to why, see this). Other things being equal, an article with 9 studies with identical sample sizes is less likely to be capitalizing on chance than one containing 9 studies that range in sample size between 50 and 200 subjects (as the Bem paper does), so high variance in sample size could be used as a rough index for proclivity to peek at the data.
  • Quantifying the distribution of p values found in an individual article or an author’s entire body of work might be a reasonable first-pass measure of the amount of fudging (usually inadvertent) going on. As I pointed out in my earlier post, it’s interesting to note that with only one or two exceptions, virtually all of Bem’s statistically significant results come very close to p = .05. That’s not what you expect to see when hypothesis testing is done in a really principled way, because it’s exceedingly unlikely that a researcher would be so lucky as to always just barely obtain the expected result. But a bunch of p = .03 and p = .048 results are exactly what you expect to find when researchers test multiple hypotheses and report only the ones that produce significant results.
  • The presence or absence of certain terms or phrases is probably at least slightly predictive of the rigorousness of the article as a whole. For instance, the frequent use of phrases like “cross-validated”, “statistical power”, “corrected for multiple comparisons”, and “unbiased” is probably a good sign (though not necessarily a strong one); conversely, terms like “exploratory”, “marginal”, and “small sample” might provide at least some indication that the reported findings are, well, exploratory.
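The first two metrics in this list are easy enough to sketch once the sample sizes and p values have been extracted. The code below is only an illustration under my own assumptions: the coefficient of variation as the measure of sample-size spread, the .025 lower bound for “close to .05”, the function names, and the toy numbers are all arbitrary choices, not validated ones.

```python
from statistics import mean, pstdev

def sample_size_cv(ns):
    """Coefficient of variation of per-study sample sizes (0 = all identical)."""
    return pstdev(ns) / mean(ns)

def share_in_band(p_values, lower=0.025, upper=0.05):
    """Share of significant p values (p < upper) that fall in [lower, upper)."""
    significant = [p for p in p_values if p < upper]
    if not significant:
        return 0.0
    near = [p for p in significant if p >= lower]
    return len(near) / len(significant)

# Hypothetical multi-study articles.
uniform_ns = [100] * 9
variable_ns = [50, 100, 150, 200]
reported_p = [0.048, 0.03, 0.041, 0.012, 0.2]

print(sample_size_cv(uniform_ns), round(sample_size_cv(variable_ns), 2))
print(share_in_band(reported_p))
```

In this toy case three of the four significant results sit just under .05, which is the kind of pattern that might be worth a second look, though a single article's worth of p values is far too few to conclude anything.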

These are just the first examples that come to mind; you can probably think of other better ones. Of course, these would all be pretty weak indicators of paper (or researcher) quality, and none of them are in any sense unambiguous measures. There are all sorts of situations in which such numbers wouldn’t mean much of anything. For instance, high variance in sample sizes would be perfectly justifiable in a case where researchers were testing for effects expected to have very different sizes, or conducting different kinds of statistical tests (e.g., detecting interactions is much harder than detecting main effects, and so necessitates larger samples). Similarly, p values close to .05 aren’t necessarily a marker of data snooping and fishing expeditions; it’s conceivable that some researchers might be so good at what they do that they can consistently design experiments that just barely manage to show what they’re intended to (though it’s not very plausible). And a failure to use terms like “corrected”, “power”, and “cross-validated” in a paper doesn’t necessarily mean the authors failed to consider important methodological issues, since such issues aren’t necessarily relevant to every single paper. So there’s no question that you’d want to take these kinds of metrics with a giant lump of salt.

Still, there are several good reasons to think that even relatively flawed automated quality metrics could serve an important purpose. First, many of the problems could be overcome to some extent through aggregation. You might not want to conclude that a particular study was poorly done simply because most of the reported p values were very close to .05; but if you were to look at a researcher’s entire body of, say, thirty or forty published articles, and noticed the same trend relative to other researchers, you might start to wonder. Similarly, we could think about composite metrics that combine many different first-order metrics to generate a summary estimate of a paper’s quality that may not be so susceptible to contextual factors or noise. For instance, in the case of the Bem ESP article, a measure that took into account the variance in sample size across studies, the closeness of the reported p values to .05, the mention of terms like ‘one-tailed test’, and so on, would likely not have assigned Bem’s article a glowing score, even if each individual component of the measure was not very reliable.

Second, I’m not suggesting that crude automated metrics would replace current evaluation practices; rather, they’d be used strictly as a complement. Essentially, you’d have some additional numbers to look at, and you could choose to use them or not, as you saw fit, when evaluating a paper. If nothing else, they could help flag potential issues that reviewers might not be spontaneously attuned to. For instance, a report might note the fact that the term “interaction” was used several times in a paper in the absence of “main effect,” which might then cue a reviewer to ask, hey, why you no report main effects? — but only if they deemed it a relevant concern after looking at the issue more closely.

Third, automated metrics could be continually updated and improved using machine learning techniques. Given some criterion measure of research quality, one could systematically train and refine an algorithm capable of doing a decent job recapturing that criterion. Of course, it’s not clear that we really have any unobjectionable standard to use as a criterion in this kind of training exercise (which only underscores why it’s important to come up with better ways to evaluate scientific research). But a reasonable starting point might be to try to predict replication likelihood for a small set of well-studied effects based on the features of the original report. Could you for instance show, in an automated way, that initial effects reported in studies that failed to correct for multiple comparisons or reported p values closer to .05 were less likely to be subsequently replicated?

Of course, as always with this kind of stuff, the rub is that it’s easy to talk the talk and not so easy to walk the walk. In principle, we can make up all sorts of clever metrics, but in practice, it’s not trivial to automatically extract even a piece of information as seemingly simple as sample size from many papers (consider the difference between “Undergraduates (N = 15) participated…” and “Forty-two individuals diagnosed with depression and an equal number of healthy controls took part…”), let alone build sophisticated composite measures that could reasonably well approximate human judgments. It’s all well and good to write long blog posts about how fancy automated metrics could help separate good research from bad, but I’m pretty sure I don’t want to actually do any work to develop them, and you probably don’t either. Still, the potential benefits are clear, and it’s not like this is science fiction–it’s clearly viable on at least a modest scale. So someone should do it… Maybe Elsevier? Jorge Hirsch? Anyone? Bueller? Bueller?
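To see how quickly the simple approach runs out of road, here is a naive regex-based extractor applied to the two example sentences above. It is a minimal sketch: the pattern and function name are my own, and the point is the failure case rather than the success.

```python
import re

# Matches "N = 15", "n=15", etc. Anything written out in words is invisible to it.
N_PATTERN = re.compile(r"\bN\s*=\s*(\d+)", flags=re.IGNORECASE)

def extract_sample_sizes(text):
    """Return every integer that follows an 'N =' style marker in `text`."""
    return [int(match) for match in N_PATTERN.findall(text)]

easy = "Undergraduates (N = 15) participated"
hard = ("Forty-two individuals diagnosed with depression and an equal number "
        "of healthy controls took part")

easy_ns = extract_sample_sizes(easy)
hard_ns = extract_sample_sizes(hard)
print(easy_ns, hard_ns)
```

The first sentence gives up its sample size immediately; the second yields nothing, even though a human reader sees at once that 84 people took part. Handling that case means parsing number words and resolving “an equal number”, which is a much harder problem than pattern matching.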

some thoughtful comments on automatic measure abbreviation

In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:

Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below – I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)

::

1. So this is what it feels like to be replaced by a machine.

I’m not sure if Sanjay thinks this is a good or a bad thing? I guess my own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things like writing papers, and a bad one to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.

2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. However, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here: http://ipip.ori.org/newNEO_FacetsTable.htm) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.

This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but that isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.

One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) was administered on one occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.

The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.

When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.
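To make the capitalizing-on-chance point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data are simulated rather than drawn from the IPIP pools, the sample sizes and the 0.2 loading are arbitrary, and the helper names (`pearson`, `simulate`, `scale_score`) are hypothetical. It picks the ten items that best predict a criterion in one half of the sample, then checks how well the resulting scale does in that same half versus the held-out half.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(n_subjects, n_items, rng):
    """Simulated data: every item is a weak, noisy indicator of one trait."""
    items, criterion = [], []
    for _ in range(n_subjects):
        trait = rng.gauss(0, 1)
        items.append([0.2 * trait + rng.gauss(0, 1) for _ in range(n_items)])
        criterion.append(trait + rng.gauss(0, 1))
    return items, criterion

def scale_score(rows, chosen):
    """Unit-weighted sum of the chosen items for each subject."""
    return [sum(row[j] for j in chosen) for row in rows]

rng = random.Random(0)
n_items = 300
items, criterion = simulate(200, n_items, rng)

# Split the sample in half: one half to build the scale, one to check it.
train_x, valid_x = items[:100], items[100:]
train_y, valid_y = criterion[:100], criterion[100:]

# Pick the ten items that best predict the criterion, using the training half only.
item_r = [pearson([row[j] for row in train_x], train_y) for j in range(n_items)]
chosen = sorted(range(n_items), key=lambda j: item_r[j], reverse=True)[:10]

train_r = pearson(scale_score(train_x, chosen), train_y)  # optimistic estimate
valid_r = pearson(scale_score(valid_x, chosen), valid_y)  # unbiased estimate
print(f"training r = {train_r:.2f}, validation r = {valid_r:.2f}")
```

Because the items are selected and evaluated on the same subjects, the training correlation should come out noticeably higher than the validation one, even though all 300 simulated items are equally good by construction. The gap is produced entirely by the selection step.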

The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.

It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).

In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).
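For anyone who wants to try the item-cost sweep, here is a rough sketch of what it could look like. To keep it short and dependency-free, it swaps the genetic algorithm for a plain greedy forward selection, uses simulated data, and scores a single scale with an assumed cost of the form item_cost * (number of items) + (1 - r^2). The function names and the `item_cost` argument are hypothetical stand-ins and not the interface of the released GA code.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(n_subjects, n_items, rng):
    """Simulated data: every item is a weak, noisy indicator of one trait."""
    items, criterion = [], []
    for _ in range(n_subjects):
        trait = rng.gauss(0, 1)
        items.append([0.3 * trait + rng.gauss(0, 1) for _ in range(n_items)])
        criterion.append(trait + rng.gauss(0, 1))
    return items, criterion

def abbreviate(items, criterion, item_cost):
    """Greedy stand-in for the GA: keep adding the item that most lowers
    cost = item_cost * k + (1 - r**2); stop when no item lowers it."""
    n_items = len(items[0])
    chosen, total, best_r2 = [], [0.0] * len(items), 0.0
    while len(chosen) < n_items:
        candidates = []
        for j in range(n_items):
            if j in chosen:
                continue
            trial = [t + row[j] for t, row in zip(total, items)]
            candidates.append((pearson(trial, criterion) ** 2, j))
        r2, j = max(candidates)
        if r2 - best_r2 <= item_cost:  # the gain in fit no longer pays for the item
            break
        chosen.append(j)
        total = [t + row[j] for t, row in zip(total, items)]
        best_r2 = r2
    return chosen, best_r2 ** 0.5  # selected items and |r| with the criterion

rng = random.Random(1)
items, criterion = simulate(150, 40, rng)

# Same data, same procedure; only the item cost changes between runs.
curve = []
for item_cost in (0.002, 0.01, 0.03):
    chosen, r = abbreviate(items, criterion, item_cost)
    curve.append((item_cost, len(chosen), r))
    print(f"item cost {item_cost}: {len(chosen)} items, convergent r = {r:.2f}")
```

Each entry in `curve` is one point on the length-versus-convergence curve. Note that the correlations here are in-sample, so they are optimistic; a real version would select items in a training half and report convergence in a validation half.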

3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.

[BTW, that last sentence is the key one. I’m looking forward to seeing more of what you and others can do with this approach.]

When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. So to answer basically the same concern, what I did instead was include a supplementary figure showing a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure:

[Figure: big_five_peer. Correlations between the 30 NEO-PI-R facets and the peer-rated Big Five.]

What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e.g., impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.

Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.