Internal consistency is overrated, or How I learned to stop worrying and love shorter measures, Part I

[This is the first of a two-part series motivating and introducing precis, a Python package for automated abbreviation of psychometric measures. In Part I, I motivate the search for shorter measures by arguing that internal consistency is highly overrated. In Part II, I describe some software that makes it relatively easy to act on this newly acquired disregard by gleefully sacrificing internal consistency at the altar of automated abbreviation. If you’re interested in this general topic but would prefer a slightly less ridiculous, more academic treatment, read this paper with Hedwig Eisenbarth and Scott Lilienfeld, or take a look at the demo IPython notebook.]

Developing a new questionnaire measure is a tricky business. There are multiple objectives one needs to satisfy simultaneously. Two important ones are:

  • The measure should be reliable. Validity is bounded by reliability; a highly unreliable measure cannot support valid inferences, and is largely useless as a research instrument.
  • The measure should be as short as is practically possible. Time is money, and nobody wants to sit around filling out a 300-item measure if a 60-item version will do.

Unfortunately, these two objectives are in tension with one another to some degree. Random error averages out as one adds more measurements, so in practice, one of the easiest ways to increase the reliability of a measure is to simply add more items. From a reliability standpoint, it’s often better to have many shitty indicators of a latent construct than a few moderately reliable ones*. For example, Cronbach’s alpha–an index of the internal consistency of a measure–is higher for a 20-item measure with a mean inter-item correlation of 0.1 than for a 5-item measure with a mean inter-item correlation of 0.3.
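To make the arithmetic concrete, here is a minimal sketch using the standardized form of Cronbach’s alpha, alpha = k*r / (1 + (k - 1)*r), where k is the number of items and r is the mean inter-item correlation. The function name standardized_alpha is just an illustrative label, and the formula assumes items with equal variances, so treat this as a back-of-the-envelope check rather than a general-purpose estimator.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha for k items with mean inter-item correlation mean_r.

    Assumes equal item variances (illustrative helper, not part of any library).
    """
    return (k * mean_r) / (1 + (k - 1) * mean_r)


long_weak = standardized_alpha(20, 0.1)    # 20 items, mean r = 0.1
short_strong = standardized_alpha(5, 0.3)  # 5 items, mean r = 0.3

print(f"20 items, mean r = 0.1: alpha = {long_weak:.3f}")
print(f" 5 items, mean r = 0.3: alpha = {short_strong:.3f}")
```

The two values come out at roughly 0.69 and 0.68, so the long measure built from weak indicators edges out the short measure built from better ones.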

Because it’s so easy to increase reliability just by adding items, reporting a certain level of internal consistency is now practically a requirement in order for a measure to be taken seriously. There’s a reasonably widespread view that an adequate level of reliability is somewhere around .8, and that anything below around .6 is just unacceptable. Perhaps as a consequence of this convention, researchers developing new questionnaires will typically include as many items as it takes to hit a “good” level of internal consistency. In practice, relatively few measures use fewer than 8 to 10 items to score each scale (though there are certainly exceptions, e.g., the Ten Item Personality Inventory). Not surprisingly, one practical implication of this policy is that researchers are usually unable to administer more than a handful of questionnaires to participants, because nobody has time to sit around filling out a dozen 100+ item questionnaires.

While understandable from one perspective, the insistence on attaining a certain level of internal consistency is also problematic. It’s easy to forget that while reliability may be necessary for validity, high internal consistency is not. One can have an extremely reliable measure that possesses little or no internal consistency. This is trivial to demonstrate by way of thought experiment. As I wrote in this post a few years ago:

Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable.
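The thought experiment is also easy to simulate. The sketch below rests on assumptions of my own choosing: normally distributed and uncorrelated true scores, a small amount of measurement error (SD of 0.1) added independently at each of two administrations, and a hand-rolled cronbach_alpha helper. It compares the internal consistency of the two-item scale with the test-retest correlation of its sum score.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 10_000
error_sd = 0.1  # assumed: each item is measured almost without error

# Two uncorrelated true scores: shoelace-tying ability and liking for blue.
true_scores = rng.normal(size=(n_people, 2))

# Two administrations of the same two items, each with fresh measurement error.
time1 = true_scores + rng.normal(scale=error_sd, size=(n_people, 2))
time2 = true_scores + rng.normal(scale=error_sd, size=(n_people, 2))


def cronbach_alpha(items):
    """Cronbach's alpha for an (n_people, n_items) array of item scores."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - sum_item_vars / total_var)


alpha = cronbach_alpha(time1)
retest_r = np.corrcoef(time1.sum(axis=1), time2.sum(axis=1))[0, 1]

print(f"internal consistency (alpha): {alpha:.3f}")
print(f"test-retest r of sum score:   {retest_r:.3f}")
```

Under these assumptions alpha should land near zero while the test-retest correlation sits close to 1: a scale with essentially no internal consistency that is nonetheless measured very reliably.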

In fact, we can push this line of thought even further, and say that the perfect measure—in the sense of maximizing both reliability and brevity—should actually have an internal consistency of exactly zero. A value any higher than zero would imply the presence of redundancy between items, which in turn would suggest that we could (at least in theory, though typically not in practice) get rid of one or more items without reducing the amount of variance captured by the measure as a whole.

To use a spatial analogy, suppose we think of each of our measure’s items as a circle in a 2-dimensional space:

[Figure: 20 circles, many of them overlapping, covering only part of the space. Caption: circles! we haz them.]

Here, our goal is to cover the maximum amount of territory using the smallest number of circles (analogous to capturing as much variance in participant responses as possible using the fewest number of items). By this light, the solution in the above figure is kind of crummy, because it fails to cover much of the space despite having 20 circles to work with. The obvious problem is that there’s a lot of redundancy between the circles—many of them overlap in space. A more sensible arrangement, assuming we insisted on keeping all 20 circles, would look like this:


In this case we get complete coverage of the target space just by realigning the circles to minimize overlap.

Alternatively, we could opt to cover more or less the same territory as the first arrangement, but using many fewer circles (in this case, 10):


It turns out that what goes for our toy example in 2D space also holds for self-report measurement of psychological constructs that exist in much higher dimensions. For example, suppose we’re interested in developing a new measure of Extraversion, broadly construed. We want to make sure our measure covers multiple aspects of Extraversion, including sociability, increased sensitivity to reward, assertiveness, talkativeness, and so on. So we develop a fairly large item pool, and then we iteratively select groups of items that (a) have good face validity as Extraversion measures, (b) predict external criteria we think Extraversion should predict (predictive validity), and (c) tend to correlate with each other modestly-to-moderately. At some point we end up with a measure that satisfies all of these criteria, and then presumably we can publish our measure and go on to achieve great fame and fortune.

So far, so good—we’ve done everything by the book. But notice something peculiar about the way the book would have us do things: the very fact that we strive to maintain reasonably solid correlations between our items actually makes our measurement approach much less efficient. To return to our spatial analogy, it amounts to insisting that our circles have to have a high degree of overlap, so that we know for sure that we’re actually measuring what we think we’re measuring. And to be fair, we do gain something for our trouble, in the sense that we can look at our little plot above and say, a-yup, we’re definitely covering that part of the space. But we also lose something, in that we waste a lot of items (or circles) trying to cover parts of the space that have already been covered by other items.

Why would we do something so inefficient? Well, the problem is that in the real world, unlike in our simple little 2D world, we don’t usually know ahead of time exactly what territory we need to cover. We probably have a fuzzy idea of our Extraversion construct, and we might have a general sense that, you know, we should include both reward-related and sociability-related items. But it’s not as if there’s a definitive and unambiguous answer to the question “what behaviors are part of the Extraversion construct?”. There’s a good deal of variation in human behavior that could in principle be construed as part of the latent Extraversion construct, but that in practice is likely to be overlooked (or deliberately omitted) by any particular measure of Extraversion. So we have to carefully explore the space. And one reasonable way to determine whether any given item within that space is still measuring Extraversion is to inspect its correlations with other items that we consider to be unambiguous Extraversion items. If an item correlates, say, 0.5 with items like “I love big parties” and “I constantly seek out social interactions”, there’s a reasonable case to be made that it measures at least some aspects of Extraversion. So we might decide to keep it in our measure. Conversely, if an item shows very low correlations with other putative Extraversion items, we might be inclined to throw it out.

Now, there’s nothing intrinsically wrong with this strategy. But what’s important to realize is that, once we’ve settled on a measure we’re happy with, there’s no longer a good reason to keep all of that redundancy hanging around. It may be useful when we first explore the territory, but as soon as we yell out FIN! and put down our protractors and levels (or whatever it is the kids are using to create new measures these days), it’s now just costing us time and money by making data collection less efficient. We would be better off saying something like, hey, now that we know what we’re trying to measure, let’s see if we can measure it equally well with fewer items. And at that point, we’re in the land of criterion-based measure development, where the primary goal is to predict some target criterion as accurately as possible, foggy notions of internal consistency be damned.

Unfortunately, committing ourselves fully to the noble and just cause of more efficient measurement still leaves open the question of just how we should go about eliminating items from our overly long measures. For that, you’ll have to stay tuned for Part II, wherein I use many flowery words and some concise Python code to try to convince you that this piece of software provides one reasonable way to go about it.

* On a tangential note, this is why traditional pre-publication peer review isn’t very effective, and is in dire need of replacement. Meta-analytic estimates put the inter-reviewer reliability across fields at around .2 to .3, and it’s rare to have more than two or three reviewers on a paper. No psychometrician would recommend evaluating people’s performance in high-stakes situations with just two items that have a ~.3 correlation, yet that’s how we evaluate nearly all of the scientific literature!

7 thoughts on “Internal consistency is overrated, or How I learned to stop worrying and love shorter measures, Part I”

  1. Very interesting post! I am looking forward to Part II.

    Btw, you may also enjoy Ryne Sherman’s take on issues of reliability “versus” validity here:

    Your thought experiment (on shoelace-tying and blue-preference) got me thinking:
    I’m inclined to suggest it would be good not to use the term “reliability” for so many different things (e.g., split-half, internal consistency, retest, parallel-test) and specify what is exactly meant. For example, internal consistency (most often embodied in Cronbach’s alpha) is akin to a measure of scale homogeneity (not unidimensionality, though!) – that is, how well I am covering the same (narrow) space of a construct. In contrast, what we often want with “reliability” is measurement error to be absent or at least as reduced as possible (i.e., we are sampling “true scores” only). Because most psychological measurement contains some error, we’ve grown accustomed (in CTT) to just sample again and again and again the same thing. That does tend to get rid of some measurement error, but also decreases the scope of what is measured (leading to more homogeneity -> higher inter-item correlation -> higher internal consistencies). In contrast, your thought experiment illustrates how heterogeneous items/content may be “reliable” (sensu free of measurement error), but not “homogeneous” (sensu internal consistency). Thus, it appears, reliability and homogeneity could be, in theory, orthogonal. (I realize, though, that things aren’t that easy …)
    Interestingly, the internal consistency dictum may be traced back to our love in CTT for reflective models where one latent factor presumably “produces” correlations among its manifest items (as the sole causal force behind those correlations). Those models must assume (high) inter-item correlations when the latent variance is not partialled out. In contrast, formative models do not have this requirement and seem more in line with an orthogonal concept of reliability and homogeneity. There, manifest items need not correlate; they just form a latent index (based on theoretical considerations, for example).

    I’m curious: Would you endorse using different terminology (not just the blanket term “reliability”)?

    1. Hi John,

      Yes, I saw Ryne’s post–we corresponded a bit before he put it up.

      I agree with you in principle that the term “reliability” is problematic. But in practice I’m not sure what can be done about it. The thing we care about, as you say, is that our measures have as little error as possible. But the difficulty is that the definition of error is dependent on the analysis we’re doing. For example, if you’re interested in the stability of personality traits over time, then any state influences on your measure should probably be considered error (to the degree they’re uncorrelated with stable trait differences). But in another context, you might want to lump the same variance in as reliable signal (e.g., if you’re interested in the effect of transient mood on cognitive performance).

      I don’t see any way around this problem short of being very explicit about what you consider the signal of interest in any given application. I’m not sure that talking about specific estimates of reliability (e.g., test-retest, internal consistency, etc.) solves the problem, because you end up in the same position with any of those–i.e., you still have to specify what the relationship is between any given estimate and the signal you care about. But if there are better ways to talk about these things (informally I mean–a formal specification of the model is still probably best) then I’m all for it!

      1. It’s fair to characterize Generalizability Theory as an attempt to clarify exactly these issues. There we don’t have a unitary notion of “reliability” but instead different “generalizabilities” defined w.r.t. different inferential goals. It’s interesting.

  2. Trying to import precis, get an error:

    ---> 11 from base import Dataset, Measure, AbbreviatedMeasure
    ImportError: No module named 'base'

    1. Hi Alexander,

      If you tried something like “from base import Dataset”, it won’t work because base is a module under the precis namespace, and isn’t globally available. Try “from precis.base import Dataset” and that should work. Actually, “from precis import Dataset” should also work. See the demo IPython notebook for examples.

      Also, please post code-related issues to the GitHub repo (i.e., here). Thanks!

  3. I’m late to the game, but I just stumbled on this post and thought you might be interested in this article by Bollen and Lennox (1991), where they use structural equation models to consider many of the issues you bring up:

    One of the biggest points of contact between the article and your post is where they consider effect models (latent variable shared across measures) and causal indicator models (latent variable is linear function of indicators). In the latter, the indicators need not be correlated–and in fact, just as you describe in your blog, are often better off being uncorrelated (and internal consistency isn’t a sensible measure).

    Another note is that Cronbach’s alpha is not a measure of internal consistency, but the average of all possible split-half reliability estimates (although on second pass, I think you get at this).
