some people are irritable, but everyone likes to visit museums: what personality inventories tell us about how we’re all just like one another

I’ve recently started recruiting participants for online experiments via Mechanical Turk. In the past I’ve always either relied on directory listings (like this one) or targeted specific populations (e.g., bloggers and twitterers) via email solicitation. But recently I’ve started running a very large-sample decision-making study (it’s here, if you care to contribute to the sample), and waiting for participants to trickle in via directories isn’t cutting it. So I’ve started paying people (very) small amounts of money for participation.

One challenge I’ve had to deal with is figuring out how to filter out participants who aren’t really interested in contributing to science, and are strictly in it for the money. 20 or 30 cents is a pittance to most people in the developed world, but as I’ve found out the hard way, gaming MTurk appears to be a thriving business in some developing countries (some of which I’ve unfortunately had to resort to banning entirely). Cheaters aren’t so much of an issue for very quick tasks like providing individual ratings of faces, because (a) the time it takes to give a fake rating isn’t substantially greater than giving one’s actual opinion, and (b) the standards for what counts as accurate performance are clear, so it’s easy to train workers and weed out the bad apples. Unfortunately, my studies generally involve fairly long personality questionnaires combined with other cognitive tasks (e.g., in the current study, you get to repeatedly allocate hypothetical money between yourself and a computer partner, and rate some faces). They often take around half an hour, and involve 20+ questions per screen, so there’s a pretty big incentive for workers who are only in it for the cash to produce random responses and try to increase their effective wage. And the obvious question then is how to detect cheating in the data.

One of the techniques I’ve found works surprisingly well is to simply compare each person’s pattern of responses across items with the mean for the entire sample. In other words, you just compute the correlation between each individual’s item scores and the means for all the items scores across everyone who’s filled out the same measure. I know that there’s an entire literature on this stuff full of much more sophisticated ways to detect random responding, but I find this crude approach really does quite well (I’ve verified this by comparing it with a bunch of other similar metrics), and has the benefit of being trivial to implement.
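
For the curious, here’s roughly what that looks like in code. This is a minimal sketch rather than my actual script: the data are simulated, the 0.2 cutoff is an arbitrary number I made up for illustration, and in practice you’d want to eyeball the flagged cases (and the full distribution) before excluding anyone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_items = 600, 181

# Simulated stand-in for real questionnaire data: most respondents' ratings
# scatter around the item means; a handful respond completely at random.
item_means_true = rng.uniform(1.5, 4.5, size=n_items)
honest = np.clip(item_means_true + rng.normal(scale=1.0, size=(n_subj, n_items)), 1, 5)
random_responders = rng.integers(1, 6, size=(20, n_items)).astype(float)
responses = np.vstack([honest, random_responders])

# Correlate each respondent's item profile with the sample mean profile
item_means = responses.mean(axis=0)
profile_r = np.array([np.corrcoef(row, item_means)[0, 1] for row in responses])

print("median profile-mean correlation:", round(np.median(profile_r), 2))
flagged = np.where(profile_r < 0.2)[0]  # arbitrary cutoff; inspect before excluding anyone
print("flagged respondents:", flagged)
```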

Anyway, one of the things that surprised me when I first computed these correlations is just how strong the relationship between the sample mean and most individuals’ responses is. Here’s what the distribution looks like for one particular inventory, the 181-item Analog to Multiple Broadband Inventories (AMBI, which I introduced in this paper, and discuss further here):

This is based on a sample of about 600 internet respondents, which actually turns out to be pretty representative of the broader population, as Sam Gosling, Simine Vazire, and Sanjay Srivastava will tell you (for what it’s worth, I’ve done the exact same analysis on a similar-sized off-line dataset from Lew Goldberg’s Eugene-Springfield Community Sample (check out that URL!) and obtained essentially the same results). In this sample, the median correlation is .48; so, in effect, you can predict a quarter of the variance in a typical participant’s responses without knowing anything at all about them. Human beings, it turns out, have some things in common with one another (who knew?). What you think you’re like is probably not very dissimilar to what I think I’m like. Which is kind of surprising, considering you’re a well-adjusted, friendly human being, and I’m a real freakshow (okay, a somewhat eccentric, paranoid kind of guy).

What drives that similarity? Much of it probably has to do with social desirability–i.e., many of the AMBI items (and those on virtually all personality inventories) are evaluatively positive or negative statements that most people are inclined to strongly agree or disagree with. But it seems to be a particular kind of social desirability–one that has to do with openness to new experiences, and particularly intellectual ones. For instance, here are the top 10 most endorsed items (based on mean Likert scores across the entire sample; scores are in parentheses):

  1. like to read (4.62)
  2. like to visit new places (4.39)
  3. was a better than average student when I was in school (4.28)
  4. am a good listener (4.25)
  5. would love to explore strange places (4.22)
  6. am concerned about others (4.2)
  7. am open to new experiences (4.18)
  8. amuse my friends (4.16)
  9. love excitement (4.08)
  10. spend a lot of time reading (4.07)

And conversely, here are the 10 least-endorsed items:

  1. was a slow learner in school (1.52)
  2. don’t think that laws apply to me (1.8)
  3. do not like to visit museums (1.83)
  4. have difficulty imagining things (1.84)
  5. have no special urge to do something original (1.87)
  6. do not like art (1.95)
  7. feel little concern for others (1.97)
  8. don’t try to figure myself out (2.01)
  9. break my promises (2.01)
  10. make enemies (2.06)

You can see a clear evaluative component in both lists: almost everyone believes that they’re concerned about others and thinks that they’re smarter than average. But social desirability and positive illusions aren’t enough to explain these patterns, because there are plenty of other items on the AMBI that have an equally strong evaluative component–for instance, “don’t have much energy”, “cannot imagine lying or cheating”, “see myself as a good leader”, and “am easily annoyed”–yet have mean scores pretty close to the midpoint (in fact, the item ‘am easily annoyed’ is endorsed more highly than 107 of the 181 items!). So it isn’t just that we like to think and say nice things about ourselves; we’re willing to concede that we have some bad traits, but maybe not the ones that have to do with disliking cultural and intellectual experiences. I don’t have much of an idea as to why that might be, but it does introspectively feel to me like there’s more of a stigma about, say, not liking to visit new places or experience new things than about admitting that you’re kind of an irritable person. Or maybe it’s just that many of the openness items can be interpreted more broadly than the other evaluative items–e.g., there are lots of different art forms, so almost everyone can endorse a generic “I like art” statement. I don’t really know.

Anyway, there’s nothing the least bit profound about any of this; if anything, it’s just a nice reminder that most of us are not really very good at evaluating where we stand in relation to other people, at least for many traits (for more on that, go read Simine Vazire’s work). The nominal midpoint on most personality scales is usually quite far from the actual median in the general population. This is a pretty big challenge for personality psychology, and if we could figure out how to get people to rank themselves more accurately relative to other people on self-report measures, that would be a pretty huge advance. But it seems quite likely that you just can’t do it, because people simply may not have introspective access to that kind of information.

Fortunately for our ability to measure individual differences in personality, there are plenty of items that do show considerable variance across individuals (actually, in fairness, even items with relatively low variance like the ones above can be highly discriminative if used properly–that’s what item response theory is for). Just for kicks, here are the 10 AMBI items with the largest standard deviations (in parentheses):

  1. disliked math in school (1.56)
  2. wanted to run away from home when I was a child (1.56)
  3. believe in a universal power or god (1.53)
  4. have felt contact with a divine power (1.51)
  5. rarely cry during sad movies (1.46)
  6. am able to fix electrical-wiring problems (1.46)
  7. am devoted to religion (1.44)
  8. shout or scream when I’m angry (1.43)
  9. love large parties (1.42)
  10. felt close to my parents when I was a child (1.42)

So now finally we come to the real moral of this post… that which you’ve read all this long way for. And the moral is this, grasshopper: if you want to successfully pick a fight at a large party, all you need to do is angrily yell at everyone that God told you math sucks.

some thoughtful comments on automatic measure abbreviation

In the comments on my last post, Sanjay Srivastava had some excellent thoughts/concerns about the general approach of automating measure abbreviation using a genetic algorithm. They’re valid concerns that might come up for other people too, so I thought I’d discuss them here in more detail. Here’s Sanjay:

Lew Goldberg emailed me a copy of your paper a while back and asked what I thought of it. I’m pasting my response below — I’d be curious to hear your take on it. (In this email “he” is you and “you” is he because I was writing to Lew…)

::

1. So this is what it feels like to be replaced by a machine.

I’m not sure if Sanjay thinks this is a good or a bad thing? I guess my own feeling is that it’s a good thing to the extent that it makes personality measurement more efficient and frees researchers up to use that time (both during data collection and measure development) for other productive things like eating M&M’s on the couch and devising the most diabolically clever April Fool’s joke for next year to make up for the fact that you forgot to do it this year (I mean, of course, writing papers), and a bad one to the extent that people take this as a license to stop thinking carefully about what they’re doing when they’re shortening or administering questionnaire measures. But provided people retain a measure of skepticism and cautiousness in applying this type of approach, I’m optimistic that the result will be a large net gain.

2. The convergent correlations were a little low in studies 2 and 3. You’d expect shortened scales to have less reliability and validity, of course, but that didn’t go all the way in covering the difference. He explained that this was because the AMBI scales draw on a different item pool than the proprietary measures, which makes sense. However, that makes it hard to evaluate the utility of the approach. If you compare how the full IPIP facet scales correlate with the proprietary NEO (which you’ve published here: http://ipip.ori.org/newNEO_FacetsTable.htm) against his Table 2, for example, it looks like the shortening algorithm is losing some information. Whether that’s better or worse than a rationally shortened scale is hard to say.

This is an excellent point, and I do want to reiterate that the abbreviation process isn’t magic; you can’t get something for free, and you’re almost invariably going to lose some fidelity in your measurement when you shorten any measure. That said, I actually feel pretty good about the degree of convergence I report in the paper. Sanjay already mentions one reason the convergent correlations seem lower than what you might expect: the new measures are composed of  different items than the old ones, so they’re not going to share many of the same sources of error. That means the convergent correlations will necessarily be lower, but isn’t necessarily a problem in a broader sense. But I think there are also two other, arguably more important, reasons why the convergence might seem deceptively low.

One is that the degree of convergence is bounded by the test-retest reliability of the original measures. Because the items in the IPIP pools were administered in batches spanning about a decade, whereas each of the proprietary measures (e.g., the NEO-PI-R) was administered on one occasion, the net result is that many of the items being used to predict personality traits were actually filled out several years before or after the personality measures in question. If you look at the long-term test-retest reliability of some of the measures I abbreviated (and there actually isn’t all that much test-retest data of that sort out there), it’s not clear that it’s much higher than what I report, even for the original measures. In other words, if you don’t generally see test-retest correlations across several years greater than .6 – .8 for the real NEO-PI-R scales, you can’t really expect to do any better with an abbreviated measure. But that probably says more about the reliability of narrowly-defined personality traits than about the abbreviation process.

The other reason the convergent correlations seem lower than you might expect, which I actually think is the big one, is that I reported only the cross-validated coefficients in the paper. In other words, I used only half of the data to abbreviate measures like the NEO-PI-R and HEXACO-PI, and then used the other half to obtain unbiased estimates of the true degree of convergence. This is technically the right way to do things, because if you don’t cross-validate, you’re inevitably going to capitalize on chance. If you fit a model to a particular set of data, and then use the very same data to ask the question “how well does the model fit the data?” you’re essentially cheating–or, to put it more mildly, your estimates are going to be decidedly “optimistic”. You could argue it’s a relatively benign kind of cheating, because almost everyone does it, but that doesn’t make it okay from a technical standpoint.
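
If the direction of that bias isn’t obvious, here’s a toy illustration (simulated data, not the IPIP items; the numbers are made up purely to show the effect). Selecting items on one half of the sample and evaluating them on that same half paints a rosier picture than evaluating them on the held-out half:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subj, n_items = 300, 200

# Toy data: a "true" scale score plus a pool of items that are each weakly related to it
true_score = rng.normal(size=n_subj)
items = 0.1 * true_score[:, None] + rng.normal(size=(n_subj, n_items))

train, test = np.arange(n_subj // 2), np.arange(n_subj // 2, n_subj)

# Pick the 10 items most strongly correlated with the scale, using the training half only
r_train = np.array([np.corrcoef(items[train, j], true_score[train])[0, 1]
                    for j in range(n_items)])
best = np.argsort(r_train)[-10:]
proxy = items[:, best].sum(axis=1)

r_in_sample = np.corrcoef(proxy[train], true_score[train])[0, 1]
r_crossval = np.corrcoef(proxy[test], true_score[test])[0, 1]
print(f"convergent r, same data used for selection: {r_in_sample:.2f}")  # inflated
print(f"convergent r, held-out half:                {r_crossval:.2f}")   # honest
```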

When you look at it this way, the comparison of the IPIP representation of the NEO-PI-R with the abbreviated representation of the NEO-PI-R I generated in my paper isn’t really a fair one, because the IPIP measure Lew Goldberg came up with wasn’t cross-validated. Lew simply took the ten items that most strongly predicted each NEO-PI-R scale and grouped them together (with some careful rational inspection and modification, to be sure). That doesn’t mean there’s anything wrong with the IPIP measures; I’ve used them on multiple occasions myself, and have no complaints. They’re perfectly good measures that I think stand in really well for the (proprietary) originals. My point is just that the convergent correlations reported on the IPIP website are likely to be somewhat inflated relative to the truth.

The nice thing is that we can directly compare the AMBI (the measure I developed in my paper) with the IPIP version of the NEO-PI-R on a level footing by looking at the convergent correlations for the AMBI using only the training data. If you look at the validation (i.e., unbiased) estimates for the AMBI, which is what Sanjay’s talking about here, the mean convergent correlation for the 30 scales of the NEO-PI-R is .63, which is indeed much lower than the .73 reported for the IPIP version of the NEO-PI-R. Personally I’d still probably argue that .63 with 108 items is better than .73 with 300 items, but it’s a subjective question, and I wouldn’t disagree with anyone who preferred the latter. But again, the critical point is that this isn’t a fair comparison. If you make a fair comparison and look at the mean convergent correlation in the training data, it’s .69 for the AMBI, which is much closer to the IPIP data. Given that the AMBI version is just over 1/3rd the length of the IPIP version, I think the choice here becomes more clear-cut, and I doubt that there are many contexts where the (mean) difference between .69 and .73 would have meaningful practical implications.

It’s also worth remembering that nothing says you have to go with the 108-item measure I reported in the paper. The beauty of the GA approach is that you can quite easily generate a NEO-PI-R analog of any length you like. So if your goal isn’t so much to abbreviate the NEO-PI-R as to obtain a non-proprietary analog (and indeed, the IPIP version of the NEO-PI-R is actually longer than the NEO-PI-R, which contains 240 items), I think there’s a very good chance you could do better than the IPIP measure using substantially fewer than 300 items (but more than 108).

In fact, if you really had a lot of time on your hands, and wanted to test this question more thoroughly, what I think you’d want to do is run the GA with systematically varying item costs (i.e., you run the exact same procedure on the same data, but change the itemCost parameter a little bit each time). That way, you could actually plot out a curve showing you the degree of convergence with the original measure as a function of the length of the new measure (this is functionality I’d like to add to the GA code I released when I have the time, but probably not in the near future). I don’t really know what the sweet spot would be, but I can tell you from extensive experimentation that you get diminishing returns pretty quickly. In other words, I just don’t think you’re going to be able to get convergent correlations much higher than .7 on average (this only holds for the IPIP data, obviously; you might do much better using data collected over shorter timespans, or using subsets of items from the original measures). So in that sense, I like where I ended up (i.e., 108 items that still recapture the original quite well).
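
To make the shape of that procedure concrete, here’s a self-contained sketch of the cost sweep. To keep it short I’ve swapped in a crude greedy item-selection routine as a stand-in for the GA (this is emphatically not the released GA code, and the `item_cost` penalty here is just my illustrative analog of that parameter), but the logic of the sweep is the same: run the identical abbreviation procedure at several cost settings and see how convergence trades off against length.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_items = 500, 60

# Toy item pool: items load on a single target scale to varying degrees
loadings = rng.uniform(0, 1, size=n_items)
trait = rng.normal(size=n_subj)
items = trait[:, None] * loadings + rng.normal(size=(n_subj, n_items))
target = trait + rng.normal(scale=0.3, size=n_subj)  # score on the "original" measure

def greedy_abbreviate(items, target, item_cost):
    """Crude stand-in for the GA: keep adding the single best item as long as
    the gain in convergent correlation exceeds the per-item cost."""
    selected, best_r = [], 0.0
    while len(selected) < items.shape[1]:
        gains = np.full(items.shape[1], -np.inf)
        for j in range(items.shape[1]):
            if j not in selected:
                r = np.corrcoef(items[:, selected + [j]].sum(axis=1), target)[0, 1]
                gains[j] = r - best_r
        j_best = int(np.argmax(gains))
        if gains[j_best] <= item_cost:
            break
        selected.append(j_best)
        best_r += gains[j_best]
    return selected, best_r

# Sweep the cost parameter to trace out the length vs. convergence curve
for cost in (0.05, 0.02, 0.01, 0.005, 0.001):
    sel, r = greedy_abbreviate(items, target, cost)
    print(f"item cost {cost:.3f}: {len(sel):2d} items, convergent r = {r:.2f}")
```

Even in this toy version you can see the diminishing returns: the convergent correlation climbs quickly with the first handful of items and then flattens out.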

3. Ultimately I’d like to see a few substantive studies that run the GA-shortened scales alongside the original scales. The column-vector correlations that he reported were hard to evaluate — I’d like to see the actual predictions of behavior, not just summaries. But this seems like a promising approach.

[BTW, that last sentence is the key one. I’m looking forward to seeing more of what you and others can do with this approach.]

When I was writing the paper, I did initially want to include a supplementary figure showing the full-blown matrix of traits predicting the low-level behaviors Sanjay is alluding to (which are part of Goldberg’s massive dataset), but it seemed kind of daunting to present because there are 60 behavioral variables, and most of the correlations were very weak (not just for the AMBI measure–I mean they were weak for the original NEO-PI-R). So you would be looking at a 30 x 60 matrix full of mostly near-zero correlations, which seemed pretty uninformative. So to answer basically the same concern, what I did instead was include a supplementary figure showing a 30 x 5 matrix that captures the relation between the 30 facets of the NEO-PI-R and the Big Five as rated by participants’ peers (i.e., an independent measure of personality). Here’s that figure (click to enlarge):

[Figure: correlations between the 30 NEO-PI-R facets and peer-rated Big Five domains, shown for the AMBI version and for the original NEO-PI-R in the training and validation samples]

What I’m presenting is the same correlation matrix for three different versions of the NEO-PI-R: the AMBI version I generated (on the left), and the original (i.e., real) NEO-PI-R, for both the training and validation samples. The important point to note is that the pattern of correlations with an external set of criterion variables is very similar for all three measures. It isn’t identical of course, but you shouldn’t expect it to be. (In fact, if you look at the rightmost two columns, that gives you a sense of how you can get relatively different correlations even for exactly the same measure and subjects when the sample is randomly divided in two. That’s just sampling variability.) There are, in fairness, one or two blips where the AMBI version does something quite different (e.g., impulsiveness predicts peer-rated Conscientiousness for the AMBI version but not the other two). But overall, I feel pretty good about the AMBI measure when I look at this figure. I don’t think you’re losing very much in terms of predictive power or specificity, whereas I think you’re gaining a lot in time savings.

Having said all that, I couldn’t agree more with Sanjay’s final point, which is that the proof is really in the pudding (who came up with that expression? Bill Cosby?). I’ve learned the hard way that it’s really easy to come up with excellent theoretical and logical reasons for why something should or shouldn’t work, yet when you actually do the study to test your impeccable reasoning, the empirical results often surprise you, and then you’re forced to confront the reality that you’re actually quite dumb (and wrong). So it’s certainly possible that, for reasons I haven’t anticipated, something will go profoundly awry when people actually try to use these abbreviated measures in practice. And then I’ll have to delete this blog, change my name, and go into hiding. But I really don’t think that’s very likely. And I’m willing to stake a substantial chunk of my own time and energy on it (I’d gladly stake my reputation on it too, but I don’t really have one!); I’ve already started using these measures in my own studies–e.g., in a blogging study I’m conducting online here–with promising preliminary results. Ultimately, as with everything else, time will tell whether or not the effort is worth it.

what the general factor of intelligence is and isn’t, or why intuitive unitarianism is a lousy guide to the neurobiology of higher cognitive ability

This post shamelessly plagiarizes (okay, liberally borrows ideas from) a much longer, more detailed, and just generally better post by Cosma Shalizi. I’m not apologetic, since I’m a firm believer in the notion that good ideas should be repeated often and loudly. So I’m going to be often and loud here, though I’ll try to be (slightly) more succinct than Shalizi. Still, if you have the time to spare, you should read his longer and more mathematical take.

There’s a widely held view among intelligence researchers in particular, and psychologists more generally, that there’s a general factor of intelligence (often dubbed g) that accounts for a very large portion of the variance in a broad range of cognitive performance tasks. Which is to say, if you have a bunch of people do a bunch of different tasks, all of which we think tap different aspects of intellectual ability, and then you take all those scores and factor analyze them, you’ll almost invariably get a first factor that explains 50% or more of the variance in the zero-order scores. Or to put it differently, if you know a person’s relative standing on g, you can make a reasonable prediction about how that person will do on lots of different tasks–for example, digit symbol substitution, N-back, go/no-go, and so on and so forth. Virtually all tasks that we think reflect cognitive ability turn out, to varying extents, to reflect some underlying latent variable, and that latent variable is what we dub g.

In a trivial sense, no one really disputes that there’s such a thing as g. You can’t really dispute the existence of g, seeing as a general factor tends to fall out of virtually all factor analyses of cognitive tasks; it’s about as well-replicated a finding as you can get. To say that g exists, on the most basic reading, is simply to slap a name on the empirical fact that scores on different cognitive measures tend to intercorrelate positively to a considerable extent.

What’s not so clear is what the implications of g are for our understanding of how the human mind and brain works. If you take the presence of g at face value, all it really says is what we all pretty much already know: some people are smarter than others. People who do well in one intellectual domain will tend to do pretty well in others too, other things being equal. With the exception of some people who’ve tried to argue that there’s no such thing as general intelligence, but only “multiple intelligences” that totally fractionate across domains (not a compelling story, if you look at the evidence), it’s pretty clear that cognitive abilities tend to hang together pretty well.

The trouble really crops up when we try to say something interesting about the architecture of the human mind on the basis of the psychometric evidence for g. If someone tells you that there’s a single psychometric factor that explains at least 50% of the variance in a broad range of human cognitive abilities, it seems perfectly reasonable to suppose that that’s because there’s some unitary intelligence system in people’s heads, and that that system varies in capacity across individuals. In other words, the two intuitive models people have about intelligence seem to be that either (a) there’s some general cognitive system that corresponds to g, and supports a very large portion of the complex reasoning ability we call “intelligence” or (b) there are lots of different (and mostly unrelated) cognitive abilities, each of which contributes only to specific types of tasks and not others. Framed this way, it just seems obvious that the former view is the right one, and that the latter view has been discredited by the evidence.

The problem is that the psychometric evidence for g stems almost entirely from statistical procedures that aren’t really supposed to be used for causal inference. The primary weapon in the intelligence researcher’s toolbox has historically been principal components analysis (PCA) or exploratory factor analysis, which are really just data reduction techniques. PCA tells you how you can describe your data in a more compact way, but it doesn’t actually tell you what structure is in your data. A good analogy is the use of digital compression algorithms. If you take a directory full of .txt files and compress them into a single .zip file, you’ll almost certainly end up with a file that’s only a small fraction of the total size of the original texts. The reason this works is because certain patterns tend to repeat themselves over and over in .txt files, and a smart algorithm will store an abbreviated description of those patterns rather than the patterns themselves. Which, conceptually, is almost exactly what happens when you run a PCA on a dataset: you’re searching for consistent patterns in the way observations vary along multiple variables, and discarding any redundancy you come across in favor of a more compact description.

Now, in a very real sense, compression is impressive. It’s certainly nice to be able to email your friend a 140kb .zip of your 1200-page novel rather than a 2mb .doc. But note that you don’t actually learn much from the compression. It’s not like your friend can open up that 140k binary representation of your novel, read it, and spare herself the torture of the other 1860kb. If you want to understand what’s going on in a novel, you need to read the novel and think about the novel. And if you want to understand what’s going on in a set of correlations between different cognitive tasks, you need to carefully inspect those correlations and carefully think about those correlations. You can run a factor analysis if you like, and you might learn something, but you’re not going to get any deep insights into the “true” structure of the data. The “true” structure of the data is, by definition, what you started out with (give or take some error). When you run a PCA, you actually get a distorted (but simpler!) picture of the data.

To most people who use PCA, or other data reduction techniques, this isn’t a novel insight by any means. Most everyone who uses PCA knows that in an obvious sense you’re distorting the structure of the data when you reduce its dimensionality. But the use of data reduction is often defended by noting that there must be some reason why variables hang together in such a way that they can be reduced to a much smaller set of variables with relatively little loss of variance. In the context of intelligence, the intuition can be expressed as: if there wasn’t really a single factor underlying intelligence, why would we get such a strong first factor? After all, it didn’t have to turn out that way; we could have gotten lots of smaller factors that appear to reflect distinct types of ability, like verbal intelligence, spatial intelligence, perceptual speed, and so on. But it did turn out that way, so that tells us something important about the unitary nature of intelligence.

This is a strangely compelling argument, but it turns out to be only minimally true. What the presence of a strong first factor does tell you is that you have a lot of positively correlated variables in your data set. To be fair, that is informative. But it’s only minimally informative, because, assuming you eyeballed the correlation matrix in the original data, you already knew that.

What you don’t know, and can’t know, on the basis of a PCA, is what underlying causal structure actually generated the observed positive correlations between your variables. It’s certainly possible that there’s really only one central intelligence system that contributes the bulk of the variance to lots of different cognitive tasks. That’s the g model, and it’s entirely consistent with the empirical data. Unfortunately, it’s not the only one. To the contrary, there are an infinite number of possible causal models that would be consistent with any given factor structure derived from a PCA, including a structure dominated by a strong first factor. In fact, you can have a causal structure with as many variables as you like be consistent with g-like data. So long as the variables in your model all make contributions in the same direction to the observed variables, you will tend to end up with an excessively strong first factor. So you could in principle have 3,000 distinct systems in the human brain, all completely independent of one another, and all of which contribute relatively modestly to a bunch of different cognitive tasks. And you could still get a first factor that accounts for 50% or more of the variance. No g required.

If you doubt this is true, go read Cosma Shalizi’s post, where he not only walks you through a more detailed explanation of the mathematical necessity of this claim, but also illustrates the point using some very simple simulations. Basically, he builds a toy model in which 11 different tasks each draw on several hundred underlying cognitive abilities, which are in turn drawn from a larger pool of 2,766 completely independent abilities. He then runs a PCA on the data and finds, lo and behold, a single factor that explains nearly 50% of the variance in scores. Using PCA, it turns out, you can get something huge from (almost) nothing.
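
If you want to see this with your own eyes without wading through Shalizi’s post, here’s a minimal simulation in the same spirit (the specific numbers here, including the pool size, the subset size per task, and the sample size, are arbitrary choices of mine, not Shalizi’s): thousands of completely independent abilities, a handful of tasks that each sum a large random subset of them, and a first principal component that nonetheless gobbles up a huge share of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_abilities, n_tasks = 2000, 2766, 11
subset_size = 1200  # how many abilities each task draws on (arbitrary)

# Completely independent "abilities"
abilities = rng.normal(size=(n_subj, n_abilities))

# Each task is just the sum of a random subset of abilities, all contributing positively
tasks = np.column_stack([
    abilities[:, rng.choice(n_abilities, size=subset_size, replace=False)].sum(axis=1)
    for _ in range(n_tasks)
])

# PCA on the task correlation matrix
eigvals = np.linalg.eigvalsh(np.corrcoef(tasks, rowvar=False))[::-1]
print("variance explained by the first component: {:.0%}".format(eigvals[0] / n_tasks))
# There is no single underlying factor anywhere in the generative model, yet the first
# component accounts for a large chunk of the variance (roughly 45-50% in this setup),
# simply because the tasks' ability subsets overlap and every contribution is positive.
```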

Now, at this point a proponent of a unitary g might say, sure, it’s possible that there isn’t really a single cognitive system underlying variation in intelligence; but it’s not plausible, because it’s surely more parsimonious to posit a model with just one variable than a model with 2,766. But that’s only true if you think that our brains evolved in order to make life easier for psychometricians, which, last I checked, wasn’t the case. If you think even a little bit about what we know about the biological and genetic bases of human cognition, it starts to seem really unlikely that there really could be a single central intelligence system. For starters, the evidence just doesn’t support it. In the cognitive neuroscience literature, for example, biomarkers of intelligence abound, and they just don’t seem all that related. There’s a really nice paper in Nature Reviews Neuroscience this month by Deary, Penke, and Johnson that reviews a substantial portion of the literature of intelligence; the upshot is that intelligence has lots of different correlates. For example, people who score highly on intelligence tend to (a) have larger brains overall; (b) show regional differences in brain volume; (c) show differences in neural efficiency when performing cognitive tasks; (d) have greater white matter integrity; (e) have brains with more efficient network structures;  and so on.

These phenomena may not all be completely independent, but it’s hard to believe there’s any plausible story you could tell that renders them all part of some unitary intelligence system, or subject to unitary genetic influence. And really, why should they be part of a unitary system? Is there really any reason to think there has to be a single rate-limiting factor on performance? It’s surely perfectly plausible (I’d argue, much more plausible) to think that almost any complex cognitive task you use as an index of intelligence is going to draw on many, many different cognitive abilities. Take a trivial example: individual differences in visual acuity probably make a (very) small contribution to performance on many different cognitive tasks. If you can’t see the minute details of the stimuli as well as the next person, you might perform slightly worse on the task. So some variance in putatively “cognitive” task performance undoubtedly reflects abilities that most intelligence researchers wouldn’t really consider properly reflective of higher cognition at all. And yet, that variance has to go somewhere when you run a factor analysis. Most likely, it’ll go straight into that first factor, or g, since it’s variance that’s common to multiple tasks (i.e., someone with poorer eyesight may tend to do very slightly worse on any task that requires visual attention). In fact, any ability that makes unidirectional contributions to task performance, no matter how relevant or irrelevant to the conceptual definition of intelligence, will inflate the so-called g factor.

If this still seems counter-intuitive to you, here’s an analogy that might, to borrow Dan Dennett’s phrase, prime your intuition pump (it isn’t as dirty as it sounds). Imagine that instead of studying the relationship between different cognitive tasks, we decided to study the relation between performance at different sports. So we went out and rounded up 500 healthy young adults and had them engage in 16 different sports, including basketball, soccer, hockey, long-distance running, short-distance running, swimming, and so on. We then took performance scores for all 16 tasks and submitted them to a PCA. What do you think would happen? I’d be willing to bet good money that you’d get a strong first factor, just like with cognitive tasks. In other words, just like with g, you’d have one latent variable that seemed to explain the bulk of the variance in lots of different sports-related abilities. And just like g, it would have an easy and parsimonious interpretation: a general factor of athleticism!

Of course, in a trivial sense, you’d be right to call it that. I doubt anyone’s going to deny that some people just are more athletic than others. But if you then ask, “well, what’s the mechanism that underlies athleticism,” it’s suddenly much less plausible to think that there’s a single physiological variable or pathway that supports athleticism. In fact, it seems flatly absurd. You can easily think of dozens if not hundreds of factors that should contribute a small amount of the variance to performance on multiple sports. To name just a few: height, jumping ability, running speed, oxygen capacity, fine motor control, gross motor control, perceptual speed, response time, balance, and so on and so forth. And most of these are individually still relatively high-level abilities that break down further at the physiological level (e.g., “balance” is itself a complex trait that at minimum reflects contributions of the vestibular, visual, and cerebellar systems, and so on.). If you go down that road, it very quickly becomes obvious that you’re just not going to find a unitary mechanism that explains athletic ability. Because it doesn’t exist.

All of this isn’t to say that intelligence (or athleticism) isn’t “real”. Intelligence and athleticism are perfectly real; it makes complete sense, and is factually defensible, to talk about some people being smarter or more athletic than other people. But the point is that those judgments are based on superficial observations of behavior; knowing that people’s intelligence or athleticism may express itself in a (relatively) unitary fashion doesn’t tell you anything at all about the underlying causal mechanisms–how many of them there are, or how they interact.

As Cosma Shalizi notes, it also doesn’t tell you anything about heritability or malleability. The fact that we tend to think intelligence is highly heritable doesn’t provide any evidence in favor of a unitary underlying mechanism; it’s just as plausible to think that there are many, many individual abilities that contribute to complex cognitive behavior, all of which are also highly heritable individually. Similarly, there’s no reason to think our cognitive abilities would be any less or any more malleable depending on whether they reflect the operation of a single system or hundreds of variables. Regular physical exercise clearly improves people’s capacity to carry out all sorts of different activities, but that doesn’t mean you’re only training up a single physiological pathway when you exercise; a whole host of changes are taking place throughout your body.

So, assuming you buy the basic argument, where does that leave us? Depends. From a day-to-day standpoint, nothing changes. You can go on telling your friends that so-and-so is a terrific athlete but not the brightest crayon in the box, and your friends will go on understanding exactly what you meant. No one’s suggesting that intelligence isn’t stable and trait-like, just that, at the biological level, it isn’t really one stable trait.

The real impact of relaxing the view that g is a meaningful construct at the biological level, I think, will be in removing an artificial and overly restrictive constraint on researchers’ theorizing. The sense I get, having done some work on executive control, is that g is the 800-pound gorilla in the room: researchers interested in studying the neural bases of intelligence (or related constructs like executive or cognitive control) are always worrying about how their findings relate to g, and how to explain the fact that there might be dissociable neural correlates of different abilities (or even multiple independent contributions to fluid intelligence). To show you that I’m not making this concern up, and that it weighs heavily on many researchers, here’s a quote from the aforementioned and otherwise really excellent NRN paper by Deary et al reviewing recent findings on the neural bases of intelligence:

The neuroscience of intelligence is constrained by — and must explain — the following established facts about cognitive test performance: about half of the variance across varied cognitive tests is contained in general cognitive ability; much less variance is contained within broad domains of capability; there is some variance in specific abilities; and there are distinct ageing patterns for so-called fluid and crystallized aspects of cognitive ability.

The existence of g creates a complicated situation for neuroscience. The fact that g contributes substantial variance to all specific cognitive ability tests is generally thought to indicate that g contributes directly in some way to performance on those tests. That is, when domains of thinking skill (such as executive function and memory) or specific tasks (such as mental arithmetic and non-verbal reasoning on the Raven’s Progressive Matrices test) are studied, neuroscientists are observing brain activity related to g as well as the specific task activities. This undermines the ability to determine localized brain activities that are specific to the task at hand.

I hope I’ve convinced you by this point that the neuroscience of intelligence doesn’t have to explain why half of the variance is contained in general cognitive ability, because there’s no good evidence that there is such a thing as general cognitive ability (except in the descriptive psychometric sense, which carries no biological weight). Relaxing this artificial constraint would allow researchers to get on with the interesting and important business of identifying correlates (and potential causal determinants) of different cognitive abilities without having to worry about the relation of their finding to some Grand Theory of Intelligence. If you believe in g, you’re going to be at a complete loss to explain how researchers can continually identify new biological and genetic correlates of intelligence, and how the effect sizes could be so small (particularly at a genetic level, where no one’s identified a single polymorphism that accounts for more than a fraction of the observable variance in intelligence–the so called problem of “missing heritability”). But once you discard the fiction of g, you can take such findings in stride, and can set about the business of building integrative models that allow for and explicitly model the presence of multiple independent contributions to intelligence. And if studying the brain has taught us anything at all, it’s that the truth is inevitably more complicated than what we’d like to believe.

functional MRI and the many varieties of reliability

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there’s already a nice write-up of the paper over at Mind Hacks, I’ll content myself with adding a number of points to B&M’s discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there’s a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations” paper, and in fact, I’ve made it myself before. But it’s actually not quite correct. It’s true that, for any given test, the true reliability of the variables involved limits the potential size of the true effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.
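
Here’s a quick numerical illustration of that point (simulated, obviously, with made-up numbers): two essentially noise-free but completely unrelated items yield a composite with a Cronbach’s alpha of about zero, even though the composite score is almost perfectly reproducible across administrations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two completely independent "true" traits (shoelace-tying ability, liking the color blue)
shoelace, blue = rng.normal(size=n), rng.normal(size=n)

def administer(trait, error_sd=0.1):
    # each administration of an item adds a little independent measurement error
    return trait + rng.normal(scale=error_sd, size=n)

def cronbach_alpha(item_scores):
    k = item_scores.shape[1]
    return k / (k - 1) * (1 - item_scores.var(axis=0, ddof=1).sum()
                          / item_scores.sum(axis=1).var(ddof=1))

items_t1 = np.column_stack([administer(shoelace), administer(blue)])
items_t2 = np.column_stack([administer(shoelace), administer(blue)])

alpha = cronbach_alpha(items_t1)  # ~0: the items don't cohere at all
retest = np.corrcoef(items_t1.sum(axis=1), items_t2.sum(axis=1))[0, 1]  # ~.99
print(f"internal consistency (alpha): {alpha:.2f}")
print(f"test-retest r of the total score: {retest:.2f}")
```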

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.
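
If that seems abstract, here’s a toy version of the DLPFC scenario (again, invented numbers purely for illustration): a small stable component plus a large session-specific one that drives both performance and activation on each occasion. Test-retest reliability of the activation comes out low even though the within-session brain-behavior correlation is very strong and entirely real.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Small stable (trait-like) component of each subject's DLPFC response
stable = rng.normal(scale=0.5, size=n)

def scan_session():
    # Session-specific state (fatigue, motivation, caffeine...) affects both
    # working memory performance and DLPFC activation on that occasion
    state = rng.normal(size=n)
    performance = stable + state + rng.normal(scale=0.3, size=n)
    activation = stable + state + rng.normal(scale=0.3, size=n)
    return performance, activation

perf1, act1 = scan_session()
perf2, act2 = scan_session()

print(f"test-retest r of DLPFC activation:       {np.corrcoef(act1, act2)[0, 1]:.2f}")   # low
print(f"within-session performance-activation r: {np.corrcoef(perf1, act1)[0, 1]:.2f}")  # high
```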

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don’t talk about this, as far as I can tell, but it’s an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you’re interested in. If you’re interested in individual differences, it’s within-subject variance that counts as error, so you want to minimize that. Conversely, if you’re interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can’t do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you’re going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there’ll be much more variability between individuals. There’s no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you’re interested in is the contrast between high and low working memory load, it shouldn’t necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn’t worry you if activations aren’t reproducible across studies at the group level. In some ways, those are actually the ideal situations for each of those two types of studies.
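
A quick simulation makes the tension concrete (all the numbers are invented for illustration): the “strong” manipulation produces a massive group-level effect with near-zero test-retest reliability of the contrast, while the “weak” one does the opposite.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40  # subjects

def simulate_contrast(mean_effect, subject_sd, noise_sd=0.5):
    """Each subject's high-vs-low condition contrast, measured on two occasions."""
    subject_effect = rng.normal(scale=subject_sd, size=n)
    t1 = mean_effect + subject_effect + rng.normal(scale=noise_sd, size=n)
    t2 = mean_effect + subject_effect + rng.normal(scale=noise_sd, size=n)
    retest_r = np.corrcoef(t1, t2)[0, 1]
    group_t = t1.mean() / (t1.std(ddof=1) / np.sqrt(n))
    return retest_r, group_t

# Strong manipulation: everyone shows a big effect, little individual variation
print("strong: retest r = %.2f, group t = %.1f" % simulate_contrast(2.0, 0.1))
# Weak manipulation: no mean effect, but large, stable individual differences
print("weak:   retest r = %.2f, group t = %.1f" % simulate_contrast(0.0, 1.0))
```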

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between 0.40 and 0.58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).
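
For a sense of how quickly effect sizes shrivel as reliability drops, the textbook attenuation formula says the observable correlation between two measures is the true correlation scaled by the square root of the product of their (true) reliabilities: r_observed = r_true * sqrt(rel_x * rel_y). With the caveat from above that the relevant reliabilities have to be estimated appropriately in the first place, the arithmetic looks like this (the r_true of .5 is just a hypothetical value for illustration):

```python
# Classical attenuation: r_observed = r_true * sqrt(rel_x * rel_y)
r_true = 0.5  # hypothetical true brain-behavior correlation
for rel in (0.9, 0.8, 0.7, 0.5, 0.3):
    r_obs = r_true * (rel * rel) ** 0.5  # both measures assumed equally reliable
    print(f"reliability {rel:.1f} for both measures: observable r is at most {r_obs:.2f}")
```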

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences