abbreviating personality measures in R: a tutorial

A while back I blogged about a paper I wrote that uses genetic algorithms to abbreviate personality measures with minimal human intervention. In the paper, I promised to put the R code I used online, so that other people could download and use it. I put off doing that for a long time, because the code was pretty much spaghetti by the time the paper got accepted, and there are any number of things I’d rather do than spend a weekend rewriting my own code. But one of the unfortunate things about publicly saying that you’re going to do something is that you eventually have to do that something. So, since the paper was published in JRP last week, and several people have emailed me to ask for the code, I spent much of the weekend making the code presentable. It’s not a fully-formed R package yet, but it’s mostly legible, and seems to work more or less ok. You can download the file (gaabbreviate.R) here. The rest of this (very long) post is basically a tutorial on how to use the code, so you probably want to stop reading this now unless you have a burning interest in personality measurement.

Prerequisites and installation

Although you won’t need to know much R to follow this tutorial, you will need to have R installed on your system. Fortunately, R is freely available for all major operating systems. You’ll also need the genalg and psych packages for R, because gaabbreviate won’t run without them. Once you have R installed, you can download and install those packages like so:

install.packages(c('genalg', 'psych'))

Once that’s all done, you’re ready to load gaabbreviate.R:

source("/path/to/the/file/gaabbreviate.R")

…where you make sure to specify the right path to the location where you saved the file. And that’s it! Now you’re ready to abbreviate measures.

Reading in data

The file contains several interrelated functions, but the workhorse is gaa.abbreviate(), which takes a set of item scores and scale scores for a given personality measure as input and produces an abbreviated version of the measure, along with a bunch of other useful information. In theory, you can go from old data to new measure in a single line of R code, with almost no knowledge of R required (though I think it’s a much better idea to do it step-by-step and inspect the results at every stage to make sure you know what’s going on).

The abbreviation function is pretty particular about the format of the input it expects. It takes two separate matrices, one with item scores, the other with scale scores (a scale here just refers to any set of one or more items used to generate a composite score). Subjects are in rows, item or scale scores are in columns. So for example, let’s say you have data from 3 subjects, who filled out a personality measure that has two separate scales, each composed of two items. Your item score matrix might look like this:

3 5 1 1
2 2 4 1
2 4 5 5

…which you could assign in R like so:

items = matrix(c(3,2,2, 5,2,4, 1,4,5, 1,1,5), ncol=4)

I.e., the first subject had scores of 3, 5, 1, and 1 on the four items, respectively; the second subject had scores of 2, 2, 4, and 1… and so on.

Based on the above, if you assume items 1 and 2 constitute one scale, and items 3 and 4 constitute the other, the scale score matrix would be:

8 2
4 5
6 10
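If your scale scores aren’t already sitting in a matrix somewhere, you can build them yourself in R. Here’s a minimal sketch for the toy example above (the column names are just placeholders):

# scales 1 and 2 are just the sums of items 1-2 and items 3-4, respectively
scales = cbind(items[, 1] + items[, 2], items[, 3] + items[, 4])
colnames(scales) = c("scale1", "scale2")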

Of course, real data will probably have hundreds of subjects, dozens of items, and a bunch of different scales, but that’s the basic format. Assuming you can get your data into an R matrix or data frame, you can feed it directly to gaa.abbreviate() and it will hopefully crunch your data without complaining. But if you don’t want to import your data into R before passing it to the code, you can also pass filenames as arguments instead of matrices. For example:

gaa = gaa.abbreviate(items="someFileWithItemScores.txt", scales="someFileWithScaleScores.txt", iters=100)

If you pass files instead of data, the referenced text files must be tab-delimited, with subjects in rows, item/scale scores in columns, and a header row that gives the names of the columns (i.e., item names and scale names; these can just be numbers if you like, but they have to be there). Subject identifiers should not be in the files.
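If your data are already in R and you just need files in that format, base R’s write.table() will produce them. A minimal sketch, assuming item and scale matrices named items and scales (the filenames are just placeholders):

# tab-delimited, with a header row (write.table adds one by default) and no subject IDs
write.table(items, "someFileWithItemScores.txt", sep="\t", row.names=FALSE, quote=FALSE)
write.table(scales, "someFileWithScaleScores.txt", sep="\t", row.names=FALSE, quote=FALSE)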

Key parameters: stuff you should set every time

Assuming you can get gaabbreviate to read in your data, you can then set about getting it to abbreviate your measure by selecting a subset of items that retain as much of the variance in the original scales as possible. There are a few parameters you’ll need to set; some are mandatory, others aren’t, but should really be specified anyway since the defaults aren’t likely to work well for different applications.

The most important (and mandatory) argument is iters, which is the number of iterations you want the GA to run for. If you pick too high a number, the GA may take a very long time to run if you have a very long measure; if you pick too low a number, you’re going to get a crappy solution. I think iters=100 is a reasonable place to start, though in practice, obtaining a stable solution tends to require several hundred iterations. The good news (which I cover in more detail below) is that you can take the output you get from the abbreviation function and feed it right back in as many times as you want, so it’s not like you need to choose the number of iterations carefully or anything.

The other two key parameters are itemCost and maxItems. The itemCost is what determines the degree to which your measure is compressed. If you want a detailed explanation of how this works, see the definition of the cost function in the paper. Very briefly, the GA tries to optimize the trade-off between number of items and amount of variance explained. Generally speaking, the point of abbreviating a measure is to maximize the amount of explained variance (in the original scale scores) while minimizing the number of items retained. Unfortunately, you can’t do both very well at the same time, because any time you drop an item, you’re also losing its variance. So the trick is to pick a reasonable compromise: a measure that’s relatively short and still does a decent job recapturing the original. The itemCost parameter is what determines the length of that measure. When you set it high, the GA will place a premium on brevity, resulting in a shorter (but less accurate) measure; when you set it low, it’ll allow a longer measure that maximizes fidelity. The optimal itemCost will vary depending on your data, but I find 0.05 is a good place to start, and then you can tweak it to get measures with more or fewer items as you see fit.
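To make the trade-off a bit more concrete, the quantity the GA minimizes has roughly the following flavor (this is a schematic illustration only, not the paper’s exact definition; all the variable names and values below are made up):

# schematic illustration of the cost trade-off -- not the literal code the GA runs
itemCost = 0.05
nItemsRetained = 60                # number of items in the candidate abbreviated measure
r2 = c(.85, .72, .90)              # variance explained in each original scale by the candidate items
cost = itemCost * nItemsRetained + sum(1 - r2)
cost                               # lower is better; fewer items and higher r2 both push this down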

The maxItems parameter sets the upper bound on the number of items that will be used to score each scale. The default is 5, but you may find this number too small if you’re trying to abbreviate scales composed of a large number of items. Again, it’s worth playing around with this to see what happens. Generally speaking, the same trade-off between brevity and fidelity discussed above holds here too.

Given reasonable values for the above arguments, you should be able to feed in raw data and get out an abbreviated measure with minimal work. Assuming you’re reading your data from a file, the entire stream can be as simple as:

gaa = gaa.abbreviate(items="someFileWithItemScores.txt", scales="someFileWithScaleScores.txt", iters=100, itemCost=0.05, maxItems=5, writeFile='outputfile.txt')

That’s it! Assuming your data are in the correct format (and if they’re not, the script will probably crash with a nasty error message), gaabbreviate will do its thing and produce your new, shorter measure within a few minutes or hours, depending on the size of the initial measure. The writeFile argument is optional, and gives the name of an output file you want the measure saved to. If you don’t specify it, the output will be assigned to the gaa object in the above call (note the “gaa = ” part of the call), but won’t be written to file. But that’s not a problem, because you can always achieve the same effect later by calling the gaa.writeMeasure function (e.g., in the above example, gaa.writeMeasure(gaa, file="outputfile.txt") would achieve exactly the same thing).

Other important options

Although you don’t really need to do anything else to produce abbreviated measures, I strongly recommend reading the rest of this document and exploring some of the other options if you’re planning to use the code, because some features are non-obvious. Also, the code isn’t foolproof, and it can do weird things with your data if you’re not paying attention. For one thing, by default, gaabbreviate will choke on missing values (i.e., NAs). You can do two things to get around this: either enable pairwise processing (pairwise=T), or turn on mean imputation (impute=T). I say you can do these things, but I strongly recommend against using either option. If you have missing values in your data, it’s really a much better idea to figure out how to deal with those missing values before you run the abbreviation function, because the abbreviation function is dumb, and it isn’t going to tell you whether pairwise analysis or imputation is a sensible thing to do. For example, if you have 100 subjects with varying degrees of missing data, and only have, say, 20 subjects’ scores for some scales, the resulting abbreviated measure is going to be based on only 20 subjects’ worth of data for some scales if you turn pairwise processing on. Similarly, imputing the mean for missing values is a pretty crude way to handle missing data, and I only put it in so that people who just wanted to experiment with the code wouldn’t have to go to the trouble of doing it themselves. But in general, you’re much better off reading your item and scale scores into R (or SPSS, or any other package), processing any missing values in some reasonable way, and then feeding gaabbreviate the processed data.
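For what it’s worth, here’s a minimal sketch of the simplest (and most wasteful) approach, listwise deletion, assuming your item and scale matrices are named items and scales:

# keep only subjects with complete data in both matrices
keep = complete.cases(items) & complete.cases(scales)
items = items[keep, ]
scales = scales[keep, ]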

Another important point to note is that, by default, gaabbreviate will cross-validate its results. What that means is that only half of your data will be used to generate an abbreviated measure; the other half will be used to provide unbiased estimates of how well the abbreviation process worked. There’s an obvious trade-off here. If you use the split-half cross-validation approach, you’re going to get more accurate estimates of how well the abbreviation process is really working, but the fit itself might be slightly poorer because you have less data. Conversely, if you turn cross-validation off (crossVal=F), you’re going to be using all of your data in the abbreviation process, but the resulting estimates of the quality of the solution will inevitably be biased because you’re going to be capitalizing on chance to some extent.

In practice, I recommend always leaving cross-validation enabled, unless you either (a) really don’t care about quality control (which makes you a bad person), or (b) have a very small sample size, and can’t afford to leave out half of the data in the abbreviation process (in which case you should consider collecting more data). My experience has been that with 200+ subjects, you generally tend to see stable solutions even when leaving cross-validation on, though that’s really just a crude rule of thumb that I’m pulling out of my ass, and larger samples are always better.

Other less important options

There are a bunch of other less important options that I won’t cover in any detail here, but that are reasonably well-covered in the comments in the source file if you’re so inclined. Some of these are used to control the genetic algorithm used in the abbreviation process. The gaa.abbreviate function doesn’t actually do the heavy lifting itself; instead, it relies on the genalg library to run the actual genetic algorithm. Although the default genalg parameters will work fine 95% of the time, if you really want to manually set the size of the population or the ratio of initial zeros to ones, you can pass those arguments directly. But there’s relatively little reason to play with these parameters, because you can always achieve more or less the same ends simply by adding iterations.
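Concretely, the relevant genalg parameters are popSize and zeroToOneRatio; assuming gaa.abbreviate simply forwards them under those names (an assumption on my part, so check the source if it complains), a call might look like this:

# hypothetical: a larger GA population and sparser initial candidate measures
gaa = gaa.abbreviate(items=myItems, scales=myScales, iters=100, popSize=400, zeroToOneRatio=20)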

Two other potentially useful options I won’t touch on, though they’re there if you want them, give you the ability to (a) set a minimum bound on the correlation required in order for an item to be included in the scoring equation for a scale (the minR argument), and (b) apply non-unit weightings to the scales (the sWeights argument), in cases where you want to emphasize some scales at the cost of others (i.e., because you want to measure some scales more accurately).

Two examples

The following two examples assume you’re feeding in item and scale matrices named myItems and myScales, respectively:

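Something like the following call would do it (the exact itemCost value here is just an illustration of a “low” setting):

my.new.shorter.measure = gaa.abbreviate(items=myItems, scales=myScales, iters=500, itemCost=0.01, maxItems=10, impute=T, crossVal=F)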

This will run a genetic algorithm for 500 generations on mean-imputed data with cross-validation turned off, and assign the result to a variable named my.new.shorter.measure. It will probably produce an only slightly shorter measure, because the itemCost is low and up to 10 items are allowed to load on each scale.

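And something like this for the second example (again, the specific itemCost is only meant to illustrate a “relatively high” setting, and maxItems is simply left at its default):

gaa = gaa.abbreviate(items=myItems, scales=myScales, iters=100, itemCost=0.25, sWeights=c(1,1,1,2,2), writeFile='shortMeasure.txt')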

This will run 100 iterations with cross-validation enabled (the default, so we don’t need to specify it explicitly) and write the result to a file named shortMeasure.txt. It’ll probably produce a highly abbreviated measure, because the itemCost is relatively high. It also assigns more weight (twice as much, in fact) to the fourth and fifth scales in the measure relative to the first three, as reflected in the sWeights argument (a vector where the ith element indicates the weight of the ith scale in the measure, so presumably there are five scales in this case).

The gaa object

Assuming you’ve read this far, you’re probably wondering what you get for your trouble once you’ve run the abbreviation function. The answer is that you get… a gaa (which stands for GA Abbreviate) object. The gaa object contains almost all the information that was used at any point in the processing, which you can peruse at your leisure. If you’re familiar with R, you’ll know that you can see what’s in the object with the attributes function. For example, if you assigned the result of the abbreviation function to a variable named ‘myMeasure’, here’s what you’d see:

attributes(myMeasure)

The gaa object has several internal lists (data, settings, results, etc.), each of which in turn contains several other variables. I’ve tried to give these sensible names. In brief:

  • data contains all the data used to create the measure (i.e., the item and scale scores you fed in)
  • settings contains all the arguments you specified when you called the abbreviation function (e.g., iters, maxItems, etc.)
  • results contains variables summarizing the results of the GA run, including information about each previous iteration of the GA
  • best contains information about the single best measure produced (this is generally not useful, and is for internal purposes)
  • rbga is the rbga.bin object produced by the genalg library (for more information, see the genalg documentation)
  • measure is what you’ll probably find most important, as it contains the details of the final measure that was produced

To see the contents of each of these lists in turn, you can inspect them the same way; for example:

attributes(myMeasure$measure)

So the ‘measure’ attribute in the gaa object contains a bunch of other variables with information about the resulting measure. And here’s a brief summary:

  • items: a vector containing the numerical ID of items retained in the final measure relative to the original list (e.g., if you fed in 100 items, and the ‘items’ variable contained the numbers 4, 10, 14… that’s the GA telling you that it decided to keep items no. 4, 10, 14, etc., from the original set of 100 items).
  • nItems: the number of items in the final measure.
  • key: a scoring key for the new measure, where the rows are items on the new measure, and the columns are the scales. This key is compatible with score.items() in Bill Revelle’s excellent psych package, which means that once you’ve got the key, you can automatically score data for the new measure simply by calling score.items() (see the sketch just after this list, and the psych documentation for more details), and don’t need to do any manual calculating or programming yourself.
  • ccTraining and ccValidation: convergent correlations for the training and validation halves of the data, respectively. The convergent correlation is the correlation between the new scale scores (i.e., those that you get using the newly-generated measure) and the “real” scale scores. The ith element in the vector gives you the convergent correlation for the ith scale in your original measure. The validation coefficients will almost invariably be lower than the training coefficients, and the validation numbers are the ones you should trust as an unbiased estimate of the quality of the measure.
  • alpha: coefficient alpha for each scale. Note that you should expect to get lower internal consistency estimates for GA-produced measures than you’re used to, and this is actually a good thing. If you want to know why, read the discussion in the paper.
  • nScaleItems: a vector containing the number of items used to score each scale. If you left minR set to 0, this will always be identical to maxItems for all scales. If you raised minR, the number of items will sometimes be lower (i.e., in cases where there were very few items that showed a strong enough correlation to be retained).
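For example, here’s roughly what scoring a fresh dataset with the new key might look like. This is just a sketch: it assumes the gaa object is an ordinary R list (so its pieces can be pulled out with $), and newData is hypothetical, containing responses to just the retained items, in the same order as the rows of the key:

library(psych)
# newData: responses to the items kept in the abbreviated measure
scored = score.items(myMeasure$measure$key, newData)
scored$scores   # composite scale scores on the new, abbreviated measure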

Just give me the measure already!

Supposing you’re not really interested in plumbing the depths of the gaa object or working within R more than is necessary, you might just be wondering what the quickest way to get an abbreviated measure you can work with is. In that case, all you really need to do is pass a filename in the writeFile argument when you call gaa.abbreviate (see the examples given above), and you’ll get out a plain text file that contains all the essential details of the new measure. Specifically you’ll get (a) a mapping from old items to new, so that you can figure out which items are included in the new measure (e.g., a line like “4 45” means that the 4th item on the new measure is no. 45 in the original set of items), and (b) a human-readable scoring key for each scale (the only thing to note here is that an “R” next to an item indicates the item is reverse-keyed), along with key statistics (coefficient alpha and convergent correlations for the training and validation halves). So if all goes well, you really won’t need to do anything else in R beyond call that one line that makes the measure. But again, I’d strongly encourage you to carefully inspect the gaa object in R to make sure everything looks right. The fact that the abbreviation process is fully automated isn’t a reason to completely suspend all rational criteria you’d normally use when developing a scale; it just means you probably have to do substantially less work to get a measure you’re happy with.

Killing time…

Depending on how big your dataset is (actually, mainly the number of items in the original measure), how many iterations you’ve requested, and how fast your computer is, you could be waiting a long time for the abbreviation function to finish its work. Because you probably want to know what the hell is going on internally during that time, I’ve provided a rudimentary monitoring display that will show you the current state of the genetic algorithm after every iteration. It looks something like this:

[Figure: the gaabbreviate monitoring display, as it appears partway through a run]

This is admittedly a pretty confusing display, and Edward Tufte would probably murder several kittens if he saw it, but it’s not supposed to be a work of art, just to provide some basic information while you’re sitting there twiddling your thumbs (ok, ok, I promise I’ll label the panels better when I have the time to work on it). But basically, it shows you three things. The leftmost three panels show you the basic information about the best measure produced by the GA as it evolves across generations. Respectively, the top, middle, and bottom panels show you the total cost, measure length, and mean variance explained (R^2) as a function of iteration. The total cost can only ever go down, but the length and R^2 can go up or down (though there will tend to be a consistent trajectory for measure length that depends largely on what itemCost you specified).

The middle panel shows you detailed information about how well the GA-produced measure captures variance in each of the scales in the original measure. In this case, I’m abbreviating the 30 facets of the NEO-PI-R. The red dots display the amount of variance explained in each trait, as of the current iteration.

Finally, the rightmost panel shows you a graphical representation of which items are included in the best measure identified by the GA at each iteration. Each row represents one iteration (i.e., you’re seeing the display as it appears after 200 iterations of a 250-iteration run); black bars represent items that weren’t included, white bars represent items that were included. The point of this display isn’t to actually tell you which items are being kept (you can’t possibly glean that level of information at this resolution), but rather, to give you a sense of how stable the solution is. If you look at the first few (i.e., topmost) iterations, you’ll see that the solution is very unstable: the GA is choosing very different items as the “best” measure on each iteration. But after a while, as the GA “settles” into a neighborhood, the solution stabilizes and you see only relatively small (though still meaningful) changes from generation to generation. Basically, once the line in the top left panel (total cost) has asymptoted, and the solution in the rightmost panel is no longer changing much if at all, you know that you’ve probably arrived at as good a solution as you’re going to get.

Incidentally, if you use the generic plot() method on a completed gaa object (e.g., plot(myMeasure)), you’ll get exactly the same figure you see here, with the exception that the middle figure will also have black points plotted alongside the red ones. The black points show you the amount of variance explained in each trait for the cross-validated results. If you’re lucky, the red and black points will be almost on top of each other; if you’re not, the black ones will be considerably to the left of the red ones.

Consider recycling

The last thing I’ll mention, which I already alluded to earlier, is that you can recycle gaa objects. That’s to say, suppose you ran the abbreviation for 100 iterations, only to get back a solution that’s still clearly suboptimal (i.e., the cost function is still dropping rapidly). Rather than having to start all over again, you can simply feed the gaa object back into the abbreviation function in order to run further iterations. And you don’t need to specify any additional parameters (assuming you want to run the same number of iterations you did last time; otherwise you’ll need to specify iters); all of the settings are contained within the gaa object itself. So, assuming you ran the abbreviation function and stored the result in ‘myMeasure’, you can simply do:

myMeasure = gaa.abbreviate(myMeasure, iters=200)

and you’ll get an updated version of the measure that’s had the benefit of an extra 200 iterations. And of course, you can save and load R objects to/from files, so that you don’t need to worry about all of your work disappearing next time you start R. So save(myMeasure, file='myMeasure.RData') will save your gaa object for future use, and the next time you need it, calling load('myMeasure.RData') will restore myMeasure to your workspace (alternatively, you can just save the entire workspace).

Anyway, I think that covers all of the important stuff. There are a few other things I haven’t documented here, but if you’ve read this far, and have gotten the code to work in R, you should be able to start abbreviating your own measures relatively painlessly. If you do use the code to generate shorter measures, and end up with measures you’re happy with, I’d love to hear about it. And if you can’t get the code to work, or can get it to work but are finding issues with the code or the results, I guess I’ll grudgingly accept those emails too. In general, I’m happy to provide support for the code via email provided I have the time. The caveat is that, if you’re new to R, and are having problems with basic things like installing packages or loading files from source, you should really read a tutorial or reference that introduces you to R (Quick-R is my favorite place to start) before emailing me with problems. But if you’re having problems that are specific to the gaabbreviate code (e.g., you’re getting a weird error message, or aren’t sure what something means), feel free to drop me a line and I’ll try to respond as soon as I can.

elsewhere on the net

I’ve been swamped with work lately, so blogging has taken a backseat. I keep a text file on my desktop of interesting things I’d like to blog about; normally, about three-quarters of the links I paste into it go unblogged, but in the last couple of weeks it’s more like 98%. So here are some things I’ve found interesting recently, in no particular order:

It’s World Water Day 2010! Or at least it was a week ago, which is when I should have linked to these really moving photos.

Carl Zimmer has a typically brilliant (and beautifully illustrated) article in the New York Times about “Unseen Beasts, Then and Now”:

Somewhere in England, about 600 years ago, an artist sat down and tried to paint an elephant. There was just one problem: he had never seen one.

John Horgan writes a surprisingly bad guest blog post for Scientific American in which he basically accuses neuroscientists (not a neuroscientist or some neuroscientists, but all of us, collectively) of selling out by working with the US military. I’m guessing that the number of working neuroscientists who’ve ever received any sort of military funding is somewhere south of 10%, and is probably much smaller than the corresponding proportion in any number of other scientific disciplines, but why let data get in the way of a good anecdote or two. [via Peter Reiner]

Mark Liberman follows up his first critique of Louann Brizendine’s new “book” The Male Brain with a second one, now that he’s actually got his hands on a copy. Verdict: the book is still terrible. Mark was also kind enough to answer my question about what the mysterious “sexual pursuit area” is. Apparently it’s the medial preoptic area. And the claim that this area governs sexual behavior in humans and is 2.5 times larger in males is, once again, based entirely on work in the rat.

Commuting sucks. Jonah Lehrer discusses evidence from happiness studies (by way of David Brooks) suggesting that most people would be much happier living in a smaller house close to work than a larger house that requires a lengthy commute:

According to the calculations of Frey and Stutzer, a person with a one-hour commute has to earn 40 percent more money to be as satisfied with life as someone who walks to the office.

I’ve taken these findings to heart, and whenever my wife and I move now, we prioritize location over space. We’re currently paying through the nose to live in a 750 square foot apartment near downtown Boulder. It’s about half the size of our old place in St. Louis, but it’s close to everything, including our work, and we love living here.

The modern human brain is much bigger than it used to be, but we didn’t get that way overnight. John Hawks disputes Colin Blakemore’s claim that “the human brain got bigger by accident and not through evolution”.

Attitudes toward causal modeling of correlational (and even some experimental) data differ widely: Sanjay Srivastava leans (or maybe used to lean) toward the permissive side; Andrew Gelman is skeptical. There’s been a flurry of recent work suggesting that causal modeling techniques like mediation analysis and SEM suffer from a number of serious and underappreciated problems, and after reading this paper by Bullock, Green and Ha, I guess I’m inclined to agree.

A landmark ruling by a New York judge yesterday has the potential to invalidate existing patents on genes, which currently cover about 20% of the human genome in some form. Daniel MacArthur has an excellent summary.

the male brain hurts, or how not to write about science

My wife asked me to blog about this article on CNN because, she said, “it’s really terrible, and it shouldn’t be on CNN”. I usually do what my wife tells me to do, so I’m blogging about it. It’s by Louann Brizendine, M.D., author of the controversial book The Female Brain, and now, its manly counterpart, The Male Brain. From what I can gather, the CNN article, which is titled Love, Sex, and the Male Brain, is a precis of Brizendine’s new book (though I have no intention of reading the book to make sure). The article is pretty short, so I’ll go through the first half of it paragraph-by-paragraph. But I’ll warn you right now that it isn’t pretty, and will likely anger anyone with even a modicum of training in psychology or neuroscience.

Although women the world over have been doing it for centuries, we can’t really blame a guy for being a guy. And this is especially true now that we know that the male and female brains have some profound differences.

Our brains are mostly alike. We are the same species, after all. But the differences can sometimes make it seem like we are worlds apart.

So far, nothing terribly wrong here, just standard pop psychology platitudes. But it goes quickly downhill.

The “defend your turf” area — dorsal premammillary nucleus — is larger in the male brain and contains special circuits to detect territorial challenges by other males. And his amygdala, the alarm system for threats, fear and danger is also larger in men. These brain differences make men more alert than women to potential turf threats.

As Vaughan notes over at Mind Hacks, the dorsal premammillary nucleus (PMD) hasn’t been identified in humans, so it’s unclear exactly what chunk of tissue Brizendine’s referring to–let alone where the evidence that there are gender differences in humans might come from. The claim that the PMD is a “defend your turf” area might be plausible, if oh, I don’t know, you happen to think that the way rats behave under narrowly circumscribed laboratory conditions when confronted by an aggressor is a good guide to normal interactions between human males. (Then again, given that PMD lesions impair rats from running away when exposed to a cat, Brizendine could just as easily have concluded that the dorsal premammillary nucleus is the “fleeing” part of the brain.)

The amygdala claim is marginally less ridiculous: it’s not entirely clear that the amygdala is “the alarm system for threats, fear and danger”, but at least that’s a claim you can make with a straight face, since it’s one fairly common view among neuroscientists. What’s not really defensible is the claim that larger amygdalae “make men more alert than women to potential turf threats”, because (a) there’s limited evidence that the male amygdala really is larger than the female amygdala, (b) if such a difference exists, it’s very small, and (c) it’s not clear in any case how you go from a small between-group difference to the notion that somehow the amygdala is the reason why men maintain little interpersonal fiefdoms and women don’t.

Meanwhile, the “I feel what you feel” part of the brain — mirror-neuron system — is larger and more active in the female brain. So women can naturally get in sync with others’ emotions by reading facial expressions, interpreting tone of voice and other nonverbal emotional cues.

This falls under the rubric of “not even wrong“. The mirror neuron system isn’t a single “part of the brain”; current evidence suggests that neurons that show mirroring properties are widely distributed throughout multiple frontoparietal regions. So I don’t really know what brain region Brizendine is referring to (the fact that she never cites any empirical studies in support of her claims is something of an inconvenience in that respect). And even if I did know, it’s a safe bet it wouldn’t be the “I feel what you feel” brain region, because, as far as I know, no such thing exists. The central claim regarding mirror neurons isn’t that they support empathy per se, but that they support a much more basic type of representation–namely, abstract conceptual (as opposed to sensory/motor) representation of actions. And even that much weaker notion is controversial; for example, Greg Hickok has a couple of recent posts (and a widely circulated paper) arguing against it. No one, as far as I know, has provided any kind of serious evidence linking the mirror neuron system to females’ (modestly) superior nonverbal decoding ability.

Perhaps the biggest difference between the male and female brain is that men have a sexual pursuit area that is 2.5 times larger than the one in the female brain. Not only that, but beginning in their teens, they produce 200 to 250 percent more testosterone than they did during pre-adolescence.

Maybe the silliest paragraph in the whole article. Not only do I not know what region Brizendine is talking about here, I have absolutely no clue what the “sexual pursuit area” might be. It could be just me, I suppose, but I just searched Google Scholar for “sexual pursuit area” and got… zero hits. Is it a visual region? A part of the hypothalamus? The notoriously grabby motor cortex hand area? No one knows, and Brizendine isn’t telling.  Off-hand, I don’t know of any region of the human brain that shows the degree of sexual dimorphism Brizendine claims here.

If testosterone were beer, a 9-year-old boy would be getting the equivalent of a cup a day. But a 15-year-old would be getting the equivalent of nearly two gallons a day. This fuels their sexual engines and makes it impossible for them to stop thinking about female body parts and sex.

If each fiber of chest hair was a tree, a 12-year-old boy would have a Bonsai sitting on the kitchen counter, and a 30-year-old man would own Roosevelt National Forest. What you’re supposed to learn from this analogy, I honestly couldn’t tell you. It’s hard for me to think clearly about trees and hair you see, seeing as how I find it impossible to stop thinking about female body parts while I’m trying to write this.

All that testosterone drives the “Man Trance”– that glazed-eye look a man gets when he sees breasts. As a woman who was among the ranks of the early feminists, I wish I could say that men can stop themselves from entering this trance. But the truth is, they can’t. Their visual brain circuits are always on the lookout for fertile mates. Whether or not they intend to pursue a visual enticement, they have to check out the goods.

To a man, this is the most natural response in the world, so he’s dismayed by how betrayed his wife or girlfriend feels when she sees him eyeing another woman. Men look at attractive women the way we look at pretty butterflies. They catch the male brain’s attention for a second, but then they flit out of his mind. Five minutes later, while we’re still fuming, he’s deciding whether he wants ribs or chicken for dinner. He asks us, “What’s wrong?” We say, “Nothing.” He shrugs and turns on the TV. We smolder and fear that he’ll leave us for another woman.

This actually isn’t so bad if you ignore the condescending “men are animals with no self-control” implication and pretend Brizendine had just made the  indisputably true but utterly banal observation that men, on average, like to ogle women more than women, on average, like to ogle men.

Not surprisingly, the different objectives that men and women have in mating games put us on opposing teams — at least at first. The female brain is driven to seek security and reliability in a potential mate before she has sex. But a male brain is fueled to mate and mate again. Until, that is, he mates for life.

So men are driven to sleep around, again and again… until they stop sleeping around. It’s tautological and profound at the same time!

Despite stereotypes to the contrary, the male brain can fall in love just as hard and fast as the female brain, and maybe more so. When he meets and sets his sights on capturing “the one,” mating with her becomes his prime directive. And when he succeeds, his brain makes an indelible imprint of her. Lust and love collide and he’s hooked.

Failure to operationalize complex construct of “love” in a measurable way… check. Total lack of evidence in support of claim that men and women are equally love-crazy… check. Oblique reference to Star Trek universe… check. What’s not to like?

A man in hot pursuit of a mate doesn’t even remotely resemble a devoted, doting daddy. But that’s what his future holds. When his mate becomes pregnant, she’ll emit pheromones that will waft into his nostrils, stimulating his brain to make more of a hormone called prolactin. Her pheromones will also cause his testosterone production to drop by 30 percent.

You know, on the off-chance that something like this is actually true, I think it’s actually kind of neat. But I just can’t bring myself to do a literature search, because I’m pretty sure I’ll discover that the jury is still out on whether humans even emit and detect pheromones (ok, I know this isn’t a completely baseless claim), or that there’s little to no evidence of a causal relationship between women releasing pheromones and testosterone levels dropping in men. I don’t like to be disappointed, you see; it turns out it’s much easier to just decide what you want to believe ahead of time and then contort available evidence to fit that view.

Anyway, we’re only half-way through the article; Brizendine goes on in similar fashion for several hundred more words. Highlights include the origin of the male poker face, the conflation of correlation and causation in sociable elderly men, and the effects of oxytocin on your grandfather. You should go read the rest of it if you practice masochism; I’m too depressed to write about it any more.

Setting aside the blatant exercise in irresponsible scientific communication (Brizendine has an MD, and appears to be at least nominally affiliated with UCSF’s psychiatry department, so ignorance shouldn’t really be a valid excuse here), I guess what I’d really like to know is what goes through Brizendine’s mind when she writes this sort of dreck. Does she really believe the ludicrous claims she makes? Is she fully aware she’s grossly distorting the empirical evidence if not outright confabulating, and is simply in it for the money? Or does she rationalize it as a case of the ends justifying the means, thinking the message she’s presenting is basically right, so it’s ok if a few of the details go missing in the process?

I understand that presenting scientific evidence in an accurate and entertaining manner is a difficult business, and many people who work hard at it still get it wrong pretty often (I make mistakes in my posts here all the time!). But many scientists still manage to find time in their busy schedules to write popular science books that present the science in an accessible way without having to make up ridiculous stories just to keep the reader entertained (Steven Pinker, Antonio Damasio, and Dan Gilbert are just a few of the first ones that spring to mind). And then there are amazing science writers like Carl Zimmer and David Dobbs who don’t necessarily have any professional training in the areas they write about, but still put in the time and energy to make sure they get the details right, and consistently write stories that blow me away (the highest compliment I can pay to a science story is that it makes me think “I wish I studied that“, and Zimmer’s articles routinely do that). That type of intellectual honesty is essential, because there’s really no point in going to the trouble of doing most scientific research if people get to disregard any findings they disagree with on ideological or aesthetic grounds, or can make up any evidence they like to fit their claims.

The sad thing is that Brizendine’s new book will probably sell more copies in its first year out than Carl Zimmer’s entire back catalogue. And it’s not going to sell all those copies because it’s a careful meditation on the subtle differences between genders that scientists have uncovered; it’s going to fly off the shelves because it basically regurgitates popular stereotypes about gender differences with a seemingly authoritative scientific backing. Instead of evaluating and challenging many of those notions with actual empirical data, people who read Brizendine’s work will now get to say “science proves it!”, making it that much more difficult for responsible scientists and journalists to tell the public what’s really true about gender differences.

You might say (or at least, Brizendine might say) that this is all well and good, but hopelessly naive and idealistic, and that telling an accurate story is always going to be less important than telling the public what it wants to hear about science, because the latter is the only way to ensure continued funding for and interest in scientific research. This isn’t that uncommon a sentiment; I’ve even heard a number of scientists who I otherwise have a great deal of respect for say something like this. But I think Brizendine’s work underscores the typical outcome of that type of reasoning: once you allow yourself to relax the standards for what counts as evidence, it becomes quite easy to rationalize almost any rhetorical abuse of science, and ultimately you abuse the public’s trust while muddying the waters for working scientists.

As with so many other things, I think Richard Feynman summed up this sentiment best:

I would like to add something that’s not essential to the science, but something I kind of believe, which is that you should not fool the layman when you’re talking as a scientist. I am not trying to tell you what to do about cheating on your wife, or fooling your girlfriend, or something like that, when you’re not trying to be a scientist, but just trying to be an ordinary human being. We’ll leave those problems up to you and your rabbi. I’m talking about a specific, extra type of integrity that is not lying, but bending over backwards to show how you are maybe wrong, that you ought to have when acting as a scientist. And this is our responsibility as scientists, certainly to other scientists, and I think to laymen.

For example, I was a little surprised when I was talking to a friend who was going to go on the radio. He does work on cosmology and astronomy, and he wondered how he would explain what the applications of this work were. “Well,” I said, “there aren’t any.” He said, “Yes, but then we won’t get support for more research of this kind.” I think that’s kind of dishonest. If you’re representing yourself as a scientist, then you should explain to the layman what you’re doing–and if they don’t want to support you under those circumstances, then that’s their decision.

No one doubts that men and women differ from one another, and the study of gender differences is an active and important area of psychology and neuroscience. But I can’t for the life of me see any merit in telling the public that men can’t stop thinking about breasts because they’re full of the beer-equivalent of two gallons of testosterone.

[Update 3/25: plenty of other scathing critiques pop up in the blogosphere today: Language Log, Salon, and Neuronarrative, and no doubt many others…]

green chile muffins and brains in a truck: weekend in albuquerque

I spent the better part of last week in Albuquerque for the Mind Research Network fMRI course. It’s a really well-organized 3-day course, and while it’s geared toward people without much background in fMRI, I found a lot of the lectures really helpful. It’s impossible to get everything right when you run an fMRI study; the magnet is very fickle and doesn’t like to do what you ask it to–and that assumes you’re asking it to do the right thing, which is also not so common. So I find I learn something interesting from almost every fMRI talk I attend, even when it’s stuff I thought I already knew.

Of course, since I know very little, there’s also almost always stuff that’s completely new to me. In this case, it was a series of lectures on independent components analysis (ICA) of fMRI data, focusing on Vince Calhoun‘s group’s implementation of ICA in the GIFT toolbox. It’s a beautifully implemented set of tools that offer a really powerful alternative to standard univariate analysis, and I’m pretty sure I’ll be using it regularly from now on. So the ICA lectures alone were worth the price of admission. (In the interest of full disclosure, I should note that my post-doc mentor, Tor Wager, is one of the organizers of the MRN course, and I wasn’t paying the $700 tab out of pocket. But I’m not getting any kickbacks to say nice things about the course, I promise.)

Between the lectures and the green chile corn muffins, I didn’t get to see much of Albuquerque (except from the air, where the urban sprawl makes the city seem much larger than its actual population of 800k people would suggest), so I’ll reserve judgment for another time. But the MRN itself is a pretty spectacular facility. Aside from a 3T Siemens Trio magnet, they also have a 1.5T mobile scanner built into a truck. It’s mostly used to scan inmates in the New Mexico prison system (you’ll probably be surprised to learn that they don’t let hardened criminals out of jail to participate in scientific experiments–so the scanner has to go to jail instead). We got a brief tour of the mobile scanner and it was pretty awesome. Which is to say, it beats the pants off my Honda.

There are also some parts of the course I don’t remember so well. Here’s a (blurry) summary of those parts, courtesy of Alex Shackman:

[Photo: BlurryScott, BlurryTor, and BlurryTal, the Boulder branch of the lab, Albuquerque 2010 edition]

scientists aren’t dumb; statistics is hard

There’s a feature article in the new issue of Science News on the failure of science “to face the shortcomings of statistics”. The author, Tom Siegfried, argues that many scientific results shouldn’t be believed because they depend on faulty statistical practices:

Even when performed correctly, statistical tests are widely misunderstood and frequently misinterpreted. As a result, countless conclusions in the scientific literature are erroneous, and tests of medical dangers or treatments are often contradictory and confusing.

I have mixed feelings about the article. It’s hard to disagree with the basic idea that many scientific results are the product of statistical malpractice and/or misfortune. And Siegfried generally provides lucid explanations of some common statistical pitfalls when he sticks to the descriptive side of things. For instance, he gives nice accounts of Bayesian inference, of the multiple comparisons problem, and of the distinction between statistical significance and clinical/practical significance. And he nicely articulates what’s wrong with one of the most common (mis)interpretations of p values:

Correctly phrased, experimental data yielding a P value of .05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the .05 level: “This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.“

So as a laundry list of common statistical pitfalls, it works quite nicely.

What I don’t really like about the article is that it seems to lay the blame squarely on the use of statistics to do science, rather than the way statistical analysis tends to be performed. That’s to say, a lay person reading the article could well come away with the impression that the very problem with science is that it relies on statistics. As opposed to the much more reasonable conclusion that science is hard, and statistics is hard, and ensuring that your work sits at the intersection of good science and good statistical practice is even harder. Siegfried all but implies that scientists are silly to base their conclusions on statistical inference. For instance:

It’s science’s dirtiest secret: The “scientific method“ of testing hypotheses by statistical analysis stands on a flimsy foundation. Statistical tests are supposed to guide scientists in judging whether an experimental result reflects some real effect or is merely a random fluke, but the standard methods mix mutually inconsistent philosophies and offer no meaningful basis for making such decisions.

Or:

Experts in the math of probability and statistics are well aware of these problems and have for decades expressed concern about them in major journals. Over the years, hundreds of published papers have warned that science’s love affair with statistics has spawned countless illegitimate findings. In fact, if you believe what you read in the scientific literature, you shouldn’t believe what you read in the scientific literature.

The problem is that there isn’t really any viable alternative to the “love affair with statistics”. Presumably Siegfried doesn’t think (most) scientists ought to be doing qualitative research; so the choice isn’t between statistics and no statistics, it’s between good and bad statistics.

In that sense, the tone of a lot of the article is pretty condescending: it comes off more like Siegfried saying “boy, scientists sure are dumb” and less like the more accurate observation that doing statistics is really hard, and it’s not surprising that even very smart people mess up frequently.

What makes it worse is that Siegfried slips up on a couple of basic points himself, and says some demonstrably false things in a couple of places. For instance, he explains failures to replicate genetic findings this way:

Nowhere are the problems with statistics more blatant than in studies of genetic influences on disease. In 2007, for instance, researchers combing the medical literature found numerous studies linking a total of 85 genetic variants in 70 different genes to acute coronary syndrome, a cluster of heart problems. When the researchers compared genetic tests of 811 patients that had the syndrome with a group of 650 (matched for sex and age) that didn’t, only one of the suspect gene variants turned up substantially more often in those with the syndrome — a number to be expected by chance.

“Our null results provide no support for the hypothesis that any of the 85 genetic variants tested is a susceptibility factor“ for the syndrome, the researchers reported in the Journal of the American Medical Association.

How could so many studies be wrong? Because their conclusions relied on “statistical significance,“ a concept at the heart of the mathematical analysis of modern scientific experiments.

This is wrong for at least two reasons. One is that, to believe the JAMA study Siegfried is referring to, and disbelieve the results of all 85 previously reported findings, you have to accept the null hypothesis, which is one of the very same errors Siegfried is supposed to be warning us against. In fact, you have to accept the null hypothesis 85 times. In the JAMA paper, the authors are careful to note that it’s possible the actual effects were simply overstated in the original studies, and that at least some of the original findings might still hold under more restrictive conditions. The conclusion that there really is no effect whatsoever is almost never warranted, because you rarely have enough power to rule out even very small effects. But Siegfried offers no such qualifiers; instead, he happily accepts 85 null hypotheses in support of his own argument.

The other issue is that it isn’t really the reliance on statistical significance that causes replication failures; it’s usually the use of excessively liberal statistical criteria. The problem has very little to do with the hypothesis testing framework per se. To see this, consider that if researchers always used a criterion of p < .0000001 instead of the conventional p < .05, there would almost never be any replication failures (because there would almost never be any statistically significant findings, period). So the problem is not so much with the classical hypothesis testing framework as with the choices many researchers make about how to set thresholds within that framework. (That’s not to say that there aren’t any problems associated with frequentist statistics, just that this isn’t really a fair one.)
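If you want to see just how liberal p < .05 is when you’re running dozens of tests, a couple of lines of R make the point. This is purely a toy simulation (the 85 tests and the two thresholds are borrowed from the example above; everything else is made up for illustration):

# simulate many "studies", each running 85 tests of truly null effects,
# and ask how often at least one test clears each significance threshold
set.seed(1)
smallest.p = replicate(10000, min(runif(85)))   # smallest of 85 uniform p-values per study
mean(smallest.p < .05)         # nearly every study turns up at least one "significant" result
mean(smallest.p < .0000001)    # essentially none do at the stricter threshold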

Anyway, Siegfried’s explanations of the pitfalls of statistical significance then leads him to make what has to be hands-down the silliest statement in the article:

But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.

If you take this statement at face value, you should conclude there’s no point in doing statistical analysis, period. No matter what statistical procedure you use, you’re never going to know for cross-your-heart-hope-to-die sure that your conclusions are warranted. After all, you’re always going to have the same two possibilities: either the effect is real, or it’s not (or, if you prefer to frame the problem in terms of magnitude, either the effect is about as big as you think it is, or it’s very different in size). The same exact conclusion goes through if you take a threshold of p < .001 instead of one of p < .05: the effect can still be a spurious and improbable fluke. And it also goes through if you have twelve replications instead of just one positive finding: you could still be wrong (and people have been wrong). So saying that “two possible conclusions remain” isn’t offering any deep insight; it’s utterly vacuous.

The reason scientists use a conventional threshold of p < .05 when evaluating results isn’t because we think it gives us some magical certainty into whether a finding is “real” or not; it’s because it feels like a reasonable level of confidence to shoot for when making claims about whether the null hypothesis of no effect is likely to hold or not. Now there certainly are many problems associated with the hypothesis testing framework–some of them very serious–but if you really believe that “there’s no logical basis for using a P value from a single study to draw any conclusion,” your beef isn’t actually with p values, it’s with the very underpinnings of the scientific enterprise.

Anyway, the bottom line is Siegfried’s article is not so much bad as irresponsible. As an accessible description of some serious problems with common statistical practices, it’s actually quite good. But I guess the sense I got in reading the article was that at some point Siegfried became more interested in writing a contrarian piece about how scientists are falling down on the job than about how doing statistics well is just really hard for almost all of us (I certainly fail at it all the time!). And ironically, in the process of trying to explain just why “science fails to face the shortcomings of statistics”, Siegfried commits some of the very same errors he’s taking scientists to task for.

[UPDATE: Steve Novella says much the same thing here.]

[UPDATE 2: Andrew Gelman has a nice roundup of other comments on Siegfried’s article throughout the blogosphere.]

fMRI becomes big, big science

There are probably lots of criteria you could use to determine the relative importance of different scientific disciplines, but the one I like best is the Largest Number of Authors on a Paper. Physicists have long had their hundred-authored papers (see for example this individual here; be sure to click on the “show all authors/affiliations” link), and with the initial sequencing and analysis of the human genome, which involved contributions from 452 different persons, molecular geneticists also joined the ranks of Officially Big Science. Meanwhile, us cognitive neuroscientists have long had to content ourselves with silly little papers that have only four to seven authors (maybe a dozen on a really good day). Which means, despite the pretty pictures we get to put in our papers, we’ve long had this inferiority complex about our work, and a nagging suspicion that it doesn’t really qualify as big science (full disclosure: so when I say “we”, I probably just mean “I”).

UNTIL NOW.

Thanks to the efforts of Bharat Biswal and 53 collaborators (yes, I counted) reported in a recent paper in PNAS, fMRI is now officially Big, Big Science. Granted, 54 authors is still small potatoes in physics-and-biology-land. And for all I know, there could be other fMRI papers with even larger author lists out there that I’ve missed.  BUT THAT’S NOT THE POINT. The point is, people like me now get to run around and say we do something important.

You might think I'm being insincere here, and that I'm really poking fun at ridiculously long author lists that couldn't possibly reflect meaningful contributions from that many people. Well, I'm not. While I'm not seriously suggesting that the mark of good science is how many authors are on the paper, I really do think that the prevalence of long author lists in a discipline is an important sign of a discipline's maturity, and that the fact that you can get several dozen contributors to a single paper means you're seeing a level of collaboration across different labs that previously didn't exist.

The importance of large-scale collaboration is one of the central elements of the new PNAS article, which is appropriately entitled Toward discovery science of human brain function. What Biswal et al have done is compile the largest publicly-accessible fMRI dataset on the planet, consisting of over 1,400 scans from 35 different centers. All of the data, along with some tools for analysis, are freely available for download from NITRC. Be warned though: you’re probably going to need a couple of terabytes of free space if you want to download the entire dataset.

You might be wondering why no one’s assembled an fMRI dataset of this scope until now; after all, fMRI isn’t that new a technique, having been around for about 20 years now. The answer (or at least, one answer) is that it’s not so easy–and often flatly impossible–to combine raw fMRI datasets in any straightforward way. The problem is that the results of any given fMRI study only really make sense in the context of a particular experimental design. Functional MRI typically measures the change in signal associated with some particular task, which means that you can’t really go about combining the results of studies of phonological processing with those of thermal pain and obtain anything meaningful (actually, this isn’t entirely true; there’s a movement afoot to create image-based centralized databases that will afford meta-analyses on an even more massive scale,  but that’s a post for another time). You need to ensure that the tasks people performed across different sites are at least roughly in the same ballpark.

What allowed Biswal et al  to consolidate datasets to such a degree is that they focused exclusively on one particular kind of cognitive task. Or rather, they focused on a non-task: all 1400+ scans in the 1000 Functional Connectomes Project (as they’re calling it) are from participants being scanned during the “resting state”. The resting state is just what it sounds like: participants are scanned while they’re just resting; usually they’re given no specific instructions other than to lie still, relax, and not fall asleep. The typical finding is that, when you contrast this resting state with activation during virtually any kind of goal-directed processing, you get widespread activation increases in a network that’s come to be referred to as the “default” or “task-negative” network (in reference to the fact that it’s maximally active when people are in their “default” state).

One of the main (and increasingly important) applications of resting state fMRI data is in functional connectivity analyses, which aim to identify patterns of coactivation across different regions rather than mean-level changes associated with some task. The fundamental idea is that you can get a lot of traction on how the brain operates by studying how different brain regions interact with one another spontaneously over time, without having to impose an external task set. The newly released data is ideal for this kind of exploration, since you have a simply massive dataset that includes participants from all over the world scanned in a range of different settings using different scanners. So if you want to explore the functional architecture of the human brain during the resting state, this should really be your one-stop shop. (In fact, I’m tempted to say that there’s going to be much less incentive for people to collect resting-state data from now on, since there really isn’t much you’re going to learn from one sample of 20 – 30 people that you can’t learn from 1,400 people from 35+ combined samples).
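
(For anyone who hasn't run one of these analyses, here's a toy sketch of what the seed-based approach boils down to. This is simulated data and has nothing to do with the actual 1000 Functional Connectomes pipeline, but the core logic really is just correlating time series across regions.)

# fake resting-state data: 200 timepoints x 50 regions, one column per region
set.seed(1)
ts.data = matrix(rnorm(200 * 50), ncol=50)
shared = rnorm(200)                       # a common signal shared by regions 1-10...
ts.data[, 1:10] = ts.data[, 1:10] + shared  # ...so those regions form a toy "network"
seed = ts.data[, 1]                       # treat region 1 as the seed
connectivity = cor(seed, ts.data)[1, ]    # correlate the seed with every region
round(connectivity[1:15], 2)              # regions 2-10 stand out; the rest hover near zero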

Aside from introducing the dataset to the literature, Biswal et al also report a number of new findings. One neat finding is that functional parcellation of the brain using seed-based connectivity (i.e., identifying brain regions that coactivate with a particular “seed” or target region) shows marked consistency across different sites, revealing what Biswal et al call a “universal architecture”. This type of approach by itself isn't particularly novel, as similar techniques have been used before. But no one's done it on anything approaching this scale. Here's what the results look like:

You can see that different seeds produce different functional parcellations across the brain (the brighter areas denote ostensible boundaries).

Another interesting finding is the presence of gender and age differences in functional connectivity:

What this image shows is differences in functional connectivity with specific seed regions (the black dots) as a function of age (left) or gender (right). (The three rows reflect different techniques for producing the maps, with the upshot being that the results are very similar regardless of exactly how you do the analysis.) It isn't often you get to see scatterplots with 1,400+ points in cognitive neuroscience, so this is a welcome sight. Although it's also worth pointing out the inevitable downside of having huge sample sizes, which is that even tiny effects attain statistical significance. Which is to say, while the above findings are undoubtedly more representative of gender and age differences in functional connectivity than anything else you're going to see for a long time, notice that they're very small effects (e.g., in the right panels, you can see that the differences between men and women are only a fraction of a standard deviation in size, despite the fact that these regions were probably selected because they show some of the “strongest” effects). That's not meant as a criticism; it's actually a very good thing, in that these modest effects are probably much closer to the truth than what previous studies have reported. Such findings should serve as an important reminder that most of the effects identified by fMRI studies are almost certainly massively inflated by small sample size (as I've discussed before here and in this paper).
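
(To put some numbers on that: a correlation most of us would consider trivially small sails past conventional significance thresholds once you have 1,400 subjects. The r of .08 below is just a value I picked for illustration, not anything from the paper.)

# p-value for a tiny correlation (r = .08) at n = 1,400 versus a more typical n = 20
r = .08
for (n in c(1400, 20)) {
  t.stat = r * sqrt((n - 2) / (1 - r^2))
  p = 2 * pt(-abs(t.stat), df = n - 2)
  print(round(c(n = n, t = t.stat, p = p), 4))
}
# with n = 1400, p comes out below .01; with n = 20, the same r gives p of about .7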

Anyway, the bottom line is that if you’ve ever thought to yourself, “gee, I wish I could do cutting-edge fMRI research, but I really don’t want to leave my house to get a PhD; it’s almost lunchtime,” this is your big chance. You can download the data, rejoice in the magic that is the resting state, and bathe yourself freely in functional connectivity. The Biswal et al paper bills itself as “a watershed event in functional imaging,” and it’s hard to argue otherwise. Researchers now have a definitive data set to use for analyses of functional connectivity and the resting state, as well as a model for what other similar data sets might look like in the future.

More importantly, with 54 authors on the paper, fMRI is now officially big science. Prepare to suck it, Human Genome Project!

Biswal, B., Mennes, M., Zuo, X., Gohel, S., Kelly, C., Smith, S., Beckmann, C., Adelstein, J., Buckner, R., Colcombe, S., Dogonowski, A., Ernst, M., Fair, D., Hampson, M., Hoptman, M., Hyde, J., Kiviniemi, V., Kotter, R., Li, S., Lin, C., Lowe, M., Mackay, C., Madden, D., Madsen, K., Margulies, D., Mayberg, H., McMahon, K., Monk, C., Mostofsky, S., Nagel, B., Pekar, J., Peltier, S., Petersen, S., Riedl, V., Rombouts, S., Rypma, B., Schlaggar, B., Schmidt, S., Seidler, R., Siegle, G., Sorg, C., Teng, G., Veijola, J., Villringer, A., Walter, M., Wang, L., Weng, X., Whitfield-Gabrieli, S., Williamson, P., Windischberger, C., Zang, Y., Zhang, H., Castellanos, F., & Milham, M. (2010). Toward discovery science of human brain function. Proceedings of the National Academy of Sciences, 107(10), 4734-4739. DOI: 10.1073/pnas.0911855107

what the general factor of intelligence is and isn’t, or why intuitive unitarianism is a lousy guide to the neurobiology of higher cognitive ability

This post shamelessly plagiarizes (er, liberally borrows ideas from) a much longer, more detailed, and just generally better post by Cosma Shalizi. I'm not apologetic, since I'm a firm believer in the notion that good ideas should be repeated often and loudly. So I'm going to be often and loud here, though I'll try to be (slightly) more succinct than Shalizi. Still, if you have the time to spare, you should read his longer and more mathematical take.

There’s a widely held view among intelligence researchers in particular, and psychologists more generally, that there’s a general factor of intelligence (often dubbed g) that accounts for a very large portion of the variance in a broad range of cognitive performance tasks. Which is to say, if you have a bunch of people do a bunch of different tasks, all of which we think tap different aspects of intellectual ability, and then you take all those scores and factor analyze them, you’ll almost invariably get a first factor that explains 50% or more of the variance in the zero-order scores. Or to put it differently, if you know a person’s relative standing on g, you can make a reasonable prediction about how that person will do on lots of different tasks–for example, digit symbol substitution, N-back, go/no-go, and so on and so forth. Virtually all tasks that we think reflect cognitive ability turn out, to varying extents, to reflect some underlying latent variable, and that latent variable is what we dub g.

In a trivial sense, no one really disputes that there’s such a thing as g. You can’t really dispute the existence of g, seeing as a general factor tends to fall out of virtually all factor analyses of cognitive tasks; it’s about as well-replicated a finding as you can get. To say that g exists, on the most basic reading, is simply to slap a name on the empirical fact that scores on different cognitive measures tend to intercorrelate positively to a considerable extent.

What’s not so clear is what the implications of g are for our understanding of how the human mind and brain works. If you take the presence of g at face value, all it really says is what we all pretty much already know: some people are smarter than others. People who do well in one intellectual domain will tend to do pretty well in others too, other things being equal. With the exception of some people who’ve tried to argue that there’s no such thing as general intelligence, but only “multiple intelligences” that totally fractionate across domains (not a compelling story, if you look at the evidence), it’s pretty clear that cognitive abilities tend to hang together pretty well.

The trouble really crops up when we try to say something interesting about the architecture of the human mind on the basis of the psychometric evidence for g. If someone tells you that there’s a single psychometric factor that explains at least 50% of the variance in a broad range of human cognitive abilities, it seems perfectly reasonable to suppose that that’s because there’s some unitary intelligence system in people’s heads, and that that system varies in capacity across individuals. In other words, the two intuitive models people have about intelligence seem to be that either (a) there’s some general cognitive system that corresponds to g, and supports a very large portion of the complex reasoning ability we call “intelligence” or (b) there are lots of different (and mostly unrelated) cognitive abilities, each of which contributes only to specific types of tasks and not others. Framed this way, it just seems obvious that the former view is the right one, and that the latter view has been discredited by the evidence.

The problem is that the psychometric evidence for g stems almost entirely from statistical procedures that aren't really supposed to be used for causal inference. The primary weapon in the intelligence researcher's toolbox has historically been principal components analysis (PCA) or exploratory factor analysis, which are really just data reduction techniques. PCA tells you how you can describe your data in a more compact way, but it doesn't actually tell you what structure is in your data. A good analogy is the use of digital compression algorithms. If you take a directory full of .txt files and compress them into a single .zip file, you'll almost certainly end up with a file that's only a small fraction of the total size of the original texts. The reason this works is because certain patterns tend to repeat themselves over and over in .txt files, and a smart algorithm will store an abbreviated description of those patterns rather than the patterns themselves. Which, conceptually, is almost exactly what happens when you run a PCA on a dataset: you're searching for consistent patterns in the way observations vary along multiple variables, and discarding any redundancy you come across in favor of a more compact description.

Now, in a very real sense, compression is impressive. It’s certainly nice to be able to email your friend a 140kb .zip of your 1200-page novel rather than a 2mb .doc. But note that you don’t actually learn much from the compression. It’s not like your friend can open up that 140k binary representation of your novel, read it, and spare herself the torture of the other 1860kb. If you want to understand what’s going on in a novel, you need to read the novel and think about the novel. And if you want to understand what’s going on in a set of correlations between different cognitive tasks, you need to carefully inspect those correlations and carefully think about those correlations. You can run a factor analysis if you like, and you might learn something, but you’re not going to get any deep insights into the “true” structure of the data. The “true” structure of the data is, by definition, what you started out with (give or take some error). When you run a PCA, you actually get a distorted (but simpler!) picture of the data.

To most people who use PCA, or other data reduction techniques, this isn’t a novel insight by any means. Most everyone who uses PCA knows that in an obvious sense you’re distorting the structure of the data when you reduce its dimensionality. But the use of data reduction is often defended by noting that there must be some reason why variables hang together in such a way that they can be reduced to a much smaller set of variables with relatively little loss of variance. In the context of intelligence, the intuition can be expressed as: if there wasn’t really a single factor underlying intelligence, why would we get such a strong first factor? After all, it didn’t have to turn out that way; we could have gotten lots of smaller factors that appear to reflect distinct types of ability, like verbal intelligence, spatial intelligence, perceptual speed, and so on. But it did turn out that way, so that tells us something important about the unitary nature of intelligence.

This is a strangely compelling argument, but it turns out to be only minimally true. What the presence of a strong first factor does tell you is that you have a lot of positively correlated variables in your data set. To be fair, that is informative. But it’s only minimally informative, because, assuming you eyeballed the correlation matrix in the original data, you already knew that.

What you don’t know, and can’t know, on the basis of a PCA, is what underlying causal structure actually generated the observed positive correlations between your variables. It’s certainly possible that there’s really only one central intelligence system that contributes the bulk of the variance to lots of different cognitive tasks. That’s the g model, and it’s entirely consistent with the empirical data. Unfortunately, it’s not the only one. To the contrary, there are an infinite number of possible causal models that would be consistent with any given factor structure derived from a PCA, including a structure dominated by a strong first factor. In fact, you can have a causal structure with as many variables as you like be consistent with g-like data. So long as the variables in your model all make contributions in the same direction to the observed variables, you will tend to end up with an excessively strong first factor. So you could in principle have 3,000 distinct systems in the human brain, all completely independent of one another, and all of which contribute relatively modestly to a bunch of different cognitive tasks. And you could still get a first factor that accounts for 50% or more of the variance. No g required.

If you doubt this is true, go read Cosma Shalizi's post, where he not only walks you through a more detailed explanation of the mathematical necessity of this claim, but also illustrates the point using some very simple simulations. Basically, he builds a toy model in which 11 different tasks each draw on several hundred underlying cognitive abilities, which are in turn drawn from a larger pool of 2,766 completely independent abilities. He then runs a PCA on the data and finds, lo and behold, a single factor that explains nearly 50% of the variance in scores. Using PCA, it turns out, you can get something huge from (almost) nothing.
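
(If you don't feel like wading through Shalizi's code, here's a stripped-down R version of the same idea. The specific numbers are mine, not his, but the punchline is the same: thousands of completely independent abilities, each task summing over a big random subset of them, and out pops a dominant first component.)

# 2,500 completely independent abilities, 1,000 simulated subjects
set.seed(1)
abilities = matrix(rnorm(1000 * 2500), ncol=2500)
# each of 11 "tasks" is the sum of a random subset of 1,200 abilities,
# all contributing in the same (positive) direction
tasks = sapply(1:11, function(i) rowSums(abilities[, sample(2500, 1200)]))
pca = prcomp(tasks, scale.=TRUE)
round(pca$sdev[1]^2 / sum(pca$sdev^2), 2)  # first component: roughly half the variance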

Now, at this point a proponent of a unitary g might say, sure, it's possible that there isn't really a single cognitive system underlying variation in intelligence; but it's not plausible, because it's surely more parsimonious to posit a model with just one variable than a model with 2,766. But that's only true if you think that our brains evolved in order to make life easier for psychometricians, which, last I checked, wasn't the case. If you think even a little bit about what we know about the biological and genetic bases of human cognition, it starts to seem really unlikely that there really could be a single central intelligence system. For starters, the evidence just doesn't support it. In the cognitive neuroscience literature, for example, biomarkers of intelligence abound, and they just don't seem all that related. There's a really nice paper in Nature Reviews Neuroscience this month by Deary, Penke, and Johnson that reviews a substantial portion of the literature on intelligence; the upshot is that intelligence has lots of different correlates. For example, people who score highly on intelligence tests tend to (a) have larger brains overall; (b) show regional differences in brain volume; (c) show differences in neural efficiency when performing cognitive tasks; (d) have greater white matter integrity; (e) have brains with more efficient network structures; and so on.

These phenomena may not all be completely independent, but it’s hard to believe there’s any plausible story you could tell that renders them all part of some unitary intelligence system, or subject to unitary genetic influence. And really, why should they be part of a unitary system? Is there really any reason to think there has to be a single rate-limiting factor on performance? It’s surely perfectly plausible (I’d argue, much more plausible) to think that almost any complex cognitive task you use as an index of intelligence is going to draw on many, many different cognitive abilities. Take a trivial example: individual differences in visual acuity probably make a (very) small contribution to performance on many different cognitive tasks. If you can’t see the minute details of the stimuli as well as the next person, you might perform slightly worse on the task. So some variance in putatively “cognitive” task performance undoubtedly reflects abilities that most intelligence researchers wouldn’t really consider properly reflective of higher cognition at all. And yet, that variance has to go somewhere when you run a factor analysis. Most likely, it’ll go straight into that first factor, or g, since it’s variance that’s common to multiple tasks (i.e., someone with poorer eyesight may tend to do very slightly worse on any task that requires visual attention). In fact, any ability that makes unidirectional contributions to task performance, no matter how relevant or irrelevant to the conceptual definition of intelligence, will inflate the so-called g factor.

If this still seems counter-intuitive to you, here’s an analogy that might, to borrow Dan Dennett’s phrase, prime your intuition pump (it isn’t as dirty as it sounds). Imagine that instead of studying the relationship between different cognitive tasks, we decided to study the relation between performance at different sports. So we went out and rounded up 500 healthy young adults and had them engage in 16 different sports, including basketball, soccer, hockey, long-distance running, short-distance running, swimming, and so on. We then took performance scores for all 16 tasks and submitted them to a PCA. What do you think would happen? I’d be willing to bet good money that you’d get a strong first factor, just like with cognitive tasks. In other words, just like with g, you’d have one latent variable that seemed to explain the bulk of the variance in lots of different sports-related abilities. And just like g, it would have an easy and parsimonious interpretation: a general factor of athleticism!

Of course, in a trivial sense, you'd be right to call it that. I doubt anyone's going to deny that some people just are more athletic than others. But if you then ask, “well, what's the mechanism that underlies athleticism,” it's suddenly much less plausible to think that there's a single physiological variable or pathway that supports athleticism. In fact, it seems flatly absurd. You can easily think of dozens if not hundreds of factors that should contribute a small amount of the variance to performance on multiple sports. To name just a few: height, jumping ability, running speed, oxygen capacity, fine motor control, gross motor control, perceptual speed, response time, balance, and so on and so forth. And most of these are individually still relatively high-level abilities that break down further at the physiological level (e.g., “balance” is itself a complex trait that at minimum reflects contributions of the vestibular, visual, and cerebellar systems, and so on). If you go down that road, it very quickly becomes obvious that you're just not going to find a unitary mechanism that explains athletic ability. Because it doesn't exist.

All of this isn’t to say that intelligence (or athleticism) isn’t “real”. Intelligence and athleticism are perfectly real; it makes complete sense, and is factually defensible, to talk about some people being smarter or more athletic than other people. But the point is that those judgments are based on superficial observations of behavior; knowing that people’s intelligence or athleticism may express itself in a (relatively) unitary fashion doesn’t tell you anything at all about the underlying causal mechanisms–how many of them there are, or how they interact.

As Cosma Shalizi notes, it also doesn’t tell you anything about heritability or malleability. The fact that we tend to think intelligence is highly heritable doesn’t provide any evidence in favor of a unitary underlying mechanism; it’s just as plausible to think that there are many, many individual abilities that contribute to complex cognitive behavior, all of which are also highly heritable individually. Similarly, there’s no reason to think our cognitive abilities would be any less or any more malleable depending on whether they reflect the operation of a single system or hundreds of variables. Regular physical exercise clearly improves people’s capacity to carry out all sorts of different activities, but that doesn’t mean you’re only training up a single physiological pathway when you exercise; a whole host of changes are taking place throughout your body.

So, assuming you buy the basic argument, where does that leave us? Depends. From a day-to-day standpoint, nothing changes. You can go on telling your friends that so-and-so is a terrific athlete but not the brightest crayon in the box, and your friends will go on understanding exactly what you meant. No one’s suggesting that intelligence isn’t stable and trait-like, just that, at the biological level, it isn’t really one stable trait.

The real impact of relaxing the view that g is a meaningful construct at the biological level, I think, will be in removing an artificial and overly restrictive constraint on researchers’ theorizing. The sense I get, having done some work on executive control, is that g is the 800-pound gorilla in the room: researchers interested in studying the neural bases of intelligence (or related constructs like executive or cognitive control) are always worrying about how their findings relate to g, and how to explain the fact that there might be dissociable neural correlates of different abilities (or even multiple independent contributions to fluid intelligence). To show you that I’m not making this concern up, and that it weighs heavily on many researchers, here’s a quote from the aforementioned and otherwise really excellent NRN paper by Deary et al reviewing recent findings on the neural bases of intelligence:

The neuroscience of intelligence is constrained by — and must explain — the following established facts about cognitive test performance: about half of the variance across varied cognitive tests is contained in general cognitive ability; much less variance is contained within broad domains of capability; there is some variance in specific abilities; and there are distinct ageing patterns for so-called fluid and crystallized aspects of cognitive ability.

The existence of g creates a complicated situation for neuroscience. The fact that g contributes substantial variance to all specific cognitive ability tests is generally thought to indicate that g contributes directly in some way to performance on those tests. That is, when domains of thinking skill (such as executive function and memory) or specific tasks (such as mental arithmetic and non-verbal reasoning on the Raven’s Progressive Matrices test) are studied, neuroscientists are observing brain activity related to g as well as the specific task activities. This undermines the ability to determine localized brain activities that are specific to the task at hand.

I hope I’ve convinced you by this point that the neuroscience of intelligence doesn’t have to explain why half of the variance is contained in general cognitive ability, because there’s no good evidence that there is such a thing as general cognitive ability (except in the descriptive psychometric sense, which carries no biological weight). Relaxing this artificial constraint would allow researchers to get on with the interesting and important business of identifying correlates (and potential causal determinants) of different cognitive abilities without having to worry about the relation of their finding to some Grand Theory of Intelligence. If you believe in g, you’re going to be at a complete loss to explain how researchers can continually identify new biological and genetic correlates of intelligence, and how the effect sizes could be so small (particularly at a genetic level, where no one’s identified a single polymorphism that accounts for more than a fraction of the observable variance in intelligence–the so called problem of “missing heritability”). But once you discard the fiction of g, you can take such findings in stride, and can set about the business of building integrative models that allow for and explicitly model the presence of multiple independent contributions to intelligence. And if studying the brain has taught us anything at all, it’s that the truth is inevitably more complicated than what we’d like to believe.

functional MRI and the many varieties of reliability

Craig Bennett and Mike Miller have a new paper on the reliability of fMRI. It’s a nice review that I think most people who work with fMRI will want to read. Bennett and Miller discuss a number of issues related to reliability, including why we should care about the reliability of fMRI, what factors influence reliability, how to obtain estimates of fMRI reliability, and what previous studies suggest about the reliability of fMRI. Their bottom line is that the reliability of fMRI often leaves something to be desired:

One thing is abundantly clear: fMRI is an effective research tool that has opened broad new horizons of investigation to scientists around the world. However, the results from fMRI research may be somewhat less reliable than many researchers implicitly believe. While it may be frustrating to know that fMRI results are not perfectly replicable, it is beneficial to take a longer-term view regarding the scientific impact of these studies. In neuroimaging, as in other scientific fields, errors will be made and some results will not replicate.

I think this is a wholly appropriate conclusion, and strongly recommend reading the entire article. Because there's already a nice write-up of the paper over at Mind Hacks, I'll content myself with adding a number of points to B&M's discussion (I talk about some of these same issues in a chapter I wrote with Todd Braver).

First, even though I agree enthusiastically with the gist of B&M’s conclusion, it’s worth noting that, strictly speaking, there’s actually no such thing as “the reliability of fMRI”. Reliability isn’t a property of a technique or instrument, it’s a property of a specific measurement. Because every measurement is made under slightly different conditions, reliability will inevitably vary on a case-by-case basis. But since it’s not really practical (or even possible) to estimate reliability for every single analysis, researchers take necessary short-cuts. The standard in the psychometric literature is to establish reliability on a per-measure (not per-method!) basis, so long as conditions don’t vary too dramatically across samples. For example, once someone “validates” a given self-report measure, it’s generally taken for granted that that measure is “reliable”, and most people feel comfortable administering it to new samples without having to go to the trouble of estimating reliability themselves. That’s a perfectly reasonable approach, but the critical point is that it’s done on a relatively specific basis. Supposing you made up a new self-report measure of depression from a set of items you cobbled together yourself, you wouldn’t be entitled to conclude that your measure was reliable simply because some other self-report measure of depression had already been psychometrically validated. You’d be using an entirely new set of items, so you’d have to go to the trouble of validating your instrument anew.

By the same token, the reliability of any given fMRI measurement is going to fluctuate wildly depending on the task used, the timing of events, and many other factors. That’s not just because some estimates of reliability are better than others; it’s because there just isn’t a fact of the matter about what the “true” reliability of fMRI is. Rather, there are facts about how reliable fMRI is for specific types of tasks with specific acquisition parameters and preprocessing streams in specific scanners, and so on (which can then be summarized by talking about the general distribution of fMRI reliabilities). B&M are well aware of this point, and discuss it in some detail, but I think it’s worth emphasizing that when they say that “the results from fMRI research may be somewhat less reliable than many researchers implicitly believe,” what they mean isn’t that the “true” reliability of fMRI is likely to be around .5; rather, it’s that if you look at reliability estimates across a bunch of different studies and analyses, the estimated reliability is often low. But it’s not really possible to generalize from this overall estimate to any particular study; ultimately, if you want to know whether your data were measured reliably, you need to quantify that yourself. So the take-away message shouldn’t be that fMRI is an inherently unreliable method (and I really hope that isn’t how B&M’s findings get reported by the mainstream media should they get picked up), but rather, that there’s a very good chance that the reliability of fMRI in any given situation is not particularly high. It’s a subtle difference, but an important one.

Second, there's a common misconception that reliability estimates impose an upper bound on the true detectable effect size. B&M make this point in their review, Vul et al made it in their “voodoo correlations” paper, and in fact, I've made it myself before. But it's actually not quite correct. It's true that, for any given test, the true reliability of the variables involved limits the potential size of the observable effect. But there are many different types of reliability, and most will generally only be appropriate and informative for a subset of statistical procedures. Virtually all types of reliability estimate will underestimate the true reliability in some cases and overestimate it in others. And in extreme cases, there may be close to zero relationship between the estimate and the truth.

To see this, take the following example, which focuses on internal consistency. Suppose you have two completely uncorrelated items, and you decide to administer them together as a single scale by simply summing up their scores. For example, let’s say you have an item assessing shoelace-tying ability, and another assessing how well people like the color blue, and you decide to create a shoelace-tying-and-blue-preferring measure. Now, this measure is clearly nonsensical, in that it’s unlikely to predict anything you’d ever care about. More important for our purposes, its internal consistency would be zero, because its items are (by hypothesis) uncorrelated, so it’s not measuring anything coherent. But that doesn’t mean the measure is unreliable! So long as the constituent items are each individually measured reliably, the true reliability of the total score could potentially be quite high, and even perfect. In other words, if I can measure your shoelace-tying ability and your blueness-liking with perfect reliability, then by definition, I can measure any linear combination of those two things with perfect reliability as well. The result wouldn’t mean anything, and the measure would have no validity, but from a reliability standpoint, it’d be impeccable. This problem of underestimating reliability when items are heterogeneous has been discussed in the psychometric literature for at least 70 years, and yet you still very commonly see people do questionable things like “correcting for attenuation” based on dubious internal consistency estimates.
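
(Here's the shoelace-tying-and-blue-preferring example as a quick simulation, using the alpha() function from the psych package, assuming I'm remembering its interface correctly; everything about the "data" is obviously made up.)

# two independent traits, each measured twice with only a little error
library(psych)
set.seed(1)
n = 1000
shoelace = rnorm(n)   # true shoelace-tying ability
blueness = rnorm(n)   # true liking of the color blue, independent of the above
shoe.t1 = shoelace + rnorm(n, sd=.2); shoe.t2 = shoelace + rnorm(n, sd=.2)
blue.t1 = blueness + rnorm(n, sd=.2); blue.t2 = blueness + rnorm(n, sd=.2)
# internal consistency of the two-item composite is ~0, because the items don't correlate
alpha(data.frame(shoe.t1, blue.t1))$total$raw_alpha
# but the test-retest reliability of the summed score is excellent (~.96 here)
cor(shoe.t1 + blue.t1, shoe.t2 + blue.t2)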

In their review, B&M mostly focus on test-retest reliability rather than internal consistency, but the same general point applies. Test-retest reliability is the degree to which people’s scores on some variable are consistent across multiple testing occasions. The intuition is that, if the rank-ordering of scores varies substantially across occasions (e.g., if the people who show the highest activation of visual cortex at Time 1 aren’t the same ones who show the highest activation at Time 2), the measurement must not have been reliable, so you can’t trust any effects that are larger than the estimated test-retest reliability coefficient. The problem with this intuition is that there can be any number of systematic yet session-specific influences on a person’s score on some variable (e.g., activation level). For example, let’s say you’re doing a study looking at the relation between performance on a difficult working memory task and frontoparietal activation during the same task. Suppose you do the exact same experiment with the same subjects on two separate occasions three weeks apart, and it turns out that the correlation between DLPFC activation across the two occasions is only .3. A simplistic view would be that this means that the reliability of DLPFC activation is only .3, so you couldn’t possibly detect any correlations between performance level and activation greater than .3 in DLPFC. But that’s simply not true. It could, for example, be that the DLPFC response during WM performance is perfectly reliable, but is heavily dependent on session-specific factors such as baseline fatigue levels, motivation, and so on. In other words, there might be a very strong and perfectly “real” correlation between WM performance and DLPFC activation on each of the two testing occasions, even though there’s very little consistency across the two occasions. Test-retest reliability estimates only tell you how much of the signal is reliably due to temporally stable variables, and not how much of the signal is reliable, period.
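
(A little simulation makes the distinction concrete; the numbers are invented, but the structure is exactly the one described above: activation reflects a stable trait plus large session-specific influences, and performance tracks activation within each session.)

set.seed(1)
n = 200
trait = rnorm(n)                    # the temporally stable part of the DLPFC response
session1 = trait + rnorm(n, sd=2)   # session-specific influences swamp the stable part
session2 = trait + rnorm(n, sd=2)
perf1 = session1 + rnorm(n, sd=.5)  # WM performance tracks that session's activation
perf2 = session2 + rnorm(n, sd=.5)
cor(session1, session2)  # test-retest reliability of activation: only ~.2
cor(perf1, session1)     # performance-activation correlation within session 1: ~.97
cor(perf2, session2)     # ...and within session 2: also ~.97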

The general point is that you can’t just report any estimate of reliability that you like (or that’s easy to calculate) and assume that tells you anything meaningful about the likelihood of your analyses succeeding. You have to think hard about exactly what kind of reliability you care about, and then come up with an estimate to match that. There’s a reasonable argument to be made that most of the estimates of fMRI reliability reported to date are actually not all that relevant to many people’s analyses, because the majority of reliability analyses have focused on test-retest reliability, which is only an appropriate way to estimate reliability if you’re trying to relate fMRI activation to stable trait measures (e.g., personality or cognitive ability). If you’re interested in relating in-scanner task performance or state-dependent variables (e.g., mood) to brain activation (arguably the more common approach), or if you’re conducting within-subject analyses that focus on comparisons between conditions, using test-retest reliability isn’t particularly informative, and you really need to focus on other types of reliability (or reproducibility).

Third, and related to the above point, between-subject and within-subject reliability are often in statistical tension with one another. B&M don't talk about this, as far as I can tell, but it's an important point to remember when designing studies and/or conducting analyses. Essentially, the issue is that what counts as error depends on what effects you're interested in. If you're interested in individual differences, it's within-subject variance that counts as error, so you want to minimize that. Conversely, if you're interested in within-subject effects (the norm in fMRI), you want to minimize between-subject variance. But you generally can't do both of these at the same time. If you use a very “strong” experimental manipulation (i.e., a task that produces a very large difference between conditions for virtually all subjects), you're going to reduce the variability between individuals, and you may very well end up with very low test-retest reliability estimates. And that would actually be a good thing! Conversely, if you use a “weak” experimental manipulation, you might get no mean effect at all, because there'll be much more variability between individuals. There's no right or wrong here; the trick is to pick a design that matches the focus of your study. In the context of reliability, the essential point is that if all you're interested in is the contrast between high and low working memory load, it shouldn't necessarily bother you if someone tells you that the test-retest reliability of induced activation in your study is close to zero. Conversely, if you care about individual differences, it shouldn't worry you if activations aren't reproducible across studies at the group level. In some ways, those are actually the ideal situations for each of those two types of studies.
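
(Here's a crude way to see the trade-off in simulated data; all the variances are invented, but the logic is just the one described above.)

set.seed(1)
n = 100
# "strong" manipulation: big fixed effect, tiny individual differences in the contrast
strong = 3 + rnorm(n, sd=.2)
strong.t1 = strong + rnorm(n); strong.t2 = strong + rnorm(n)
# "weak" manipulation: no mean effect, but big individual differences in the contrast
weak = rnorm(n, sd=1.5)
weak.t1 = weak + rnorm(n); weak.t2 = weak + rnorm(n)
t.test(strong.t1)$p.value  # group-level effect: overwhelming for the strong manipulation
t.test(weak.t1)$p.value    # and nothing much for the weak one (its true mean is zero)
cor(strong.t1, strong.t2)  # test-retest reliability of the contrast: poor (true r ~ .04)
cor(weak.t1, weak.t2)      # but respectable for the weak manipulation (true r ~ .69)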

Lastly, B&M raise a question as to what level of reliability we should consider “acceptable” for fMRI research:

There is no consensus value regarding what constitutes an acceptable level of reliability in fMRI. Is an ICC value of 0.50 enough? Should studies be required to achieve an ICC of 0.70? All of the studies in the review simply reported what the reliability values were. Few studies proposed any kind of criteria to be considered a ‘reliable’ result. Cicchetti and Sparrow did propose some qualitative descriptions of data based on the ICC-derived reliability of results (1981). They proposed that results with an ICC above 0.75 be considered ‘excellent’, results between 0.59 and 0.75 be considered ‘good’, results between .40 and .58 be considered ‘fair’, and results lower than 0.40 be considered ‘poor’. More specifically to neuroimaging, Eaton et al. (2008) used a threshold of ICC > 0.4 as the mask value for their study while Aron et al. (2006) used an ICC cutoff of ICC > 0.5 as the mask value.

On this point, I don’t really see any reason to depart from psychometric convention just because we’re using fMRI rather than some other technique. Conventionally, reliability estimates of around .8 (or maybe .7, if you’re feeling generous) are considered adequate. Any lower and you start to run into problems, because effect sizes will shrivel up. So I think we should be striving to attain the same levels of reliability with fMRI as with any other measure. If it turns out that that’s not possible, we’ll have to live with that, but I don’t think the solution is to conclude that reliability estimates on the order of .5 are ok “for fMRI” (I’m not saying that’s what B&M say, just that that’s what we should be careful not to conclude). Rather, we should just accept that the odds of detecting certain kinds of effects with fMRI are probably going to be lower than with other techniques. And maybe we should minimize the use of fMRI for those types of analyses where reliability is generally not so good (e.g., using brain activation to predict trait variables over long intervals).
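
(If you want to poke at ICCs yourself, the psych package has an ICC() function that reports the standard Shrout and Fleiss variants, if I'm remembering correctly. Here's a toy example with made-up activation values for 100 subjects scanned twice; the true reliability in this little simulation works out to about .8, i.e., right around the conventional target.)

library(psych)
set.seed(1)
true.activation = rnorm(100)                  # each subject's "true" level of activation
scan1 = true.activation + rnorm(100, sd=.5)   # measured with some noise at session 1
scan2 = true.activation + rnorm(100, sd=.5)   # and again at session 2
ICC(cbind(scan1, scan2))  # the single-measure ICCs should come out around .8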

I hasten to point out that none of this should be taken as a criticism of B&M’s paper; I think all of these points complement B&M’s discussion, and don’t detract in any way from its overall importance. Reliability is a big topic, and there’s no way Bennett and Miller could say everything there is to be said about it in one paper. I think they’ve done the field of cognitive neuroscience an important service by raising awareness and providing an accessible overview of some of the issues surrounding reliability, and it’s certainly a paper that’s going on my “essential readings in fMRI methods” list.

Bennett, C. M., & Miller, M. B. (2010). How reliable are the results from functional magnetic resonance imaging? Annals of the New York Academy of Sciences.

Kahneman on happiness

The latest TED talk is an instant favorite of mine. Daniel Kahneman talks about the striking differences in the way we experience versus remember events:

It’s an entertaining and profoundly insightful 20-minute talk, and worth watching even if you think you’ve heard these ideas before.

The fundamental problem Kahneman discusses is that we all experience our lives on a moment-by-moment basis, and yet we make decisions based on our memories of the past. Unfortunately, it turns out that the experiencing self and the remembering self don't necessarily agree about what things make us happy, and so we often end up in situations where we voluntarily make choices that actually substantially reduce our experienced utility. I won't give away the examples Kahneman talks about, other than to say that they beautifully illustrate the relevance of psychology (or at least some branches of psychology) to the real-world decisions we all make–both the trivial, day-to-day variety, and the rarer, life-or-death kind.

As an aside, Kahneman gave a talk at Brain Camp (or, officially, the annual Summer Institute in Cognitive Neuroscience, which may now be defunct–or perhaps only on hiatus?) the year I attended. There were a lot of great talks that year, but Kahneman’s really stood out for me, despite the fact that he hardly talked about research at all. It was more of a meditation on the scientific method–how to go about building and testing new theories. You don’t often hear a Nobel Prize winner tell an audience that the work that won the Nobel Prize was completely wrong, but that’s essentially what Kahneman claimed. Of course, his point wasn’t that Prospect Theory was useless, but rather, that many of the holes and limitations of the theory that people have gleefully pointed out over the last three decades were already well-recognized at the time the original findings were published. Kahneman and Tversky’s goal wasn’t to produce a perfect description or explanation of the mechanisms underlying human decision-making, but rather, an approximation that made certain important facts about human decision-making clear (e.g., the fact that people simply don’t follow the theory of Expected Utility), and opened the door to entirely new avenues of research. Kahneman seemed to think that ultimately what we really want isn’t a protracted series of incremental updates to Prospect Theory, but a more radical paradigm shift, and that in that sense, clinging to Prospect Theory might now actually be impeding progress.

You might think that's a pretty pessimistic message–“hey, you can win a Nobel Prize for being completely wrong!”–but it really wasn't; I actually found it quite uplifting (if Daniel Kahneman feels comfortable being mostly wrong about his ideas, why should the rest of us get attached to ours?). At least, that's the way I remember it now. But that talk was nearly three years ago, you see, so my actual experience at the time may have been quite different. Turns out you can't really trust my remembering self; it'll tell you anything it thinks I want to hear.