The homogenization of scientific computing, or why Python is steadily eating other languages’ lunch

Over the past two years, my scientific computing toolbox been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:

  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/Numpy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.

In 2013, my toolbox looks like this:

  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (MatPlotLib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.

You may notice a theme here.

The increasing homogenization (Pythonification?) of the tools I use on a regular basis primarily reflects the spectacular recent growth of the Python ecosystem. A few years ago, you couldn’t really do statistics in Python unless you wanted to spend most of your time pulling your hair out and wishing Python were more like R (which, is a pretty remarkable confession considering what R is like). Neuroimaging data could be analyzed in SPM (MATLAB-based), FSL, or a variety of other packages, but there was no viable full-featured, free, open-source Python alternative. Packages for machine learning, natural language processing, web application development, were only just starting to emerge.

These days, tools for almost every aspect of scientific computing are readily available in Python. And in a growing number of cases, they’re eating the competition’s lunch.

Take R, for example. R’s out-of-the-box performance with out-of-memory datasets has long been recognized as its achilles heel (yes, I’m aware you can get around that if you’re willing to invest the time–but not many scientists have the time). But even people who hated the way R chokes on large datasets, and its general clunkiness as a language, often couldn’t help running back to R as soon as any kind of serious data manipulation was required. You could always laboriously write code in Python or some other high-level language to pivot, aggregate, reshape, and otherwise pulverize your data, but why would you want to? The beauty of packages like plyr in R was that you could, in a matter of 2 – 3 lines of code, perform enormously powerful operations that could take hours to duplicate in other languages. The downside was the intensive learning curve associated with learning each package’s often quite complicated API (e.g., ggplot2 is incredibly expressive, but every time I stop using ggplot2 for 3 months, I have to completely re-learn it), and having to contend with R’s general awkwardness. But still, on the whole, it was clearly worth it.

Flash forward to The Now. Last week, someone asked me for some simulation code I’d written in R a couple of years ago. As I was firing up R Studio to dig around for it, I realized that I hadn’t actually fired up R studio for a very long time prior to that moment–probably not in about 6 months. The combination of NumPy/SciPy, MatPlotLib, pandas and statmodels had effectively replaced R for me, and I hadn’t even noticed. At some point I just stopped dropping out of Python and into R whenever I had to do the “real” data analysis. Instead, I just started importing pandas and statsmodels into my code. The same goes for machine learning (scikit-learn), natural language processing (nltk), document parsing (BeautifulSoup), and many other things I used to do outside Python.

It turns out that the benefits of doing all of your development and analysis in one language are quite substantial. For one thing, when you can do everything in the same language, you don’t have to suffer the constant cognitive switch costs of reminding yourself say, that Ruby uses blocks instead of comprehensions, or that you need to call len(array) instead of array.length to get the size of an array in Python; you can just keep solving the problem you’re trying to solve with as little cognitive overhead as possible. Also, you no longer need to worry about interfacing between different languages used for different parts of a project. Nothing is more annoying than parsing some text data in Python, finally getting it into the format you want internally, and then realizing you have to write it out to disk in a different format so that you can hand it off to R or MATLAB for some other set of analyses*. In isolation, this kind of thing is not a big deal. It doesn’t take very long to write out a CSV or JSON file from Python and then read it into R. But it does add up. It makes integrated development more complicated, because you end up with more code scattered around your drive in more locations (well, at least if you have my organizational skills). It means you spend a non-negligible portion of your “analysis” time writing trivial little wrappers for all that interface stuff, instead of thinking deeply about how to actually transform and manipulate your data. And it means that your beautiful analytics code is marred by all sorts of ugly open() and read() I/O calls. All of this overhead vanishes as soon as you move to a single language.

Convenience aside, another thing that’s impressive about the Python scientific computing ecosystem is that a surprising number of Python-based tools are now best-in-class (or close to it) in terms of scope and ease of use–and, in virtue of C bindings, often even in terms of performance. It’s hard to imagine an easier-to-use machine learning package than scikit-learn, even before you factor in the breadth of implemented algorithms, excellent documentation, and outstanding performance. Similarly, I haven’t missed any of the data manipulation functionality in R since I switched to pandas. Actually, I’ve discovered many new tricks in pandas I didn’t know in R (some of which I’ll describe in an upcoming post). Considering that pandas considerably outperforms R for many common operations, the reasons for me to switch back to R or other tools–even occasionally–have dwindled.

Mind you, I don’t mean to imply that Python can now do everything anyone could ever do in other languages. That’s obviously not true. For instance, there are currently no viable replacements for many of the thousands of statistical packages users have contributed to R (if there’s a good analog for lme4 in Python, I’d love to know about it). In signal processing, I gather that many people are wedded to various MATLAB toolboxes and packages that don’t have good analogs within the Python ecosystem. And for people who need serious performance and work with very, very large datasets, there’s often still no substitute for writing highly optimized code in a low-level compiled language. So, clearly, what I’m saying here won’t apply to everyone. But I suspect it applies to the majority of scientists.

Speaking only for myself, I’ve now arrived at the point where around 90 – 95% of what I do can be done comfortably in Python. So the major consideration for me, when determining what language to use for a new project, has shifted from what’s the best tool for the job that I’m willing to learn and/or tolerate using? to is there really no way to do this in Python? By and large, this mentality is a good thing, though I won’t deny that it occasionally has its downsides. For example, back when I did most of my data analysis in R, I would frequently play around with random statistics packages just to see what they did. I don’t do that much any more, because the pain of having to refresh my R knowledge and deal with that thing again usually outweighs the perceived benefits of aimless statistical exploration. Conversely, sometimes I end up using Python packages that I don’t like quite as much as comparable packages in other languages, simply for the sake of preserving language purity. For example, I prefer Rails’ ActiveRecord ORM to the much more explicit SQLAlchemy ORM for Python–but I don’t prefer to it enough to justify mixing Ruby and Python objects in the same application. So, clearly, there are costs. But they’re pretty small costs, and for me personally, the scales have now clearly tipped in favor of using Python for almost everything. I know many other researchers who’ve had the same experience, and I don’t think it’s entirely unfair to suggest that, at this point, Python has become the de facto language of scientific computing in many domains. If you’re reading this and haven’t had much prior exposure to Python, now’s a great time to come on board!

Postscript: In the period of time between starting this post and finishing it (two sessions spread about two weeks apart), I discovered not one but two new Python-based packages for data visualization: Michael Waskom’s seaborn package–which provides very high-level wrappers for complex plots, with a beautiful ggplot2-like aesthetic–and Continuum Analytics’ bokeh, which looks like a potential game-changer for web-based visualization**. At the rate the Python ecosystem is moving, there’s a non-zero chance that by the time you read this, I’ll be using some new Python package that directly transliterates my thoughts into analytics code.

 

* I’m aware that there are various interfaces between Python, R, etc. that allow you to internally pass objects between these languages. My experience with these has not been overwhelmingly positive, and in any case they still introduce all the overhead of writing extra lines of code and having to deal with multiple languages.

** Yes, you heard right: web-based visualization in Python. Bokeh generates static JavaScript and JSON for you from Python code, so  your users are magically able to interact with your plots on a webpage without you having to write a single line of native JS code.

R, the master troll of statistical languages

Warning: what follows is a somewhat technical discussion of my love-hate relationship with the R statistical language, in which I somehow manage to waste 2,400 words talking about a single line of code. Reader discretion is advised.

I’ve been using R to do most of my statistical analysis for about 7 or 8 years now–ever since I was a newbie grad student and one of the senior grad students in my lab introduced me to it. Despite having spent hundreds (thousands?) of hours in R, I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of Bacon Google when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.

That said, if I’ve learned one thing about R, it’s that R is all about flexibility: almost any task can be accomplished in a dozen different ways. I don’t mean that in the trivial sense that pretty much any substantive programming problem can be solved in any number of ways in just about any language; I mean that for even very simple and well-defined tasks involving just one or two lines of code there are often many different approaches.

To illustrate, consider the simple task of selecting a column from a data frame (data frames in R are basically just fancy tables). Suppose you have a dataset that looks like this:

In most languages, there would be one standard way of pulling columns out of this table. Just one unambiguous way: if you don’t know it, you won’t be able to work with data at all, so odds are you’re going to learn it pretty quickly. R doesn’t work that way. In R there are many ways to do almost everything, including selecting a column from a data frame (one of the most basic operations imaginable!). Here are four of them:

 

I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).

This type of flexibility enables incredibly powerful, terse code once you know R reasonably well; unfortunately, it also makes for an extremely steep learning curve. You might wonder why that would be–after all, at its core, R still lets you do things the way most other languages do them. In the above example, you don’t have to use anything other than the simple index-based approach (i.e., data[,1]), which is the way most other languages that have some kind of data table or matrix object (e.g., MATLAB, Python/NumPy, etc.) would prefer you to do it. So why should the extra flexibility present any problems?

The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–and nothing is more frustrating to a newbie when learning a language than trying to figure out why sometimes people select columns in a data frame by index and other times they select them by name, or why sometimes people refer to named properties with a dollar sign and other times they wrap them in a vector or double square brackets. There are good reasons to have all of these different idioms, but you wouldn’t know that if you’re new to R and your expectation, quite reasonably, is that if two expressions look very different, they should do very different things. The flexibility that experienced R users love is very confusing to a newcomer. Most other languages don’t have that problem, because there’s only one way to do everything (or at least, far fewer ways than in R).

Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all. But I was reminded of the steepness of that initial learning curve the other day while helping my wife use R to do some regression analyses for her thesis. Rather than explaining what she was doing, suffice it to say that she needed to write a function that, among other things, takes a data frame as input and retains only the numeric columns for subsequent analysis. Data frames in R are actually lists under the hood, so they can have mixed types (i.e., you can have string columns and numeric columns and factors all in the same data frame; R lists basically work like hashes or dictionaries in other loosely-typed languages like Python or Ruby). So you can run into problems if you haphazardly try to perform numerical computations on non-numerical columns (e.g., good luck computing the mean of ‘cat’, ‘dog’, and ‘giraffe’), and hence, pre-emptive selection of only the valid numeric columns is required.

Now, in most languages (including R), you can solve this problem very easily using a loop. In fact, in many languages, you would have to use an explicit for-loop; there wouldn’t be any other way to do it. In R, you might do it like this*:

numeric_cols = rep(FALSE, ncol(ice.cream))
for (i in 1:ncol(ice.cream)) numeric_cols[i] = is.numeric(ice.cream[,i])

We allocate memory for the result, then loop over each column and check whether or not it’s numeric, saving the result. Once we’ve done that, we can select only the numeric columns from our data frame with data[,numeric_cols].

This is a perfectly sensible way to solve the problem, and as you can see, it’s not particularly onerous to write out. But of course, no self-respecting R user would write an explicit loop that way, because R provides you with any number of other tools to do the job more efficiently. So instead of saying “just loop over the columns and check if is.numeric() is true for each one,” when my wife asked me how to solve her problem, I cleverly said “use apply(), of course!”

apply() is an incredibly useful built-in function that implicitly loops over one or more margins of a matrix; in theory, you should be able to do the same work as the above two lines of code with just the following one line:

apply(ice.cream, 2, is.numeric)

Here the first argument is the data we’re passing in, the third argument is the function we want to apply to the data (is.numeric()), and the second argument is the margin over which we want to apply that function (1 = rows, 2 = columns, etc.). And just like that, we’ve cut the length of our code in half!

Unfortunately, when my wife tried to use apply(), her script broke. It didn’t break in any obvious way, mind you (i.e., with a crash and an error message); instead, the apply() call returned a perfectly good vector. It’s just that all of the values in that vector were FALSE. Meaning, R had decided that none of the columns in my wife’s data frame were numeric–which was most certainly incorrect. And because the code wasn’t throwing an error, and the apply() call was embedded within a longer function, it wasn’t obvious to my wife–as an R newbie and a novice programmer–what had gone wrong. From her perspective, the regression analyses she was trying to run with lm() were breaking with strange messages. So she spent a couple of hours trying to debug her code before asking me for help.

Anyway, I took a look at the help documentation, and the source of the problem turned out to be the following: apply() only operates over matrices or vectors, and not on data frames. So when you pass a data frame to apply() as the input, it’s implicitly converted to a matrix. Unfortunately, because matrices can only contain values of one data type, any data frame that has at least one string column will end up being converted to a string (or, in R’s nomenclature, character) matrix. And so now when we apply the is.numeric() function to each column of the matrix, the answer is always going to be FALSE, because all of the columns have been converted to character vectors. So apply() is actually doing exactly what it’s supposed to; it’s just that it doesn’t deign to tell you that it’s implicitly casting your data frame to a matrix before doing anything else. The upshot is that unless you carefully read the apply() documentation and have a basic understanding of data types (which, if you’ve just started dabbling in R, you may well not), you’re hosed.

At this point I could have–and probably should have–thrown in the towel and just suggested to my wife that she use an explicit loop. But that would have dealt a mortal blow to my pride as an experienced-if-not-yet-guru-level R user. So of course I did what any self-respecting programmer does: I went and googled it. And the first thing I came across was the all.is.numeric() function in the Hmisc package which has the following description:

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values.

Perfect! So now the solution to my wife’s problem became this:

library(Hmisc)
apply(ice.cream, 2, all.is.numeric)

…which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution. So I googled some more. And came across a relevant Stack Exchange answer, which had the following simple solution to my wife’s exact problem:

sapply(ice.cream, is.numeric)

You’ll notice that this is virtually identical to the apply() approach that crashed. That’s no coincidence; it turns out that sapply() is just a variant of apply() that works on lists. And since data frames are actually lists, there’s no problem passing in a data frame and iterating over its columns. So just like that, we have an elegant one-line solution to the original problem that doesn’t invoke any loops or third-party packages.

Now, having used apply() a million times, I probably should have known about sapply(). And actually, it turns out I did know about sapply–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart. In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. In other words, in 2012, I’m the kind of experienced R user that you might generously call “not very good at R”, and, less generously, “dumb”.

On the plus side, the end product is undeniably cool, right? There are very few languages in which you could achieve so much functionality so compactly right out of the box. And this isn’t an isolated case; base R includes a zillion high-level functions to do similarly complex things with data in a fraction of the code you’d need to write in most other languages. Once you throw in the thousands of high-quality user-contributed packages, there’s nothing else like it in the world of statistical computing.

Anyway, this inordinately long story does have a point to it, I promise, so let me sum up:

  • If I had just ignored the desire to be efficient and clever, and had told my wife to solve the problem the way she’d solve it in most other languages–with a simple for-loop–it would have taken her a couple of minutes to figure out, and she’d probably never have run into any problems.
  • If I’d known R slightly better, I would have told my wife to use sapply(). This would have taken her 10 seconds and she’d definitely never have run into any problems.
  • BUT: because I knew enough R to be clever but not enough R to avoid being stupid, I created an entirely avoidable problem that consumed a couple of hours of my wife’s time. Of course, now she knows about both apply() and sapply(), so you could argue that in the long run, I’ve probably still saved her time. (I’d say she also learned something about her husband’s stubborn insistence on pretending he knows what he’s doing, but she’s already the world-leading expert on that topic.)

Anyway, this anecdote is basically a microcosm of my entire experience with R. I suspect many other people will relate. Basically what it boils down to is that R gives you a certain amount of rope to work with. If you don’t know what you’re doing at all, you will most likely end up accidentally hanging yourself with that rope. If, on the other hand, you’re a veritable R guru, you will most likely use that rope to tie some really fancy knots, scale tall buildings, fashion yourself a space tuxedo, and, eventually, colonize brave new statistical worlds. For everyone in between novice and guru (e.g., me), using R on a regular basis is a continual exercise in alternately thinking “this is fucking awesome” and banging your head against the wall in frustration at the sheer stupidity (either your own, or that of the people who designed this awful language). But the good news is that the longer you use R, the more of the former and the fewer of the latter experiences you have. And at the end of the day, it’s totally worth it: the language is powerful enough to make you forget all of the weird syntax, strange naming conventions, choking on large datasets, and issues with data type conversions.

Oh, except when your wife is yelling at gently reprimanding you for wasting several hours of her time on a problem she could have solved herself in 5 minutes if you hadn’t insisted that she do it the idiomatic R way. Then you remember exactly why R is the master troll of statistical languages.

 

 

* R users will probably notice that I use the = operator for assignment instead of the <- operator even though the latter is the officially prescribed way to do it in R (i.e., a <- 2 is favored over a = 2). That’s because these two idioms are interchangeable in all but one (rare) use case, and personally I prefer to avoid extra keystrokes whenever possible. But the fact that you can do even basic assignment in two completely different ways in R drives home the point about how pathologically flexible–and, to a new user, confusing–the language is.

correlograms are correlicious

In the last year or so, I’ve been experimenting with different ways of displaying correlation matrices, and have gotten very fond of color-coded correlograms. Here’s one from a paper I wrote investigating the relationship between personality and word use among bloggers (click to enlarge):

Figure S2 Extraversion

The rows reflect language categories from Jamie Pennebaker’s Linguistic Inquiry and Word Count (LIWC) dictionary; the columns reflect Extraversion scores (first column) or scores on the lower-order “facets” of Extraversion (as measured by the IPIP version of the NEO-PI-R). The plot was generated in R using code adapted from the corrgram package (R really does have contributed packages for everything). Positive correlations are in blue, negative ones are in red.

The thing I really like about these figures is that the colors instantly orient you to the most important features of the correlation matrix, instead of having to inspect every cell for the all-important ***magical***asterisks***of***statistical***significance***. For instance, a cursory glance tells you that even though Excitement-Seeking and Cheerfulness are both nominally facets of Extraversion, they’re associated with very different patterns of word use. And then a slightly less cursory glance tells you that that’s because people with high Excitement-Seeking scores like to swear a lot and use negative emotion words, while Cheerful people like to talk about friends, music, and use positive emotional language. You’d get the same information without the color, of course, but it’d take much longer to extract,  and then you’d have to struggle to keep all of the relevant numbers in mind while you mull them over. The colors do a lot to reduce cognitive load, and also have the secondary benefit of looking pretty.

If you’re interested in using correlograms, a good place to start is the Quick-R tutorial on correlograms in R. The documentation for the corrgram package is here, and there’s a nice discussion of the principles behind the visual display of correlation matrices in this article.

p.s. I’m aware this post has the worst title ever; the sign-up sheet for copy editing duties is in the comment box (hint hint).

abbreviating personality measures in R: a tutorial

A while back I blogged about a paper I wrote that uses genetic algorithms to abbreviate personality measures with minimal human intervention. In the paper, I promised to put the R code I used online, so that other people could download and use it. I put off doing that for a long time, because the code was pretty much spaghetti by the time the paper got accepted, and there are any number of things I’d rather do than spend a weekend rewriting my own code. But one of the unfortunate things about publicly saying that you’re going to do something is that you eventually have to do that something. So, since the paper was published in JRP last week, and several people have emailed me to ask for the code, I spent much of the weekend making the code presentable. It’s not a fully-formed R package yet, but it’s mostly legible, and seems to work more or less ok. You can download the file (gaabbreviate.R) here. The rest of this (very long) post is basically a tutorial on how to use the code, so you probably want to stop reading this now unless you have a burning interest in personality measurement.

Prerequisites and installation

Although you won’t need to know much R to follow this tutorial, you will need to have R installed on your system. Fortunately, R is freely available for all major operating systems. You’ll also need the genalg and psych packages for R, because gaabbreviate won’t run without them. Once you have R installed, you can download and install those packages like so:

install.packages(c(‘genalg’, ‘psych’))

Once that’s all done, you’re ready to load gaabbreviate.R:

source(“/path/to/the/file/gaabbreviate.R”)

…where you make sure to specify the right path to the location where you saved the file. And that’s it! Now you’re ready to abbreviate measures.

Reading in data

The file contains several interrelated functions, but the workhorse is gaa.abbreviate(), which takes a set of item scores and scale scores for a given personality measure as input and produces an abbreviated version of the measure, along with a bunch of other useful information. In theory, you can go from old data to new measure in a single line of R code, with almost no knowledge of R required (though I think it’s a much better idea to do it step-by-step and inspect the results at every stage to make sure you know what’s going on).

The abbreviation function is pretty particular about the format of the input it expects. It takes two separate matrices, one with item scores, the other with scale scores (a scale here just refers to any set of one or more items used to generate a composite score). Subjects are in rows, item or scale scores are in columns. So for example, let’s say you have data from 3 subjects, who filled out a personality measure that has two separate scales, each composed of two items. Your item score matrix might look like this:

3 5 1 1

2 2 4 1

2 4 5 5

…which you could assign in R like so:

items = matrix(c(3,2,2,5,2,2,1,4,5,1,1,5), ncol=3)

I.e., the first subject had scores of 3, 5, 1, and 1 on the four items, respectively; the second subject had scores of 2, 2, 4, and 1… and so on.

Based on the above, if you assume items 1 and 2 constitute one scale, and items 3 and 4 constitute the other, the scale score matrix would be:

8 2

4 5

6 10

Of course, real data will probably have hundreds of subjects, dozens of items, and a bunch of different scales, but that’s the basic format. Assuming you can get your data into an R matrix or data frame, you can feed it directly to gaa.abbreviate() and it will hopefully crunch your data without complaining. But if you don’t want to import your data into R before passing it to the code, you can also pass filenames as arguments instead of matrices. For example:

gaa = gaa.abbreviate(items=”someFileWithItemScores.txt”, scales=”someFileWithScaleScores.txt”, iters=100)

If you pass files instead of data, the referenced text files must be tab-delimited, with subjects in rows, item/scale scores in columns, and a header row that gives the names of the columns (i.e., item names and scale names; these can just be numbers if you like, but they have to be there). Subject identifiers should not be in the files.

Key parameters: stuff you should set every time

Assuming you can get gaabbreviate to read in your data, you can then set about getting it to abbreviate your measure by selecting a subset of items that retain as much of the variance in the original scales as possible. There are a few parameters you’ll need to set; some are mandatory, others aren’t, but should really be specified anyway since the defaults aren’t likely to work well for different applications.

The most important (and mandatory) argument is iters, which is the number of iterations you want the GA to run for. If you pick too high a number, the GA may take a very long time to run if you have a very long measure; if you pick too low a number, you’re going to get a crappy solution. I think iters=100 is a reasonable place to start, though in practice, obtaining a stable solution tends to require several hundred iterations. The good news (which I cover in more detail below) is that you can take the output you get from the abbreviation function and feed it right back in as many times as you want, so it’s not like you need to choose the number of iterations carefully or anything.

The other two key parameters are itemCost and maxItems. The itemCost is what determines the degree to which your measure is compressed. If you want a detailed explanation of how this works, see the definition of the cost function in the paper. Very briefly, the GA tries to optimize the trade-off between number of items and amount of variance explained. Generally speaking, the point of abbreviating a measure is to maximize the amount of explained variance (in the original scale scores) while minimizing the number of items retained. Unfortunately, you can’t do both very well at the same time, because any time you drop an item, you’re also losing its variance. So the trick is to pick a reasonable compromise: a measure that’s relatively short and still does a decent job recapturing the original. The itemCost parameter is what determines the length of that measure. When you set it high, the GA will place a premium on brevity, resulting in a shorter (but less accurate) measure; when you set it low, it’ll allow a longer measure that maximizes fidelity. The optimal itemCost will vary depending on your data, but I find 0.05 is a good place to start, and then you can tweak it to get measures with more or fewer items as you see fit.

The maxItems parameter sets the upper bound on the number of items that will be used to score each scale. The default is 5, but you may find this number too small if you’re trying to abbreviate scales comprised of a large number of items. Again, it’s worth playing around with this to see what happens. Generally speaks, the same trade-off between brevity and fidelity discussed above holds here too.

Given reasonable values for the above arguments, you should be able to feed in raw data and get out an abbreviated measure with minimal work. Assuming you’re reading your data from a file, the entire stream can be as simple as:

gaa = gaa.abbreviate(items=”someFileWithItemScores.txt”, scales=”someFileWithScaleScores.txt”, iters=100, itemCost=0.05, maxItems=5, writeFile=’outputfile.txt’)

That’s it! Assuming your data are in the correct format (and if they’re not, the script will probably crash with a nasty error message), gaabbreviate will do its thing and produce your new, shorter measure within a few minutes or hours, depending on the size of the initial measure. The writeFile argument is optional, and gives the name of an output file you want the measure saved to. If you don’t specify it, the output will be assigned to the gaa object in the above call (note the “gaa = ” part of the call), but won’t be written to file. But that’s not a problem, because you can always achieve the same effect later by calling the gaa.writeMeasure function (e.g., in the above example, gaa.writeMeasure(gaa, file=”outputfile.txt”) would achieve exactly the same thing).

Other important options

Although you don’t really need to do anything else to produce abbreviated measures, I strongly recommend reading the rest of this document and exploring some of the other options if you’re planning to use the code, because some features are non-obvious. Also, the code isn’t foolproof, and it can do weird things with your data if you’re not paying attention. For one thing, by default, gaabbreviate will choke on missing values (i.e., NAs). You can do two things to get around this: either enable pairwise processing (pairwise=T), or turn on mean imputation (impute=T). I say you can do these things, but I strongly recommend against using either option. If you have missing values in your data, it’s really a much better idea to figure out how to deal with those missing values before you run the abbreviation function, because the abbreviation function is dumb, and it isn’t going to tell you whether pairwise analysis or imputation is a sensible thing to do. For example, if you have 100 subjects with varying degrees of missing data, and only have, say, 20 subjects’ scores for some scales, the resulting abbreviated measure is going to be based on only 20 subjects’ worth of data for some scales if you turn pairwise processing on. Similarly, imputing the mean for missing values is a pretty crude way to handle missing data, and I only put it in so that people who just wanted to experiment with the code wouldn’t have to go to the trouble of doing it themselves. But in general, you’re much better off reading your item and scale scores into R (or SPSS, or any other package), processing any missing values in some reasonable way, and then feeding gaabbreviate the processed data.

Another important point to note is that, by default, gaabbreviate will cross-validate its results. What that means is that only half of your data will be used to generate an abbreviated measure; the other half will be used to provide unbiased estimates of how well the abbreviation process worked. There’s an obvious trade-off here. If you use the split-half cross-validation approach, you’re going to get more accurate estimates of how well the abbreviation process is really working, but the fit itself might be slightly poorer because you have less data. Conversely, if you turn cross-validation off (crossVal=F), you’re going to be using all of your data in the abbreviation process, but the resulting estimates of the quality of the solution will inevitably be biased because you’re going to be capitalizing on chance to some extent.

In practice, I recommend always leaving cross-validation enabled, unless you either (a) really don’t care about quality control (which makes you a bad person), or (b) have a very small sample size, and can’t afford to leave out half of the data in the abbreviation process (in which case you should consider collecting more data). My experience has been that with 200+ subjects, you generally tend to see stable solutions even when leaving cross-validation on, though that’s really just a crude rule of thumb that I’m pulling out of my ass, and larger samples are always better.

Other less important options

There are a bunch other less important options that I won’t cover in any detail here, but that are reasonably well-covered in the comments in the source file if you’re so inclined. Some of these are used to control the genetic algorithm used in the abbreviation process. The gaa.abbreviate function doesn’t actually do the heavy lifting itself; instead, it relies on the genalg library to run the actual genetic algorithm. Although the default genalg parameters will work fine 95% of the time, if you really want to manually set the size of the population or the ratio of initial zeros to ones, you can pass those arguments directly. But there’s relatively little reason to play with these parameters, because you can always achieve more or less the same ends simply by adding iterations.

Two other potentially useful options I won’t touch on, though they’re there if you want them, give you the ability to (a) set a minimum bound on the correlation required in order for an item to be included in the scoring equation for a scale (the minR argument), and (b) apply non-unit weightings to the scales (the sWeights argument), in cases where you want to emphasize some scales at the cost of others (i.e., because you want to measure some scales more accurately).

Two examples

The following two examples assume you’re feeding in item and scale matrices named myItems and myScales, respectively:

example1

This will run a genetic algorithm for 500 generations on mean-imputed data with cross-validation turned off, and assign the result to a variable named my.new.shorter.measure. It will probably produce an only slightly shorter measure, because the itemCost is low and up to 10 items are allowed to load on each scale.

example2

This will run 100 iterations with cross-validation enabled (the default, so we don’t need to specify it explicitly) and write the result to a file named shortMeasure.txt. It’ll probably produce a highly abbreviated measure, because the itemCost is relatively high. It also assigns more weight (twice as much, in fact) to the fourth and fifth scales in the measure relative to the first three, as reflected in the sWeights argument (a vector where the ith element indicates the weight of the ith scale in the measure, so presumably there are five scales in this case).

The gaa object

Assuming you’ve read this far, you’re probably wondering what you get for your trouble once you’ve run the abbreviation function. The answer is that you get… a gaa (which stands for GA Abbreviate) object. The gaa object contains almost all the information that was used at any point in the processing, which you can peruse at your leisure. If you’re familiar with R, you’ll know that you can see what’s in the object with the attributes function. For example, if you assigned the result of the abbreviation function to a variable named ‘myMeasure’, here’s what you’d see:

attributes

The gaa object has several internal lists (data, settings, results, etc.), each of which in turn contains several other variables. I’ve tried to give these sensible names. In brief:

  • data contains all the data used to create the measure (i.e., the item and scale scores you fed in)
  • settings contains all the arguments you specified when you called the abbreviation function (e.g., iters, maxItems, etc.)
  • results contains variables summarizing the results of the GA run, including information about each previous iteration of the GA
  • best contains information about the single best measure produced (this is generally not useful, and is for internal purposes)
  • rbga is the rbga.bin object produced by the genetic library (for more information, see the genalg library documentation)
  • measure is what you’ll probably find most important, as it contains the details of the final measure that was produced

To see the contents of each of these lists in turn, you can easily inspect them:

measure

So the ‘measure’ attribute in the gaa object contains a bunch of other variables with information about the resulting measure. And here’s a brief summary:

  • items: a vector containing the numerical ID of items retained in the final measure relative to the original list (e.g., if you fed in 100 items, and the ‘items’ variable contained the numbers 4, 10, 14… that’s the GA telling you that it decided to keep items no. 4, 10, 14, etc., from the original set of 100 items).
  • nItems: the number of items in the final measure.
  • key: a scoring key for the new measure, where the rows are items on the new measure, and the columns are the scales. This key is compatible with score.items() in Bill Revelle’s excellent psych package, which means that once you’ve got the key, you can automatically score data for the new measure simply by calling score.items() (see the documentation for more details), and don’t need to do any manual calculating or programming yourself.
  • ccTraining and ccValidation: convergent correlations for the training and validation halves of the data, respectively. The convergent correlation is the correlation between the new scale scores (i.e., those that you get using the newly-generated measure) and the “real” scale scores. The ith element in the vector gives you the convergent correlation for the ith scale in your original measure. The validation coefficients will almost invariably be lower than the training coefficients, and the validation numbers are the ones you should trust as an unbiased estimate of the quality of the measure.
  • alpha: coefficient alpha for each scale. Note that you should expect to get lower internal consistency estimates for GA-produced measures than you’re used to, and this is actually a good thing. If you want to know why, read the discussion in the paper.
  • nScaleItems: a vector containing the number of items used to score each scale. If you left minR set to 0, this will always be identical to maxItems for all items. If you raised minR, the number of items will sometimes be lower (i.e., in cases where there were very few items that showed a strong enough correlation to be retained).

Just give me the measure already!

Supposing you’re not really interested in plumbing the depths of the gaa object or working within R more than is necessary, you might just be wondering what the quickest way to get an abbreviated measure you can work with is. In that case, all you really need to do is pass a filename in the writeFile argument when you call gaa.abbreviate (see the examples given above), and you’ll get out a plain text file that contains all the essential details of the new measure. Specifically you’ll get (a) a mapping from old items to new, so that you can figure out which items are included in the new measure (e.g., a line like “4 45” means that the 4th item on the new measure is no. 45 in the original set of items), and (b) a human-readable scoring key for each scale (the only thing to note here is that an “R” next to an item indicates the item is reverse-keyed), along with key statistics (coefficient alpha and convergent correlations for the training and validation halves). So if all goes well, you really won’t need to do anything else in R beyond call that one line that makes the measure. But again, I’d strongly encourage you to carefully inspect the gaa object in R to make sure everything looks right. The fact that the abbreviation process is fully automated isn’t a reason to completely suspend all rational criteria you’d normally use when developing a scale; it just means you probably have to do substantially less work to get a measure you’re happy with.

Killing time…

Depending on how big your dataset is (actually, mainly the number of items in the original measure), how many iterations you’ve requested, and how fast your computer is, you could be waiting a long time for the abbreviation function to finish its work. Because you probably want to know what the hell is going on internally during that time, I’ve provided a rudimentary monitoring display that will show you the current state of the genetic algorithm after every iteration. It looks like this (click for a larger version of the image):

display

This is admittedly a pretty confusing display, and Edward Tufte would probably murder several kittens if he saw it, but it’s not supposed to be a work of art, just to provide some basic information while you’re sitting there twiddling your thumbs (ok, ok, I promise I’ll label the panels better when I have the time to work on it). But basically, it shows you three things. The leftmost three panels show you the basic information about the best measure produced by the GA as it evolves across generations. Respectively, the top, middle,and bottom panels show you the total cost, measure length, and mean variance explained (R^2) as a function of iteration. The total cost can only ever go down, but the length and R^2 can go up or down (though there will tend to be a consistent trajectory for measure length that depends largely on what itemCost you specified).

The middle panel shows you detailed information about how well the GA-produced measure captures variance in each of the scales in the original measure. In this case, I’m abbreviating the 30 facets of the NEO-PI-R. The red dot displays the amount of variance explained in each trait, as of the current iteration.

Finally, the rightmost panel shows you a graphical representation of which items are included in the best measure identified by the GA at each iteration.Each row represents one iteration (i.e., you’re seeing the display as it appears after 200 iterations of a 250-iteration run); black bars represent items that weren’t included, white bars represent items that were included. The point of this display isn’t to actually tell you which items are being kept (you can’t possibly glean that level of information at this resolution), but rather, to give you a sense of how stable the solution is. If you look at the the first few (i.e., topmost) iterations, you’ll see that the solution is very unstable: the GA is choosing very different items as the “best” measure on each iteration. But after a while, as the GA “settles” into a neighborhood, the solution stabilizes and you see only relatively small (though still meaningful) changes from generation to generation. Basically, once the line in the top left panel (total cost) has asymptoted, and the solution in the rightmost panel is no longer changing much if at all, you know that you’ve probably arrived at as good a solution as you’re going to get.

Incidentally, if you use the generic plot() method on a completed gaa object (e.g., plot(myMeasure)), you’ll get exactly the same figure you see here, with the exception that the middle figure will also have black points plotted alongside the red ones.  The black points show you the amount of variance explained in each trait for the cross-validated results. If you’re lucky, the red and black points will be almost on top of each other; if you’re not, the black ones will be considerably to the left of the red ones .

Consider recycling

The last thing I’ll mention, which I already alluded to earlier, is that you can recycle gaa objects. That’s to say, suppose you ran the abbreviation for 100 iterations, only to get back a solution that’s still clearly suboptimal (i.e., the cost function is still dropping rapidly). Rather than having to start all over again, you can simply feed the gaa object back into the abbreviation function in order to run further iterations. And you don’t need to specify any additional parameters (assuming you want to run the same number of iterations you did last time; otherwise you’ll need to specify iters); all of the settings are contained within the gaa object itself. So, assuming you ran the abbreviation function and stored the result in ‘myMeasure’, you can simply do:

myMeasure = gaa.abbreviate(myMeasure, iters=200)

and you’ll get an updated version of the measure that’s had the benefit of an extra 200 iterations. And of course, you can save and load R objects to/from files, so that you don’t need to worry about all of your work disappearing next time you start R. So save(myMeasure, ‘filename.txt’) will save your gaa object for future use, and the next time you need it, you can call myMeasure = load(‘filename.txt’) to get it back (alternatively, you can just save the entire workspace).

Anyway, I think that covers all of the important stuff. There are a few other things I haven’t documented here, but if you’ve read this far, and have gotten the code to work in R, you should be able to start abbreviating your own measures relatively painlessly. If you do use the code to generate shorter measures, and end up with measures you’re happy with, I’d love to hear about it. And if you can’t get the code to work, or can get it to work but are finding issues with the code or the results, I guess I’ll grudgingly accept those emails too. In general, I’m happy to provide support for the code via email provided I have the time. The caveat is that, if you’re new to R, and are having problems with basic things like installing packages or loading files from source, you should really read a tutorial or reference that introduces you to R (Quick-R is my favorite place to start) before emailing me with problems. But if you’re having problems that are specific to the gaabbreviate code (e.g., you’re getting a weird error message, or aren’t sure what something means), feel free to drop me a line and I’ll try to respond as soon as I can.

got R? get social science for R!

Drew Conway has a great list of 10 must-have R packages for social scientists. If you’re a social scientist (or really, any kind of scientist) who doesn’t use R, now is a great time to dive in and learn; there are tons of tutorials and guides out there (my favorite is Quick-R, which is incredibly useful incredibly often), and packages are available for just about any application you can think of. Best of all, R is completely free, and is available for just about every platform. Admittedly, there’s a fairly steep learning curve if you’re used to GUI-based packages like SPSS (R’s syntax can be pretty idiosyncratic), but it’s totally worth the time investment, and once you’re comfortable with R you’ll never look back.

Anyway, Drew’s list contains a number of packages I’ve found invaluable in my work, as well as several packages I haven’t used before and am pretty eager to try. I don’t have much to add to his excellent summaries, but I’ll gladly second the inclusion of ggplot2 (the easiest way in the world to make beautiful graphs?) and plyr and sqldf (great for sanitizing, organizing, and manipulating large data sets, which are often a source of frustration in R). Most of the other packages I haven’t had any reason to use personally, though a few seem really cool, and worth finding an excuse to play around with (e.g., Statnet and igraph).

Since Drew’s list focuses on packages useful to social scientists in general, I thought I’d mention a couple of others that I’ve found particularly useful for psychological applications. The most obvious one is William Revelle‘s awesome psych package, which contains tons of useful functions for descriptive statistics, data reduction, simulation, and psychometrics. It’s saved me me tons of time validating and scoring personality measures, though it probably isn’t quite as useful if you don’t deal with individual difference measures regularly. Other packages I’ve found useful are sem for structural equation modeling (which interfaces nicely with GraphViz to easily produce clean-looking path diagrams), genalg for genetic algorithms, MASS (mostly for sampling from multivariate distributions), reshape (similar functionality to plyr), and car, which contains a bunch of useful regression-related functions (e.g., for my dissertation, I needed to run SPSS-like repeated measures ANOVAs in R, which turns out to be a more difficult proposition than you’d imagine, but was handled by the Anova function in car). I’m sure there are others I’m forgetting, but those are the ones that I’ve relied on most heavily in recent work. No doubt there are tons of other packages out there that are handly for common psychology applications, so if there are any you use regularly, I’d love to hear about them in the comments!