The parable of the three districts: A projective test for psychologists

A political candidate running for regional public office asked a famous political psychologist what kind of television ads she should air in three heavily contested districts: positive ones emphasizing her own record, or negative ones attacking her opponent’s record.

“You’re in luck,” said the psychologist. “I have a new theory of persuasion that addresses exactly this question. I just published a paper containing four large studies that all strongly support the theory and show that participants are on average more persuaded by attack ads than by positive ones.”

Convinced by the psychologist’s arguments and his confident demeanor, the candidate had her campaign run carefully tailored attack ads in all three districts. She proceeded to lose the race by a landslide, with exit surveys placing much of the blame on the negative tone of her ads.

As part of the campaign post-mortem, the candidate asked the psychologist what he thought had gone wrong.

“Oh, different things,” said the psychologist. “In hindsight, the first district was probably too educated; I could see how attack ads might turn off highly educated voters. In the second district—and I’m not going to tiptoe around the issue here—I think the problem was sexism. You have a lot of low-SES working-class men in that district who probably didn’t respond well to a female candidate publicly criticizing a male opponent. And in the third district, I think the ads you aired were just too over the top. You want to highlight your opponent’s flaws subtly, not make him sound like a cartoon villain.”

“That all sounds reasonable enough,” said the candidate. “But I’m a bit perplexed that you didn’t mention any of these subtleties ahead of time, when they might have been more helpful.”

“Well,“ said the psychologist. “That would have been very hard to do. The theory is true in general, you see. But every situation is different.“

The great European capitals of North America

There are approximately 25 communities named Athens in North America. I say “approximately”, because it depends on how you count. Many of the American Athenses are unincorporated communities, and rely for their continued existence not on legal writ, but on social agreement or collective memory. Some no longer exist at all, having succumbed to the turbulence of the Western gold rush (Athens, Nevada) or given way to a series of devastating fires (Athens, Kentucky). Most are—with apologies to their residents—unremarkable. Only one North American Athens has ever made it (relatively) big: Athens, Georgia, home of the University of Georgia—a city whose population of 120,000 is pretty large for a modern-day American college town, but was surpassed by the original Athens some time around 500 BC.

The reasons these communities were named Athens have, in many cases, been lost to internet time (meaning, they can’t be easily discerned via five minutes of googling). But the modal origin story, among the surviving Athenses with reliable histories (i.e., those with a “history” section in their Wikipedia entry), is exactly what you might expect: some would-be 19th century colonialist superheroes (usually white and male) heard a few good things about some Ancient Greek gentlemen named Socrates, Plato, and Aristotle, and decided that the little plot of land they had just managed to secure from the governments of the United States or Canada was very much reminiscent of the hallowed grounds on which the Platonic Academy once stood. It was presumably in this spirit that the residents of Farmersville, Ontario, for instance, decided in 1888 to change their town’s name to Athens—a move designed to honor the town’s enviable status as an emerging leader in scholastic activity, seeing as how it had succeeded in building itself both a grammar school and a high school.

It’s safe to say that none of the North American Athenses—including the front-running Georgian candidate—have quite lived up to the glory of their Greek namesake. Here, at least, is one case where statistics do not lie: if you were to place the entire global population in a (very large) urn, and randomly sample people from that urn until you picked out someone who claimed they were from a place called Athens, there would be a greater than 90% probability that the Athens in question would be located in Greece. Most other European capitals would give you a similar result. The second largest Rome in the world, as far as I can tell, is Rome, Georgia, which clocks in at 36,000 residents. Moscow, Idaho boasts 24,000 inhabitants; Amsterdam, New York has 18,000 (we’ll ignore, for purposes of the present argument, that aberration formerly known as New Amsterdam).
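(If you’re skeptical of that 90% figure, the arithmetic is easy to check. Here’s a minimal back-of-the-envelope sketch in Python; the population numbers are rough assumptions on my part, not census data.)

```python
# Rough, assumed population figures (not census data): the Athens, Greece
# metro area is somewhere around 3.7 million, Athens, Georgia is about
# 120,000, and we grant all remaining North American Athenses a generous
# combined total.
athens_populations = {
    "Athens, Greece (metro)": 3_700_000,
    "Athens, Georgia": 120_000,
    "every other North American Athens combined": 100_000,  # assumed upper bound
}

total = sum(athens_populations.values())
p_greece = athens_populations["Athens, Greece (metro)"] / total
print(f"P(from Greece | claims to be from an Athens) = {p_greece:.1%}")
# With these assumed numbers: about 94%, comfortably above the 90% claim.
```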

Of course, as with any half-baked generalization, there are some notable exceptions. A case in point: London, Ontario (“the 2nd best London in the world”). Having spent much of my youth in Ottawa, Ontario—a mere six-hour drive away from the Ontarian London—I can attest that when someone living in the Quebec City-Windsor corridor tells you they’re “moving to London”, the inevitable follow-up question is “which one?”

London, Ontario is hardly a metropolis. Even on a good day (say, Sunday, when half of the population isn’t commuting to the Greater Toronto Area for work), its metro population is under half a million. Still, when you compare it to its nomenclatorial cousins, London, Ontario stands out as a 60-pound baboon in the room (though it isn’t the 800-pound gorilla; that honor goes to St. Petersburg, Florida). For perspective, the third biggest London in the world appears to be London, Ohio—population 10,000. I’ve visited London, Ontario, and I know quite a few people who have lived there, but I will almost certainly go my entire life without ever meeting anyone born in London, Ohio—or, for that matter, any London other than the ones in the UK and Ontario.

What about the other great European capitals? Most of them are more like Athens than London. Some of them have more imitators than others; Paris, for example, has at least 30 namesakes worldwide. In some years, Paris is the most visited city in the world, so maybe this isn’t surprising. Many people visit Paris and fall in love with the place, so perhaps it’s inevitable that a handful of very single-minded visitors should decide that if they can’t call Paris home, they can at least call home Paris. And so we have Paris, Texas (population 25,000), Paris, Tennessee (pop. 10,000), and Paris, Michigan (pop. 3,200). All three are small and relatively rural, yet each manages to proudly feature its own replica of the Eiffel Tower. (Mind you, none of these Eiffel replicas are anywhere near as large as the half-scale behemoth that looms over the Las Vegas Strip—that quintessentially American street that has about as much in common with the French capital’s roadways as Napoleon has with Flavor Flav.)

But forget the Parises; let’s talk about the Berlins. A small community named Berlin can seemingly be found in every third tree hollow or roadside ditch in the United States—a reminder that fully one in every seven Americans claims to be of German extraction. It’s easy to forget that, prior to 1900, German-language instruction was offered at hundreds of public elementary schools across the country. One unsurprising legacy of having so many Germans in the United States is that we also have a lot of German capitals. In fact, there are so many Berlins in America that quite a few states have more than one. Wisconsin has two separate towns named Berlin—one in Green Lake County (pop. 5,500), and one in Marathon County (pop. < 1,000)—as well as a New Berlin (pop. 40,000) bigger than both of the two plain old Berlins combined. Search Wikipedia for Berlin, Michigan, and you’ll stumble on a disambiguation entry that features no fewer than 4 places: Marne (formerly known as Berlin), Berlin Charter Township, Berlin Township, and Berlin Township. No, that last one isn’t a typo.

Berlin’s ability to inject itself so pervasively into the fabric of industrial-era America is all the more impressive given that, as European capitals go, Berlin is a relative newcomer on the scene. The archeological evidence only dates human habitation in Berlin’s current site along the Spree back about 900 years. But the millions of German immigrants who imported their language and culture to the North America of the 1800s were apparently not the least bit deterred by this youthfulness. It’s as if the nascent American government of the time had looked Berlin over once or twice, noticed it carrying a fake membership card to the Great European Capitals Club, and opportunistically said, listen, they’ll never give you a drink in this place–but if you just hop on a boat and paddle across this tiny ocean…

Of course, what fate giveth, it often taketh away. What the founders of many of the transplanted Berlins, Athenses, and Romes probably didn’t anticipate was the fragility of their new homes. It turns out that growing a motley collection of homesteads into a successful settlement is no easy trick–and, as the celebrity-loving parents of innumerable children have no doubt found out, having a famous namesake doesn’t always confer a protective effect. In some cases, it may be actively harmful. As a cruel example, consider Berlin, Ontario–a thriving community of around 20,000 people when the first World War broke out. But in 1916, at the height of the war, a plurality of 346 residents voted to change Berlin’s name to Kitchener–beating out other shortlisted names like Huronto, Bercana (“a mixture of Berlin and Canada”), and Hydro City. Pressured by a campaign of xenophobia, the residents of Berlin, Ontario–a perfectly ordinary Canadian city that had exactly nothing to do with Kaiser Wilhelm II’s policies on the Continent–opted to renounce their German heritage and rename their town after the British Field Marshal Herbert Kitchener (by some accounts a fairly murderous chap in his own right).

In most cases, of course, dissolution was a far more mundane process. Much like any number of other North American settlements with less grandiose names, many of the transplanted European capitals drifted towards their demise slowly–perhaps driven by nothing more than their inhabitants’ gradual realization that the good life was to be found elsewhere–say, in Chicago, Philadelphia, or (in later years) Los Angeles. In an act of considerable understatement, the historian O.L. Baskin observed in 1880 that a certain Rome, Ohio—which, at the time of Baskin’s writing, had already been effectively deceased for several decades—“did not bear any resemblance to ancient Rome.” The town, Baskin wrote, “passed into oblivion, and like the dead, was slowly forgotten”. And since Baskin wrote these words in an 1880 volume that has been out of print for over 100 years, there was for a long time a non-negligible risk that any memory of a place called Rome in Morrow County, Ohio might be completely obliterated from the pages of history.

But that was before the rise of the all-seeing, ever-recording beast that is the internet—the same beast that guarantees we will collectively never forget that it rained half an inch in Lincoln, Nebraska on October 15, 1984, or who was in the cast of the 1996 edition of MTV’s Real World. Immortalized in its own Wikipedia stub, the memory of Rome, Morrow County, Ohio will probably live exactly as long as the rest of our civilization. Its real fate, it turns out, is not to pass into total oblivion, but to ride the mercurial currents of global search history, gradually but ineluctably decreasing in mind-share.

The same is probably true, to a lesser extent, of most of the other transplanted European capitals of North America. London, Ontario isn’t going anywhere, but some of the other small Athenses and Berlins of North America might. As the population of the US and Canada continues to slowly urbanize, there could conceivably come a time when the last person in Paris, MI, decides that gazing upon a 20-foot forest replica of the Eiffel Tower once a week just isn’t a good enough reason to live an hour away from the nearest Thai restaurant.

Paris, Michigan

For the time being, though, these towns survive. And if you live in the US or Canada, the odds are pretty good that at least one of them is only a couple of hours’ drive away from you. If you’re reading this in Dallas at 10 am, you could hop in your car, drive to Paris (TX), have lunch at BurgerLand, spend the afternoon walking aimlessly around the Texan incarnation of the Eiffel Tower, and still be home well before dinner.

The way I see it, though, there’s no reason to stop there. You’re already sitting in your car and listening to podcasts; you may as well keep going, right? I mean, once you hit Paris (TX), it’s only a few more hours to Moscow (TX). Past Moscow, it’s 2 hours to Berlin–that’s practically next door. Then Praha, Buda, London, Dublin, and Rome (well, fine, “Rhome”) all quickly follow. It turns out you can string together a circular tour of no fewer than 10 European capitals in under 20 hours–all without ever leaving Texas.

But the way I really see it, there’s no reason to stop there either. EuroTexas has its own local flavor, sure; but you can only have so much barbecue, and buy so many guns, before you start itching for something different. And taking the tour national wouldn’t be hard; there are literally hundreds of former Eurocapitals to explore. Throw in Canada–all it takes is a quick stop in Brussels, Athens, or London (Ontario)–and you could easily go international. I’m not entirely ashamed to admit that, in my less lucid, more bored moments, I’ve occasionally contemplated what it would be like to set out on a really epic North AmeroEuroCapital tour. I’ve even gone so far as to break out a real, live paper map (yes, they still exist) and some multicolored markers (of the non-erasable kind, because that signals real commitment). Three months, 48 states, 200 driving hours, and, of course, no bathroom breaks… you get the picture.

Of course, these plans never make it past the idle fantasy stage. For one thing, the European capitals of North America are much farther apart than the actual European capitals; such is the geographic legacy of the so-called New World. You can drive from London, England to Paris, France in five hours (or get there in under three hours on the EuroStar), but it would take you ten times that long to get from Paris, Maine to Paris, Oregon–if you never stopped to use the bathroom.

I picture how the conversation with my wife would go:

Wife: “You want us to quit our jobs and travel around the United States indefinitely, using up all of our savings to visit some of the smallest, poorest, most rural communities in the country?”

Me: “Yes.”

Wife: “That’s a fairly unappealing proposal.”

Me: “You’re not wrong.”

And so we’ll probably never embark on a grand tour of all of the Athenses or Parises or Romes in North America. Because, if I’m being honest with myself, it’s a really stupid idea.

Instead, I’ve adopted a much less romantic, but eminently more efficient, touring strategy: I travel virtually. About once a week, for fifteen or twenty minutes, I slip on my Google Daydream, fire up Street View, and systematically work my way through one tiny North AmeroEurocapital after another. I can be in Athens, Ohio one minute and Rome, Oregon the next—with enough time in between for a pit stop in Stockholm (Wisconsin), Vienna (Michigan), or Brussels (Ontario). Admittedly, stomping around the world virtually in bulky, low-resolution goggles probably doesn’t confer quite the same connection to history that one might get by standing at dusk on a Morrow County (OH) hill that used to be Rome, or peering out into the Nevadan desert from the ruins of a mining boomtown Athens. But you know what? It ain’t half bad. I made you a small highlight reel below. Safe travels!

The Great North AmeroEuroCapital Tour — November 2017

College kids on cobblestones; downtown Athens, Ohio.

Rewind in Time; Madrid, New Mexico.

Propane and sky; Vienna Township, Genesee County, Michigan.

Our Best Pub is Our Only Pub; Stockholm, Wisconsin.

Eiffel Tower in the off-season; Paris, Texas.

Rome was built in a day; Rome, Oregon (pictured here in its entirety).

If we had hills, they’d be alive; Berne, Indiana.

On your left; Vienna, Missourah.

Fast Times in Metropolitan Maine; Stockholm, Maine.

Bricks & trees in little Belgium; Brussels, Ontario.

Springfield, MO. (Just making sure you’re paying attention.)

Big Data, n. A kind of black magic

The annual Association for Psychological Science meeting is coming up in San Francisco this week. One of the cross-cutting themes this year is “Big Data: Understanding Patterns of Human Behavior”. Since I’m giving two Big Data-related talks (1, 2), and serving as discussant on a related symposium, I’ve been spending some time recently trying to come up with a sensible definition of Big Data within the context of psychological science. This has, in turn, led me to ponder the meaning of Big Data more generally.

After a few sleepless nights spent mulling it over, I’ve concluded that producing a unitary, comprehensive, domain-general definition of Big Data is probably not possible, for the simple reason that different communities have adopted and co-opted the term for decidedly different purposes. For example, in my own field of psychology, the very largest datasets that most researchers currently work with contain, at most, tens of thousands of cases and a few hundred variables (there are exceptions, of course). Such datasets fit comfortably into memory on any modern laptop; you’d have a hard time finding (m)any data scientists willing to call a dataset of this scale “Big”. Yet here we are, heading into APS, with multiple sessions focusing on the role of Big Data in psychological science. And psychology’s not unusual in this respect; we’re seeing similar calls for Big Data this and Big Data that in pretty much all branches of science and every area of the business world. I mean, even the humanities are getting in on the action.

You could take a cynical view of this and argue that all this really goes to show is that people like buzzwords. And there’s probably some truth to that. More pragmatically, though, we should acknowledge that language is this flexible kind of thing that likes to reshape itself from time to time. Words don’t have any intrinsic meaning above and beyond what we do with them, and it’s certainly not like anyone has a monopoly on a term that only really exploded into the lexicon circa 2011. So instead of trying to come up with a single, all-inclusive definition of Big Data, I’ve opted to try and make sense of the different usages we’re seeing in different communities. Below I suggest three distinct, but overlapping, definitions–corresponding to three different ways of thinking about what makes data “Big”. They are, roughly, (1) the kind of infrastructure required to support data processing, (2) the size of the dataset relative to the norm in a field, and (3) the complexity of the models required to make sense out of the data. To a first approximation, one can think of these as engineering, scientific, and statistical perspectives on Big Data, respectively.

The engineering perspective

One way to define Big Data is in terms of the infrastructure required to analyze the data. This is the closest thing we have to a classical definition. In fact, this way of thinking about what makes data “big” arguably predates the term Big Data itself. Take this figure, courtesy of Google Trends:

Notice that searches for Hadoop (a framework for massively distributed data-intensive computing) actually precede the widespread use of the term “Big Data” by a couple of years. If you’re the kind of person who likes to base their arguments entirely on search-based line graphs from Google (and I am!), you have here a rather powerful Exhibit A.

Alternatively, if you’re a more serious kind of person who privileges reason over pretty line plots, consider the following, rather simple, argument for Big Data qua infrastructure problem: any dataset that keeps growing is eventually going to get too big–meaning, it will inevitably reach a point at which it no longer fits into memory, or even onto local storage–and now requires a fundamentally different, massively parallel architecture to process. If you can solve your alleged “big data” problems by installing a new hard drive or some more RAM, you don’t really have a Big Data problem, you have an I’m-too-lazy-to-deal-with-this-right-now problem.
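To make the distinction concrete, here’s a minimal sketch of the kind of single-machine stopgap I have in mind: streaming over a file in chunks, so that only a sliver of the data is in memory at any one time. The file and column names are invented for illustration. If something like this solves your problem, you had an I’m-too-lazy problem; if it doesn’t, you may genuinely be drifting into Big Data territory.

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
# Compute a per-user mean over a CSV far larger than RAM by streaming
# it in million-row chunks; a single machine handles this just fine.
sums: dict = {}
counts: dict = {}
for chunk in pd.read_csv("giant_log.csv", chunksize=1_000_000):
    grouped = chunk.groupby("user_id")["response_time"]
    for uid, s in grouped.sum().items():
        sums[uid] = sums.get(uid, 0.0) + s
    for uid, n in grouped.count().items():
        counts[uid] = counts.get(uid, 0) + n

means = {uid: sums[uid] / counts[uid] for uid in sums}
```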

A real Big Data problem, from an engineering standpoint, is what happens once you’ve installed all the RAM your system can handle, maxed out your RAID array, and heavily optimized your analysis code, yet still find yourself unable to process your data in any reasonable amount of time. If you then complain to your IT staff about your computing problems and they start ranting to you about Hadoop and Hive and how you need to hire a bunch of engineers so you can build out a cluster and do Big Data the way Big Data is supposed to be done, well, congratulations–you now have a Big Data problem in the engineering sense. You now need to figure out how to build a highly distributed computing platform capable of handling really, really, large datasets.

Once the hungry wolves of Big Data have been temporarily pacified by building a new data center (or, you know, paying for an AWS account), you may have to rewrite at least part of your analysis code to take advantage of the massive parallelization your new architecture affords. But conceptually, you can probably keep asking and answering the same kinds of questions with your data. In this sense, Big Data isn’t directly about the data itself, but about what the data makes you do: a dataset counts as “Big” whenever it causes you to start whispering sweet nothings in Hadoop’s ear at night. Exactly when that happens will depend on your existing infrastructure, the demands imposed by your data, and so on. On modern hardware, some people have suggested that the transition tends to happen fairly consistently when datasets get to around 5–10 TB in size. But of course, that’s just a loose generalization, and we all know that loose generalizations are always a terrible idea.
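For the curious, “rewriting your analysis code” here mostly means recasting it in a map-reduce shape: process each shard of the data independently, then combine the partial results. Here’s a toy sketch that uses Python’s multiprocessing module as a stand-in for a real cluster; nothing about it is specific to Hadoop, but the shape of the computation is the same.

```python
from collections import Counter
from multiprocessing import Pool

# Toy map-reduce word count. On a real cluster, each shard would live on
# a different machine; here the "shards" are just in-memory strings.

def map_shard(shard: str) -> Counter:
    """Map step: count words within a single shard, independently."""
    return Counter(shard.split())

def reduce_counts(partials) -> Counter:
    """Reduce step: merge the per-shard counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    shards = ["big data big hype", "big data bigger hype", "small data no hype"]
    with Pool() as pool:
        print(reduce_counts(pool.map(map_shard, shards)))
```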

The scientific perspective

Defining Big Data in terms of architecture and infrastructure is all well and good in domains where normal operations regularly generate terabytes (or even–gasp–petabytes!) of data. But the reality is that most people–and even, I would argue, many people whose job title currently includes the word “data” in it–will rarely need to run analyses distributed across hundreds or thousands of nodes. If we stick with the engineering definition of Big Data, this means someone like me–a lowly social or biomedical scientist who frequently deals with “large” datasets, but almost never with gigantic ones–doesn’t get to say they do Big Data. And that seems kind of unfair. I mean, Big Data is totally in right now, so why should corporate data science teams and particle physicists get to have all the fun? If I want to say I work with Big Data, I should be able to say I work with Big Data! There’s no way I can go to APS and give talks about Big Data unless I can unashamedly look myself in the mirror and say, look at that handsome, confident man getting ready to go to APS and talk about Big Data. So it’s imperative that we find a definition of Big Data that’s compatible with the kind of work people like me do.

Hey, here’s one that works:

Big Data, n. The minimum amount of data required to make one’s peers uncomfortable with the size of one’s data.

This definition is mostly facetious–but it’s a special kind of facetiousness that’s delicately overlaid on top of an earnest, well-intentioned core. The earnest core is that, in practice, many people who think of themselves as Big Data types but don’t own a timeshare condo in Hadoop Land implicitly seem to define Big Data as any dataset large enough to enable new kinds of analyses that weren’t previously possible with smaller datasets. Exactly what dimensionality of data is sufficient to attain this magical status will vary by field, because conventional dataset sizes vary by field. For instance, in human vision research, many researchers can get away with collecting a few hundred trials from three subjects in one afternoon and calling it a study. In contrast, if you’re a population geneticist working with raw sequence data, you probably deal with fuhgeddaboudit amounts of data on a regular basis. So clearly, what it means to be in possession of a “big” dataset depends on who you are. But the point is that in every field there are going to be people who look around and say, you know what? Mine’s bigger than everyone else’s. And those are the people who have Big Data.

I don’t mean that pejoratively, mind you. Quite the contrary: an arms race towards ever-larger datasets strikes me as a good thing for most scientific fields to have, regardless of whether or not the motives for the data embiggening are perfectly cromulent. Having more data often lets you do things that you simply couldn’t do with smaller datasets. With more data, confidence intervals shrink, so effect size estimates become more accurate; it becomes easier to detect and characterize higher-order interactions between variables; you can stratify and segment the data in various ways, and explore relationships with variables that may not have been of a priori interest; and so on and so forth. Scientists, by and large, seem to be prone to thinking of Big Data in these relativistic terms, so that a “Big” dataset is, roughly, a dataset that’s large enough and rich enough that you can do all kinds of novel and interesting things with it that you might not have necessarily anticipated up front. And that’s refreshing, because if you’ve spent much time hanging around science departments, you’ll know that the answer to about 20% of all questions during Q&A periods ends with the words well, that’s a great idea, but we just don’t have enough data to answer that. Big Data, in a scientific sense, is when that answer changes to: hey, that’s a great idea, and I’ll try that as soon as I get back to my office. (Or perhaps more realistically: hey that’s a great idea, and I’ll be sure to try that–as soon as I can get my one tech-savvy grad student to wrangle the data into the right format.)
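(The confidence-interval point, at least, is easy to verify by simulation. Here’s a minimal sketch; the effect size and the sample sizes are arbitrary choices for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# The half-width of a 95% CI for a mean shrinks roughly as 1/sqrt(N).
# The true effect (0.3) and the Ns below are arbitrary illustrations.
for n in [20, 200, 2_000, 20_000]:
    sample = rng.normal(loc=0.3, scale=1.0, size=n)
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)
    print(f"N={n:>6}  mean={sample.mean():+.3f}  95% CI half-width={half_width:.3f}")
```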

It’s probably worth noting in passing that this relativistic, application-centered definition of Big Data also seems to be picking up cultural steam far beyond the scientific community. Most of the recent criticisms of Big Data seem to have something vaguely like this definition in mind. (Actually, I would argue pretty strenuously that most of these criticisms aren’t really even about Big Data in this sense, and are actually just objections to mindless and uncritical exploratory analysis of any dataset, however big or small. But that’s a post for another day.)

The statistical perspective

A third way to think about Big Data is to focus on the kinds of statistical methods required in order to make sense of a dataset. On this view, what matters isn’t the size of the dataset, or the infrastructure demands it imposes, but how you use it. Once again, we can appeal to a largely facetious definition clinging for dear life onto a half-hearted effort at pithy insight:

Big Data, n. The minimal amount of data that allows you to set aside a quarter of your dataset as a hold-out and still train a model that performs reasonably well when tested out-of-sample.

The nugget of would-be insight in this case is this: the world is usually a more complicated place than it appears to be at first glance. It’s generally much harder to make reliable predictions about new (i.e., previously unseen) cases than one might suppose given conventional analysis practices in many fields of science. For example, in psychology, it’s very common to see papers report extremely large R² values from fitted models–often accompanied by claims to the effect that the researchers were able to “predict” most of the variance in the outcome. But such claims are rarely actually supported by the data presented, because the studies in question overwhelmingly tend to overfit their models by using the same data for training and testing (to say nothing of p-hacking and other Questionable Research Practices). Fitting a model that can capably generalize to entirely new data often requires considerably more data than one might expect. The precise amount depends on the problem in question, but I think it’s fair to say that there are many domains in which problems that researchers routinely try to tackle with sample sizes of 20–100 cases would in reality require samples two or three orders of magnitude larger to really get a good grip on.
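If you doubt this, it’s easy to demonstrate to yourself by simulation. Here’s a minimal sketch using scikit-learn: we regress an outcome that is pure noise (true R² of exactly zero) on 20 random predictors in a small sample, and compare the in-sample R² to the R² on a held-out quarter of the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# 100 cases, 20 predictors, and an outcome that is pure noise,
# so the true (population) R-squared is exactly zero.
X = rng.normal(size=(100, 20))
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(f"in-sample R^2:     {model.score(X_train, y_train):.2f}")  # comfortably positive
print(f"out-of-sample R^2: {model.score(X_test, y_test):.2f}")    # near zero or negative
```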

The key point is that when we don’t have a lot of data to work with, it’s difficult to say much of anything about how big an effect is (unless we’re willing to adopt strong priors). Instead, we tend to fall back on the crutch of null hypothesis significance testing and start babbling on about whether there is or isn’t a “statistically significant effect”. I don’t really want to get into the question of whether the latter kind of thinking is ever useful (see Krantz (1999) for a review of its long and sordid history). What I do hope is not controversial is this: if your conclusions are ever in danger of changing radically depending on whether the coefficients in your model are on this side of p = .05 versus that side of p = .05, those conclusions are, by definition, not going to be terribly reliable over the long haul. Anything that helps move us away from that decision boundary and puts us in a position where we can worry more about what our conclusions ought to be than about whether we should be saying anything at all is a good thing. And since the single thing that matters most in that regard is the size of our dataset, it follows that we should want to have datasets that are as Big as possible. If we can fit complex models using lots of features and show that those models still perform well when tested out-of-sample, we can feel much more confident about whatever else we feel inclined to say.

From a statistical perspective, then, one might say that a dataset is “Big” when it’s sufficiently large that we can spend most of our time thinking about what kinds of models to fit and what kinds of features to include so as to maximize predictive power and/or understanding, rather than worrying about what we can and can’t do with the data for fear of everything immediately collapsing into a giant multicollinear mess. Admittedly, this is more of a theoretical ideal than a practical goal, because as Andrew Gelman points out, in practice “N is never large”. As soon as we get our hands on enough data to stabilize the estimates from one kind of model, we immediately go on to ask more fine-grained questions that require even more data. And we don’t stop until we’re right back where we started, hovering at the very edge of our ability to produce sensible estimates, staring down the precipice of uncertainty. But hey, that’s okay. Nobody said these definitions have to be useful; it’s hard enough just trying to make them semi-coherent.

Conclusion

So there you have it: three ways to define Big Data. All three of these definitions are fuzzy, and will bleed into one another if you push on them a little bit. In particular, you could argue that, extensionally, the engineering definition of Big Data is a superset of the other two definitions, as it’s very likely that any dataset big enough to require a fundamentally different architecture is also big enough to handle complex statistical models and to do interesting and novel things with. So the point of all this is not to describe three completely separate communities with totally different practices; it’s simply to distinguish between three different uses of the term Big Data, all of which I think are perfectly sensible in different contexts, but that can cause communication problems when people from different backgrounds interact.

Of course, this isn’t meant to be an exhaustive catalog. I don’t doubt that there are many other potential definitions of Big Data that would each elicit enthusiastic head nods from various communities. For example, within the less technical sectors of the corporate world, there appears to be yet another fairly distinctive definition of Big Data. It goes something like this:

Big Data, n. A kind of black magic practiced by sorcerers known as quants. Nobody knows how it works, but it’s capable of doing anything.

In any case, the bottom line here is really just that context matters. If you go to APS this week, there’s a good chance you’ll stumble across many psychologists earnestly throwing the term “Big Data” around, even though they’re mostly discussing datasets that would fit snugly into a sliver of memory on a modern phone. If your day job involves crunching data at CERN or Google, this might amuse you. But the correct response, once you’re done smiling on the inside, is not, Hah! That’s not Big Data, you idiot! It should probably be something more like Hey, you talk kind of funny. You must come from a different part of the world than I do. We should get together some time and compare notes.

estimating the influence of a tweet–now with 33% more causal inference!

Twitter is kind of a big deal. Not just out there in the world at large, but also in the research community, which loves the kind of structured metadata you can retrieve for every tweet. A lot of researchers rely heavily on Twitter to model social networks, information propagation, persuasion, and all kinds of interesting things. For example, here’s the abstract of a nice recent paper on arXiv that aims to predict successful memes using network and community structure:

We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existing predictive frameworks. We categorize our features into three groups: influence of early adopters, community concentration, and characteristics of adoption time series. We find that features based on community structure are the most powerful predictors of future success. We also find that early popularity of a meme is not a good predictor of its future popularity, contrary to common belief. Our methods outperform other approaches, particularly in the task of detecting very popular or unpopular memes.

One limitation of much of this body of research is that the data are almost invariably observational. We can build sophisticated models that do a good job predicting some future outcome (like meme success), but we don’t necessarily know that the “important” features we identify carry any causal influence. In principle, they could be completely epiphenomenal–for example, in the study I linked to, maybe the community structure features are just a proxy for some other, causally important, factor (e.g., whether the content of a meme has sufficiently broad appeal to attract attention from many different kinds of people). From a predictive standpoint, this may not matter much; if your goal is just to passively predict whether a meme is going to be successful or not, it’s irrelevant whether or not the features you’re using are doing causal work. On the other hand, if you want to actively design memes in such a way as to maximize their spread, the ability to get a handle on causation starts to look pretty important.

How can we estimate the direct causal influence of a tweet on the downstream popularity of a meme? Here’s a simple and (I suspect) very feasible idea in two steps:

  1. Create a small web app that allows any existing Twitter user to register via Twitter authentication. On signing up, a user has to specify just one (optional) setting: the proportion of their intended retweets they’re willing to withhold. Let’s call this the Withholding Fraction (WF).
  2. Every time (or at least some of the time) a registered user wants to retweet a particular tweet*, they do so via the new web app’s interface (which has permission to post to the user’s Twitter account) instead of whatever interface they’re currently using. The key is that the retweet isn’t just obediently passed along; instead, the target tweet is retweeted successfully with probability (1 – WF), and randomly suppressed from the user’s stream with probability (WF).

Doing this would allow the community to very quickly (assuming rapid adoption, which seems reasonably likely) build up an enormous database of tweets that were targeted for retweeting by an active user, but randomly assigned to fail with some known probability. Researchers would then be able to directly quantify the causal impact of individual retweets on downstream popularity–and to estimate that influence conditional on all of the other standard variables, like the retweeter’s number of followers, the content of the tweet, etc. Of course, this still wouldn’t get us to true experimental manipulation of such features (i.e., we wouldn’t be manipulating users’ follower networks, just randomly omitting tweets from users with different followers), but it seems like a step in the right direction**.
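For what it’s worth, the randomization at the heart of step 2 is only a few lines of code. Here’s a hedged sketch; the client, user, and log objects are hypothetical stand-ins for a Twitter API wrapper (something tweepy-like), a registered user record, and a database table, none of which this post specifies.

```python
import random

def handle_retweet_request(client, user, tweet_id, log):
    """Pass a requested retweet along with probability (1 - WF),
    suppress it with probability WF, and log either way.

    `client`, `user`, and `log` are hypothetical stand-ins for a
    Twitter API wrapper, a registered user record (with .id and .wf
    attributes), and a database table, respectively.
    """
    suppressed = random.random() < user.wf
    if not suppressed:
        client.retweet(tweet_id)  # deliver the retweet as usual
    # Log every request, delivered or not: the suppressed cases form the
    # randomized control condition that licenses causal inference.
    log.insert(user_id=user.id, tweet_id=tweet_id,
               withholding_fraction=user.wf, suppressed=suppressed)
```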

I figure building a barebones app like this would take an experienced developer familiar with the Twitter OAuth API just a day or two. And I suspect many people (myself included!) would be happy to contribute to this kind of experiment, provided that all of the resulting data were made public. (I’m aware that there are all kinds of restrictions on sharing assembled Twitter datasets, but we’re not talking about sharing firehose dumps here, just a restricted set of retweets from users who’ve explicitly given their consent to have the data used in this way.)

Has this kind of thing already been done? If not, does anyone want to build it?

 

* It doesn’t just have to be retweets, of course; the same principle would work just as well for withholding a random fraction of original tweets. But I suspect not many users would be willing to randomly eliminate a proportion of their original content from the firehose.

** If we really wanted to get close to true random assignment, we could potentially inject selected tweets into random users’ streams based on selected criteria. But I’m not sure how many tweeps would consent to have entirely random retweets published in their name (I probably wouldn’t), so this probably isn’t viable.

Jirafas

This is fiction.

The party is supposed to start at 7 pm, but of course, no one shows up before 8:45. When the guests finally do arrive, I randomly assign each of them to one of four groups–A through D–as they enter. Each assignment comes with an adhesive 2″ color patch, a nametag, and a sharpie.

“The labels are not for the dinner,” I say, “they’re for the orgy that follows the dinner. The bedrooms are all color-coded; there are strict rules governing inter-cubicular transitions. Please read the manual on the table.”

Nobody moves to pick up the manual. There’s a long and uncomfortable silence, made longer and more uncomfortable by the fact that we can all hear the upstairs neighbors loudly having sex on their kitchen counter.

“Turn on the music,” my wife says. “It masks the sex.”

I put on some music. Something soft, by Elton John, followed by something angry—a duet by Tenacious D and Lynyrd Skynyrd. One of the guests—unsoothed by the music, and noticing the random collection of chairs scattered around the living room—grows restless and asks whether we will all be playing musical chairs this fine evening.

“No,” I reply; “this fine night, we all play Mafia.” Then I shoot him dead as everyone else pretends to stare out the window.

In the kitchen, my wife uncorks the last bottle of wine. As trendy wines go, this one wears its pretension with pride: Jugo de Jirafas, the label proclaims in vermilion Helvetica Neue overtones.

“What does jirafas mean,” I ask my Spanish friend. “Giraffes?”

“No,” she says. “Jirafas was a famous rebel general who came out of hiding during the Spanish Civil War to challenge Franco to a fight to the death. They brawled in the streets for hours, and just when it looked like Jirafas was about to snap Franco’s neck, Franco screamed for his deputies, who immediately pumped several rounds straight through Jirafas’s heart. They say the body continued to bleed courage into the street for several weeks.”

Jugo de Jirafas, I enunciate out loud.

There’s an awkward silence in the living room as the assembled guests all hold an involuntary thirty-second vigil for the dearly departed General Jirafas, who was taken from us much too soon. Poor man—we barely knew him.

Then the vigil is broken up by the arrival of my Brazilian friend João, who lives across the way. Our housing complex is nominally open to all faculty and staff affiliated with the university, but in practice it more or less operates as a kind of hippie commune for expatriate scientists. On any given day you can hear forty different languages being spoken, and stumble across marauding groups of eight-year-old children all babbling away at each other in mutual incomprehension. Walking through our apartment complex is like taking a simultaneous trip through every foreign-language channel on extended cable.

It does have its perks, though. For example, if you want to experience other cultures, you don’t need to travel anywhere. When people suggest that I’ve been working too hard and need a vacation, I yell at João through the bedroom window: how’s Rio this time of year?

Exceptional, he’ll yell back. The cannonball trees are in full bloom. You should come for a visit.

Then I usually take a bottle of wine over—nothing of Jugo de Jirafas caliber, just a basic Zinfandel from Whole Foods—and we sit around and talk about the strange places we’ve lived: Rio and Istanbul for him; Mombasa and Ottawa for me. After dinner we usually play a few games of backgammon, which is not a Brazilian game at all, but is acceptable to play because João spent three years of his life doing a postdoc in Turkey. Thus begins and ends my cosmetic Latin American vacation, punctuated by a detour to the Near East.

Tonight, João shows up with a German lady on his arm. She’s a newly arrived faculty member in the Department of Earth Sciences.

“This is the bad Jew I was telling you about,” he says to the lady by way of introduction.

“It’s true,” I say; “I’m a very bad Jew. Even by Jewish standards.”

She wants to know what makes a Jew a bad Jew. I tell her I eat bacon on the Sabbath and wrap myself in cheeseburgers before bed. And that I make sure to drink the blood of goyim at least four times a year. And that I’m so money-hungry and cunning, I’ve been banned from lending money even to other Jews.

My joke doesn’t go over so well. Germans have had, for obvious reasons, a lot of trouble putting the war behind them. When you make Jew jokes in Germany, people give you a look that’s made up of one part contempt, one part cognitive dissonance. They don’t know what to do; it’s like you’ve lit a warehouse full of bottle rockets up inside their heads all at once. As an American, I don’t mind this, of course. In America, it’s your god-given birthright to make ethnic jokes at your own expense. As long as you’re making fun only of your own in-group and nobody else, no one is allowed to come between you and your chuckles.

The German lady doesn’t see it this way.

“You should not make fun of the Jews,” she says in over-articled English. “Even if you are a one yourself.”

“Well,” says I. “If you can’t laugh at yourself, who can you laugh at?”

She shrugs her shoulders.

“Other people,” offers João.

So I laugh at João, because he’s another person. There’s an uncomfortable pause, but then the earth scientist–whose name turns out to be Brunhilde–laughs too. A moment later, we’re all making small talk again, and I feel pretty confident that any budding crisis in diplomatic relations has been averted.

“Speaking of making fun of others,” João says, “what happened to your lip? It looks like you have the herpes.”

“I damaged myself while flossing,” I tell him.

It’s true: I have a persistent cut on my lip caused by aggressive flossing. It refuses to heal. And now, after several days of incubation, it looks exactly like a cold sore. So I have to walk around my life constantly putting up with herpes jokes.

“I’ll go put something on it,” I say, self-consciously rubbing at the wound. “You just stand here and keep laughing at me, you anti-semite.”

Turns out, I’ve forgotten the name of the lip balm my wife buys. So I walk around the party with a chafed, bloody lip, asking everyone I know if they’ve seen my Tampax. The guests mostly demur quietly, but one particularly mercurial friend looks slightly alarmed, and slowly starts to edge towards the door.

He means Carmex, my wife yells from the kitchen.

Eventually, all of the wine is drunk and the conversation is spent. The guests begin to leave, each one curling his or her self carefully through the doorway in sequence. For some reason, they remind me of ants circling around a drain—but I don’t tell anyone that. There is no longer any music; there was never an orgy. There are no more Jew jokes. I turn the phonograph off—by which I mean I press the stop button on my iTunes playlist—and dim the lights. My wife stays downstairs.

“To do some research,” she says.

Much later, just as I’m making the delicate nightly transition from restless leg syndrome to stage 1 sleep, I’m suddenly jarred wide awake by the sound of someone cursing loudly and repeatedly as they get into bed next to me. I vaguely recognize my wife’s voice, though it sounds different over the haze of near-sleep and a not-insignificant amount of wine.

What’s going on, I ask her.

She mutters that she’s just spent the last hour and a half exhausting the infinite wisdom of Google, circumnavigating the information superhighway, and consulting with various technical support workers scattered all around the Indian subcontinent. And the clear consensus among all sources is that there is not now, and never was, any General Jirafas.

“It just means giraffes,” she says.

…and then there were two!

Last year when I launched my lab (which, full disclosure, is really just me, plus some of my friends who were kind enough to let me plaster their names and faces on my website), I decided to call it the Psychoinformatics Lab (or PILab for short and pretentious), because, well, why not. It seemed to nicely capture what my research is about: psychology and informatics. But it wasn’t an entirely comfortable decision, because a non-trivial portion of my brain was quite convinced that everyone was going to laugh at me. And even now, after more than a year of saying I’m a “psychoinformatician” whenever anyone asks me what I do, I still feel a little bit fraudulent each time–as if I’d just said I was a member of the Estonian Cosmonaut program, or the president of the Build-a-Bear fan club*.

But then… just last week… everything suddenly changed! All in one fell swoop–in one tiny little nudge of a shove-this-on-the-internet button, things became magically better. And now colors are vibrating**, birds are chirping merry songs–no, wait, those are actually cicadas–and the world is basking in a pleasant red glow of humming monitors and five-star Amazon reviews. Or something like that. I’m not so good with the metaphors.

Why so upbeat, you ask? Well, because as of this writing, there is no longer just the one lone Psychoinformatics Lab. No! Now there are not one, not three, not seven Psychoinformatics Labs, but… two! There are two Psychoinformatics Labs. The good Dr. Michael Hanke (of PyMVPA and NeuroDebian fame) has just finished putting the last coat of paint on the inside of his brand new Psychoinformatics Lab at the Otto-von-Guericke University Magdeburg in Magdeburg, Germany. No, really***: his startup package didn’t include any money for paint, so he had to barter his considerable programming skills for three buckets of Going to the Chapel (yes, that’s a real paint color).

The good Dr. Hanke drifts through interstellar space in search of new psychoinformatic horizons.

Anyway, in case you can’t tell, I’m quite excited about this. Not because it’s a sign that informatics approaches are making headway in psychology, or that pretty soon every psychology lab will have a high-performance computing cluster hiding in its closet (one can dream, right?). No sir. I’m excited for two much more pedestrian reasons. First, because from now on, any time anyone makes fun of me for calling myself a psychoinformatician, I’ll be able to say, with a straight face, well it’s not just me, you know–there are multiple ones of us doing this here research-type thing with the data and the psychology and the computers. And second, because Michael is such a smart and hardworking guy that I’m pretty sure he’s going to legitimize this whole enterprise and drag me along for the ride with him, so I won’t have to do anything else myself. Which is good, because if laziness were an Olympic sport, I’d never leave the starting block.

No, but in all seriousness, Michael is an excellent scientist and an exceptional human being, and I couldn’t be happier for him in his new job as Lord Director of All Things Psychoinformatic (Eastern Division). You might think I’m only saying this because he just launched the world’s second PILab, complete with quote from yours truly on said lab’s website front page. Well, you’d be right. But still. He’s a pretty good guy, and I’m sure we’re going to see amazing things coming out of Magdeburg.

Now if anyone wants to launch PILab #3 (maybe in Asia or South America?), just let me know, and I’ll make you the same offer I made Michael: an envelope full of $1 bills (well, you know, I’m an academic–I can’t afford Benjamins just yet) and a blog post full of ridiculous superlatives.

 

* Perhaps that’s not a good analogy, because that one may actually exist.

** But seriously, in real life, colors should not vibrate. If you ever notice colors vibrating, drive to the nearest emergency room and tell them you’re seeing colors vibrating.

*** No, not really.

the seedy underbelly

This is fiction. Science will return shortly.


Cornelius Kipling doesn’t take No for an answer. He usually takes several of them–several No’s strung together in rapid sequence, each one louder and more adamant than the last one.

“No,” I told him over dinner at the Rhubarb Club one foggy evening. “No, no, no. I won’t bankroll your efforts to build a new warp drive.”

“But the last one almost worked,” Kip said pleadingly. “I almost had it down before the hull gave way.”

I conceded that it was a clever idea; everyone before Kip had always thought of warp drives as something you put on spaceships. Kip decided to break the mold by placing one on a hydrofoil. Which, naturally, made the boat too heavy to rise above the surface of the water. In fact, it made the boat too heavy to do anything but sink.

“Admittedly, the sinking thing is a small problem,” he said, as if reading my thoughts. “But I’m working on a way to adjust for the extra weight and get it to rise clear out of the water.”

“Good,” I said. “Because lifting the boat out of the water seems like a pretty important step on the road to getting it to travel through space at light speed.”

“Actually, it’s the only remaining technical hurdle,” said Kip. “Once it’s out of the water, everything’s already taken care of. I’ve got onboard fission reactors for power, and a tentative deal to use the International Space Station for supplies. Virgin Galactic is ready to license the technology as soon as we pull off a successful trial run. And there’s an arrangement with James Cameron’s new asteroid mining company to supply us with fuel as we boldly go where… well, you know.”

“Right,” I said, punching my spoon into my crème brûlée in frustration. The crème brûlée retaliated by splattering itself all over my face and jacket.

“See, this kind of thing wouldn’t happen to you if you invested in my company,” Kip helpfully suggested as he passed me an extra napkin. “You’d have so much money other people would feed you. People with ten or fifteen years of experience wielding dessert spoons.”


After dinner we headed downtown. Kip said there was a new bar called Zygote he wanted to show me.

“Actually, it’s not a new bar per se,” he explained as we were leaving the Rhubarb. “It’s new to me. Turns out it’s been here for several years, but you have to know someone to get in. And that someone has to be willing to sponsor you. They review your biography, look up your criminal record, make sure you’re the kind of person they want at the bar, and so on.”

“Sounds like an arranged marriage.”

“You’re not too far off. When you’re first accepted as a member, you’re supposed to give Zygote a dowry of $2,000.”

“That’s a joke, right?” I asked.

“Yes. There’s no dowry. Just the fee.”

“Two thousand dollars? Really?”

“Well, more like fifty a year. But same principle.”

We walked down the mall in silence. I could feel the insoles of my shoes wrapping themselves around my feet, and I knew they were desperately warning me to get away from Kip while I still had a limited amount of sobriety and dignity left.

“How would anyone manage to keep a place like that secret?” I asked. “Especially on the mall.”

“They hire hit men,” Kip said solemnly.

I suspected he was joking, but couldn’t swear to it. I mean, if you didn’t know Kip, you would probably have thought that the idea of putting a warp drive on a hydrofoil was also a big joke.

Kip led us into one of the alleys off Pearl Street, where he quickly located an unobtrusive metal panel set into the wall just below eye level. The panel opened inwards when we pushed it. Behind the panel, we found a faint smell of old candles and a flight of stairs. At the bottom of the stairs–which turned out to run three stories down–we came to another door. This one didn’t open when we pushed it. Instead, Kip knocked on it three times. Then twice more. Then four times.

“Secret code?” I asked.

“No. Obsessive-compulsive disorder.”

The door swung open.

“Evening, Ashraf,” Kip said to the doorman as we stepped through. Ashraf was a tiny Middle Eastern man, very well dressed. Suede pants, cashmere scarf, fedora on his head. Feather in the fedora. The works. I guess when your bar is located behind a false wall three stories below grade, you don’t really need a lot of muscle to keep the peasants out; you knock them out with panache.

“Welcome to Zygote,” Ashraf said. His bland tone made it clear that, truthfully, he wasn’t at all interested in welcoming anyone anywhere. Which made him exactly the kind of person an establishment like this would want as its doorman.

Inside, the bar was mostly empty. There were twelve or fifteen patrons scattered across various booths and animal-print couches. They all took great care not to make eye contact with us as we entered.

“I have to confess,” I whispered to Kip as we made our way to the bar. “Until about three seconds ago, I didn’t really believe you that this place existed.”

“No worries,” he said. “Until about three seconds ago, it had no idea you existed either.”

He looked around.

“Actually, I’m still not sure it knows you exist,” he added apologetically.

“I feel like I’m giving everyone the flu just by standing here,” I told him.

We took a seat at the end of the bar and motioned to the bartender, who looked to be high on a designer drug chemically related to apathy. She eventually wandered over to us–but not before stopping to inspect the countertop, a stack of coasters with pictures of archaeological sites on them, a rack of brandy snifters, and the water running from the faucet.

“Two mojitos and a strawberry daiquiri,” Kip said when she finally got close enough to yell at.

“Who’s the strawberry daiquiri for,” I asked.

“Me. They’re all for me. Why, did you want a drink too?”

I did, so I ordered the special–a pink cocktail called a Flamingo. Each Flamingo came in a tall Flamingo-shaped glass that couldn’t stand up by itself, so you had to keep holding it until you finished it. Once you were done, you could lay the glass on its side on the counter and watch it leak its remaining pink guts out onto the tile. This act was, I gathered from Kip, a kind of rite of passage at Zygote.

“This is a very fancy place,” I said to no one in particular.

“You should have seen it before the gang fights,” the bartender said before walking back to the snifter rack. I had high hopes she would eventually get around to filling our order.

“Gang fights?”

“Yes,” Kip said. “Gang fights. Used to be big old gang fights in here every other week. They trashed the place several times.”

“It’s like there’s this whole seedy underbelly to Boulder that I never knew existed.”

“Oh, this is nothing. It goes much deeper than this. You haven’t seen the seedy underbelly of this place until you’ve tried to convince a bunch of old money hippies to finance your mass-produced elevator-sized vaporizer. You haven’t squinted into the sun or tasted the shadow of death on your shoulder until you’ve taken on the Bicycle Triads of North Boulder single-file in a dark alley. And you haven’t tried to scratch the dirt off your soul–unsuccessfully, mind you–until you’ve held all-night bargaining sessions with local black hat hacker groups to negotiate the purchase of mission-critical zero-day exploits.”

“Well, that may all be true,” I said. “But I don’t think you’ve done any of those things either.”

I should have known better than to question Kip’s credibility; he spent the next fifteen minutes reminding me of the many times he’d risked his life, liberty, and (nonexistent) fortune fighting to suppress the darkest forces in Northern Colorado in the service of the greater good of mankind.

After that, he launched into his standard routine of trying to get me to buy into the latest round of his inane startup ideas. He told me, in no particular order, about his plans to import, bottle and sell the finest grade Kazakh sand as a replacement for the substandard stuff currently found on American kindergarten sandlots; to run a “reverse tourism” operation that would fly in members of distant cultures to visit disabled would-be travelers in the comfort of their own living rooms (tentative slogan: if the customer can’t come to Muhammad, Muhammad must come to the customer); and to create giant grappling hooks that could pull Australia closer to the West Coast so that Kip could speculate in airline stocks and make billions of dollars once shorter flights inevitably caused Los Angeles-Sydney routes to triple in passenger volume.

I freely confess that my recollection of the finer points of the various revenue enhancement plans Kip proposed that night is not the best. I was a little bit distracted by a woman at the far end of the bar who kept gesturing towards me the whole time Kip was talking. Actually, she wasn’t so much gesturing towards me as gently massaging her neck. But she only did it when I happened to look at her. At one point, she licked her index finger and rubbed it on her neck, giving me a pointed look.

After about forty-five minutes of this, I finally worked up the courage to interrupt Kip’s explanation of how and why the federal government could solve all of America’s economic problems overnight by convincing Balinese children to invest in discarded high school football uniforms.

“Look,” I told him, pointing down to the other side of the bar. “You see? This is why I don’t go to bars any more now that I’m married. Attractive women hit on me, and I hate to disappoint them.”

I raised my left hand and deliberately stroked my wedding band in full view.

The lady at the far end didn’t take the hint. Quite the opposite; she pushed back her bar stool and came over to us.

“Christ,” I whispered.

Kip smirked quietly.

“Hi,” said the woman. “I’m Suzanne.”

“Hi,” I said. “I’m flattered. And also married.”

“I see that. I also see that you have some food in your… neckbeard. It looks like whipped cream. At least I hope that’s what it is. I was trying to let you know from down there, so you could wipe it off without embarrassing yourself any further. But apparently you’d rather embarrass yourself.”

“It’s crème brûlée,” I mumbled.

“Weak,” said Suzanne, turning around. “Very weak.”

After she’d left, I wiped my neck on my sleeve and looked at Kip. He looked back at me with a big grin on his face.

“I don’t suppose the thought crossed your mind, at any point in the last hour, to tell me I had crème brûlée in my beard.”

“You mean your neckbeard?”

“Yes,” I sighed, making a mental note to shave more often. “That.”

“It certainly crossed my mind,” Kip said. “Actually, it crossed my mind several times. But each time it crossed, it just waved hello and kept right on going.”

“You know you’re an asshole, right?”

“Whatever you say, Captain Neckbeard.”

“Alright then,” I sighed. “Let’s get out of here. It’s past my curfew anyway. Do you remember where I left my car?”

“No need,” said Kip, putting on his jacket and clapping his hand to my shoulder. “My hydrofoil’s parked in the Spruce lot around the block. The new warp drive is in. Walk with me and I’ll give you a ride. As long as you don’t mind pushing for the first fifty yards.”

several half-truths, and one blatant, unrepentant lie about my recent whereabouts

Apparently time does a thing that is much like flying. Seems like just yesterday I was sitting here in this chair, sipping on martinis, and pleasantly humming old show tunes while cranking out ~~several high-quality blog posts an hour~~ a mediocre blog post every week or two. But then! Then I got distracted! And blinked! And fell asleep in my chair! And then when I looked up again, 8 months had passed! With no blog posts!

Granted, on the Badness Scale, which runs from 1 to Imminent Apocalypse, this one clocks in at a solid 1.04. But still, eight months is a long time to be gone–about 3,000 internet years. So I figured I’d write a short post about the events of the past eight months before setting about the business of trying (and perhaps failing) to post here more regularly. Also, to keep things interesting, I’ve thrown in one fake bullet. See if you can spot the impostor.

  • I started my own lab! You can tell it’s a completely legitimate scientific operation because it has (a) a fancy new website, (b) other members besides me (some of whom I admittedly had to coerce into ‘joining’), and (c) weekly meetings. (As far as I can tell, these are all the necessary requirements for official labhood.) I decided to call my very legitimate scientific lab the Psychoinformatics Lab. Partly because I like how it sounds, and partly because it’s vaguely descriptive of the research I do. But mostly because it results in a catchy abbreviation: PILab. (It’s pronounced Pieeeeeeeeeee lab–the last 10 e’s are silent.)
  • I’ve been slowly writing and re-writing the Neurosynth codebase. Neurosynth is a thing made out of software that lets neuroimaging researchers very crudely stitch together one giant brain image out of other smaller brain images. It’s kind of like a collage, except that unlike most collages, in this case the sum is usually not more than its parts. In fact, the sum tends to look a lot like its parts. In any case, with some hard work and a very large serving of good luck, I managed to land an R01 grant from the NIH last summer, which will allow me to continue stitching images for a few more years. From my perspective, this is a very good thing, for two reasons. First, because it means I’m not unemployed right now (I’m a big fan of employment, you see); and second, because I’m finding the stitching surprisingly enjoyable. If you enjoy stitching software into brain images, please help out.
  • I published a bunch of papers in 2012, so, according to my CV at least, it was a good year for me professionally. Actually, I think it was a deceptively good year–meaning, I don’t think I did any more work than I did in previous years, but various factors (old projects coming to fruition, a bunch of papers all getting accepted at the same time, etc.) conspired to produce more publications in 2012. This kind of stuff has a tendency to balance out in fairly short order though, so I fully expect to rack up a grand total of zero publications in 2013.
  • I went to Iceland! And England! And France! And Germany! And the Netherlands! And Canada! And Austin, Texas! Plus some other places. I know many people spend a lot of their time on the road and think hopping across various oceans is no big deal, but, well, it is to me, so BACK OFF. Anyway, it’s been nice to have the opportunity to travel more. And to combine business and pleasure. I am not one of those people–I think you call them ‘sane’–who prefer to keep their work life and their personal life cleanly compartmentalized, and try to cram all their work into specific parts of the year and then save a few days or weeks here and there to do nothing but roll around on the beach or ski down frighteningly tall mountains. I find I’m happiest when I get to spend one part of the day giving a talk or meeting with some people to discuss the way the edges of the brain blur when you shake your head, and then another part of the day roaming around De Jordaan asking passers-by, in a stilted Dutch, “where can I find some more of those baby cheeses?”
  • On a more personal note (as the archives of this blog will attest, I have no shame when it comes to publicly divulging embarrassing personal details), my wife and I celebrated our fifth anniversary a few weeks ago. I think this one is called the congratulations, you haven’t killed each other yet! anniversary. Next up: the ten year anniversary, also known as the but seriously, how are you both still alive? decennial. Fortunately we’re not particularly sentimental people, so we celebrated our wooden achievement with some sushi, some sake, and ~~only 500 of our closest friends~~ an early bedtime (no, seriously–we went to bed early; that’s not a euphemism for anything).
  • I contracted a bad case of vampirism while doing some prospecting work in the Yukon last summer. The details are a little bit sketchy, but I have a vague suspicion it happened on that one occasion when I was out gold panning in the middle of the night under a full moon and was brutally attacked by a man-sized bat that bit me several times on the neck. At least, that’s my best guess. But, whatever–now that my disease is in full bloom, it’s not so bad any more. I’ve become mostly nocturnal, and I have to snack on the blood of an unsuspecting undergraduate student once every month or two to keep from wasting away. But it seems like a small price to pay in return for eternal life, superhuman strength, and really pasty skin.
  • Overall, I’m enjoying myself quite a bit. I recently read somewhere that people are, on average, happiest in their 30s. I also recently read somewhere else that people are, on average, least happy in their 30s. I resolve this apparent contradiction by simply opting to believe the first thing, because in my estimation, I am, on average, happiest in my 30s.

Ok, enough self-indulgent rambling. Looking over this list, it wasn’t even a very eventful eight months, so I really have no excuse for dropping the ball on this blogging thing. I will now attempt to resume posting one to two posts a month about brain imaging, correlograms, and schweizel units. This might be a good cue for you to hit the UNSUBSCRIBE button.

R, the master troll of statistical languages

Warning: what follows is a somewhat technical discussion of my love-hate relationship with the R statistical language, in which I somehow manage to waste 2,400 words talking about a single line of code. Reader discretion is advised.

I’ve been using R to do most of my statistical analysis for about 7 or 8 years now–ever since I was a newbie grad student and one of the senior grad students in my lab introduced me to it. Despite having spent hundreds (thousands?) of hours in R, I have to confess that I’ve never set aside much time to really learn it very well; what basic competence I’ve developed has been acquired almost entirely by reading the inline help and consulting the Oracle of ~~Bacon~~ Google when I run into problems. I’m not very good at setting aside time for reading articles or books or working my way through other people’s code (probably the best way to learn), so the net result is that I don’t know R nearly as well as I should.

That said, if I’ve learned one thing about R, it’s that R is all about flexibility: almost any task can be accomplished in a dozen different ways. I don’t mean that in the trivial sense that pretty much any substantive programming problem can be solved in any number of ways in just about any language; I mean that for even very simple and well-defined tasks involving just one or two lines of code there are often many different approaches.

To illustrate, consider the simple task of selecting a column from a data frame (data frames in R are basically just fancy tables). Suppose you have a dataset that looks like this:
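For concreteness, here’s a little ice.cream data frame to play with. The flavor column is the one this example actually turns on; the two numeric columns are invented stand-ins for illustration:

ice.cream = data.frame(
    flavor   = c("chocolate", "vanilla", "strawberry"),  # a string column
    rating   = c(9, 7, 8),                               # invented numeric column
    calories = c(270, 210, 240),                         # invented numeric column
    stringsAsFactors = FALSE                             # keep flavor as plain strings
)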

In most languages, there would be one standard way of pulling columns out of this table. Just one unambiguous way: if you don’t know it, you won’t be able to work with data at all, so odds are you’re going to learn it pretty quickly. R doesn’t work that way. In R there are many ways to do almost everything, including selecting a column from a data frame (one of the most basic operations imaginable!). Here are four of them:
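# four representative idioms (any four would make the point equally well);
# all of them return the same flavor column
ice.cream[, 1]           # by column index
ice.cream[[1]]           # list-style, by index
ice.cream$flavor         # by name, with the dollar sign
ice.cream[["flavor"]]    # list-style, by name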

 

I won’t bother to explain all of these; the point is that, as you can see, they all return the same result (namely, the first column of the ice.cream data frame, named ‘flavor’).

This type of flexibility enables incredibly powerful, terse code once you know R reasonably well; unfortunately, it also makes for an extremely steep learning curve. You might wonder why that would be–after all, at its core, R still lets you do things the way most other languages do them. In the above example, you don’t have to use anything other than the simple index-based approach (i.e., data[,1]), which is the way most other languages that have some kind of data table or matrix object (e.g., MATLAB, Python/NumPy, etc.) would prefer you to do it. So why should the extra flexibility present any problems?

The answer is that when you’re trying to learn a new programming language, you typically do it in large part by reading other people’s code–and nothing is more frustrating to a newbie when learning a language than trying to figure out why sometimes people select columns in a data frame by index and other times they select them by name, or why sometimes people refer to named properties with a dollar sign and other times they wrap them in a vector or double square brackets. There are good reasons to have all of these different idioms, but you wouldn’t know that if you’re new to R and your expectation, quite reasonably, is that if two expressions look very different, they should do very different things. The flexibility that experienced R users love is very confusing to a newcomer. Most other languages don’t have that problem, because there’s only one way to do everything (or at least, far fewer ways than in R).

Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all. But I was reminded of the steepness of that initial learning curve the other day while helping my wife use R to do some regression analyses for her thesis. Rather than explaining what she was doing, suffice it to say that she needed to write a function that, among other things, takes a data frame as input and retains only the numeric columns for subsequent analysis. Data frames in R are actually lists under the hood, so they can have mixed types (i.e., you can have string columns and numeric columns and factors all in the same data frame; R lists basically work like hashes or dictionaries in other loosely-typed languages like Python or Ruby). So you can run into problems if you haphazardly try to perform numerical computations on non-numerical columns (e.g., good luck computing the mean of ‘cat’, ‘dog’, and ‘giraffe’), and hence, pre-emptive selection of only the valid numeric columns is required.
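With the toy ice.cream frame from above, the problem looks like this:

is.numeric(ice.cream$rating)   # TRUE
is.numeric(ice.cream$flavor)   # FALSE
mean(ice.cream$flavor)         # NA, plus a warning that the argument isn't numeric or logical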

Now, in most languages (including R), you can solve this problem very easily using a loop. In fact, in many languages, you would have to use an explicit for-loop; there wouldn’t be any other way to do it. In R, you might do it like this*:

numeric_cols = rep(FALSE, ncol(ice.cream))
for (i in 1:ncol(ice.cream)) numeric_cols[i] = is.numeric(ice.cream[,i])

We allocate memory for the result, then loop over each column and check whether or not it’s numeric, saving the result. Once we’ve done that, we can select only the numeric columns from our data frame with ice.cream[,numeric_cols].

This is a perfectly sensible way to solve the problem, and as you can see, it’s not particularly onerous to write out. But of course, no self-respecting R user would write an explicit loop that way, because R provides you with any number of other tools to do the job more efficiently. So instead of saying “just loop over the columns and check if is.numeric() is true for each one,” when my wife asked me how to solve her problem, I cleverly said “use apply(), of course!”

apply() is an incredibly useful built-in function that implicitly loops over one or more margins of a matrix; in theory, you should be able to do the same work as the above two lines of code with just the following one line:

apply(ice.cream, 2, is.numeric)

Here the first argument is the data we’re passing in, the third argument is the function we want to apply to the data (is.numeric()), and the second argument is the margin over which we want to apply that function (1 = rows, 2 = columns, etc.). And just like that, we’ve cut the length of our code in half!

Unfortunately, when my wife tried to use apply(), her script broke. It didn’t break in any obvious way, mind you (i.e., with a crash and an error message); instead, the apply() call returned a perfectly good vector. It’s just that all of the values in that vector were FALSE. Meaning, R had decided that none of the columns in my wife’s data frame were numeric–which was most certainly incorrect. And because the code wasn’t throwing an error, and the apply() call was embedded within a longer function, it wasn’t obvious to my wife–as an R newbie and a novice programmer–what had gone wrong. From her perspective, the regression analyses she was trying to run with lm() were breaking with strange messages. So she spent a couple of hours trying to debug her code before asking me for help.

Anyway, I took a look at the help documentation, and the source of the problem turned out to be the following: apply() only operates over matrices or vectors, and not on data frames. So when you pass a data frame to apply() as the input, it’s implicitly converted to a matrix. Unfortunately, because matrices can only contain values of one data type, any data frame that has at least one string column will end up being converted to a string (or, in R’s nomenclature, character) matrix. And so now when we apply the is.numeric() function to each column of the matrix, the answer is always going to be FALSE, because all of the columns have been converted to character vectors. So apply() is actually doing exactly what it’s supposed to; it’s just that it doesn’t deign to tell you that it’s implicitly casting your data frame to a matrix before doing anything else. The upshot is that unless you carefully read the apply() documentation and have a basic understanding of data types (which, if you’ve just started dabbling in R, you may well not), you’re hosed.
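You can watch the coercion happen with the toy data frame from earlier:

m = as.matrix(ice.cream)         # this is what apply() implicitly does first
class(m[, "rating"])             # "character": the numbers are strings now
apply(ice.cream, 2, is.numeric)  # flavor: FALSE, rating: FALSE, calories: FALSE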

At this point I could have–and probably should have–thrown in the towel and just suggested to my wife that she use an explicit loop. But that would have dealt a mortal blow to my pride as an experienced-if-not-yet-guru-level R user. So of course I did what any self-respecting programmer does: I went and googled it. And the first thing I came across was the all.is.numeric() function in the Hmisc package which has the following description:

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values.

Perfect! So now the solution to my wife’s problem became this:

library(Hmisc)
apply(ice.cream, 2, all.is.numeric)

…which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution. So I googled some more. And came across a relevant Stack Exchange answer, which had the following simple solution to my wife’s exact problem:

sapply(ice.cream, is.numeric)

You’ll notice that this is virtually identical to the apply() approach that failed silently. That’s no coincidence; sapply() is essentially the list-friendly member of the apply() family (strictly speaking, it’s a thin simplifying wrapper around lapply(), which iterates over the elements of a list). And since data frames are actually lists, there’s no problem passing in a data frame and iterating over its columns. So just like that, we have an elegant one-line solution to the original problem that doesn’t invoke any loops or third-party packages.
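Which means the complete fix for the original keep-only-the-numeric-columns problem, using the toy data, comes out to a single line:

ice.cream[, sapply(ice.cream, is.numeric)]   # keeps rating and calories, drops flavor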

Now, having used apply() a million times, I probably should have known about sapply(). And actually, it turns out I did know about sapply–in 2009. A Spotlight search reveals that I used it in some code I wrote for my dissertation analyses. But that was 2009, back when I was smart. In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. In other words, in 2012, I’m the kind of experienced R user that you might generously call “not very good at R”, and, less generously, “dumb”.
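For the benefit of future me, a quick cheat sheet, using the same toy data frame:

lapply(ice.cream, is.numeric)                     # like sapply(), but returns a list instead of a vector
sapply(ice.cream, is.numeric)                     # simplifies the result to a named vector
vapply(ice.cream, is.numeric, logical(1))         # like sapply(), but you declare the return type up front
tapply(ice.cream$rating, ice.cream$flavor, mean)  # applies the function within groups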

On the plus side, the end product is undeniably cool, right? There are very few languages in which you could achieve so much functionality so compactly right out of the box. And this isn’t an isolated case; base R includes a zillion high-level functions to do similarly complex things with data in a fraction of the code you’d need to write in most other languages. Once you throw in the thousands of high-quality user-contributed packages, there’s nothing else like it in the world of statistical computing.

Anyway, this inordinately long story does have a point to it, I promise, so let me sum up:

  • If I had just ignored the desire to be efficient and clever, and had told my wife to solve the problem the way she’d solve it in most other languages–with a simple for-loop–it would have taken her a couple of minutes to figure out, and she’d probably never have run into any problems.
  • If I’d known R slightly better, I would have told my wife to use sapply(). This would have taken her 10 seconds and she’d definitely never have run into any problems.
  • BUT: because I knew enough R to be clever but not enough R to avoid being stupid, I created an entirely avoidable problem that consumed a couple of hours of my wife’s time. Of course, now she knows about both apply() and sapply(), so you could argue that in the long run, I’ve probably still saved her time. (I’d say she also learned something about her husband’s stubborn insistence on pretending he knows what he’s doing, but she’s already the world-leading expert on that topic.)

Anyway, this anecdote is basically a microcosm of my entire experience with R. I suspect many other people will relate. Basically what it boils down to is that R gives you a certain amount of rope to work with. If you don’t know what you’re doing at all, you will most likely end up accidentally hanging yourself with that rope. If, on the other hand, you’re a veritable R guru, you will most likely use that rope to tie some really fancy knots, scale tall buildings, fashion yourself a space tuxedo, and, eventually, colonize brave new statistical worlds. For everyone in between novice and guru (e.g., me), using R on a regular basis is a continual exercise in alternately thinking “this is fucking awesome” and banging your head against the wall in frustration at the sheer stupidity (either your own, or that of the people who designed this awful language). But the good news is that the longer you use R, the more of the former and the fewer of the latter experiences you have. And at the end of the day, it’s totally worth it: the language is powerful enough to make you forget all of the weird syntax, strange naming conventions, choking on large datasets, and issues with data type conversions.

Oh, except when your wife is ~~yelling at~~ gently reprimanding you for wasting several hours of her time on a problem she could have solved herself in 5 minutes if you hadn’t insisted that she do it the idiomatic R way. Then you remember exactly why R is the master troll of statistical languages.

 

 

* R users will probably notice that I use the = operator for assignment instead of the <- operator even though the latter is the officially prescribed way to do it in R (i.e., a <- 2 is favored over a = 2). That’s because these two idioms are interchangeable in all but one (rare) use case, and personally I prefer to avoid extra keystrokes whenever possible. But the fact that you can do even basic assignment in two completely different ways in R drives home the point about how pathologically flexible–and, to a new user, confusing–the language is.
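For the record, the one case where they diverge is inside a function call, where = names an argument while <- quietly performs a real assignment:

median(x = 1:10)    # 5.5; x here is just the name of median()'s argument
median(x <- 1:10)   # also 5.5, but as a side effect, x now exists in the workspace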

in which I apologize for my laziness, but not really

I got back from the Cognitive Neuroscience Society meeting last week. I was planning to write a post-CNS wrap-up thing like I did last year and the year before that, but I seem to have misplaced the energy that’s supposed to fuel such an exercise. So instead I’ll just say I had a great time and leave it at that. What happens in Chicago stays in Chicago, etc. etc.

Also, I really appreciate all the people who came up to me at CNS and said nice things about this blog–it’s nice to know that someone actually reads this (puzzling, mind you, because I’m not sure why anyone reads this, but nice nonetheless). A couple of people encouraged me to blog more often, so I’m making an effort to do that, though the most likely outcome will be miserable failure. Either that or I’ll just start pasting random YouTube videos in this space. Like this one:

p.s. on re-reading that, it kind of makes it sound like I was swarmed by adoring fans at CNS. To clarify: “all the people” means, like, four people, and the “nice things” were really more like lukewarm “oh yeah, your blog’s not totally awful” sentiments.

p.p.s. I’ve noticed that a lot of my shorter posts take the form of “I was going to write about X, but I’m not actually going to write about X.” I think this is because I’m very lazy but still want partial credit for having good intentions. Which is kind of ridiculous.