I’ve recently started recruiting participants for online experiments via Mechanical Turk. In the past I’ve always either relied on directory listings (like this one) or targeted specific populations (e.g., bloggers and twitterers) via email solicitation. But recently I’ve started running a very large-sample decision-making study (it’s here, if you care to contribute to the sample), and waiting for participants to trickle in via directories isn’t cutting it. So I’ve started paying people (very) small amounts of money for participation.
One challenge I’ve had to deal with is figuring out how to filter out participants who aren’t really interested in contributing to science, and are strictly in it for the money. 20 or 30 cents is a pittance to most people in the developed world, but as I’ve found out the hard way, gaming MTurk appears to be a thriving business in some developing countries (some of which I’ve unfortunately had to resort to banning entirely). Cheaters aren’t so much of an issue for very quick tasks like providing individual ratings of faces, because (a) giving one’s actual opinion doesn’t take substantially longer than giving a fake rating, so there’s little to gain by faking, and (b) the standards for what counts as accurate performance are clear, so it’s easy to train workers and weed out the bad apples. Unfortunately, my studies generally involve fairly long personality questionnaires combined with other cognitive tasks (e.g., in the current study, you get to repeatedly allocate hypothetical money between yourself and a computer partner, and rate some faces). They often take around half an hour, and involve 20+ questions per screen, so there’s a pretty big incentive for workers who are only in it for the cash to produce random responses and try to increase their effective wage. And the obvious question then is how to detect cheating in the data.
One of the techniques I’ve found works surprisingly well is to simply compare each person’s pattern of responses across items with the mean for the entire sample. In other words, you just compute the correlation between each individual’s item scores and the mean scores on those same items across everyone who’s filled out the measure. I know that there’s an entire literature on this stuff, full of much more sophisticated ways to detect random responding, but I find this crude approach really does quite well (I’ve verified this by comparing it with a bunch of other similar metrics), and it has the benefit of being trivial to implement.
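In case it’s useful to anyone, here’s a minimal sketch of that calculation in Python; the DataFrame name and the cutoff below are purely illustrative, not the actual thresholds I use. (Strictly speaking you might want to leave each person out of the mean they’re compared against, but with a few hundred respondents that makes essentially no difference.)

```python
# Minimal sketch: correlate each participant's item responses with the item
# means computed over the whole sample. Assumes `responses` is a pandas
# DataFrame with one row per participant and one column per questionnaire item
# (the names and the r < .1 cutoff are illustrative, not my actual pipeline).
import pandas as pd

def mean_profile_correlations(responses: pd.DataFrame) -> pd.Series:
    item_means = responses.mean(axis=0)  # mean score for each item
    return responses.apply(lambda row: row.corr(item_means), axis=1)

# e.g., flag anyone whose profile barely tracks the sample mean:
# corrs = mean_profile_correlations(responses)
# flagged = corrs[corrs < 0.1].index
```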
Anyway, one of the things that surprised me when I first computed these correlations is just how strong the relationship between the sample mean and most individuals’ responses is. Here’s what the distribution looks like for one particular inventory, the 181-item Analog to Multiple Broadband Inventories (AMBI, which I introduced in this paper, and discuss further here):
This is based on a sample of about 600 internet respondents, which actually turns out to be pretty representative of the broader population, as Sam Gosling, Simine Vazire, and Sanjay Srivastava will tell you (for what it’s worth, I’ve done the exact same analysis on a similar-sized off-line dataset from Lew Goldberg’s Eugene-Springfield Community Sample (check out that URL!) and obtained essentially the same results). In this sample, the median correlation is .48; so, in effect, you can predict a quarter of the variance in a typical participant’s responses without knowing anything at all about them. Human beings, it turns out, have some things in common with one another (who knew?). What you think you’re like is probably not very dissimilar to what I think I’m like. Which is kind of surprising, considering you’re a well-adjusted, friendly human being, and I’m a ~~real freakshow~~ somewhat eccentric, paranoid kind of guy.
What drives that similarity? Much of it probably has to do with social desirability–i.e., many of the AMBI items (and those on virtually all personality inventories) are evaluatively positive or negative statements that most people are inclined to strongly agree or disagree with. But it seems to be a particular kind of social desirability–one that has to do with openness to new experiences, and particularly intellectual ones. For instance, here are the top 10 most endorsed items (based on mean Likert scores across the entire sample; scores are in parentheses):
- like to read (4.62)
- like to visit new places (4.39)
- was a better than average student when I was in school (4.28)
- am a good listener (4.25)
- would love to explore strange places (4.22)
- am concerned about others (4.20)
- am open to new experiences (4.18)
- amuse my friends (4.16)
- love excitement (4.08)
- spend a lot of time reading (4.07)
And conversely, here are the 10 least-endorsed items:
- was a slow learner in school (1.52)
- don’t think that laws apply to me (1.80)
- do not like to visit museums (1.83)
- have difficulty imagining things (1.84)
- have no special urge to do something original (1.87)
- do not like art (1.95)
- feel little concern for others (1.97)
- don’t try to figure myself out (2.01)
- break my promises (2.01)
- make enemies (2.06)
You can see a clear evaluative component in both lists: almost everyone believes that they’re concerned about others and thinks that they’re smarter than average. But social desirability and positive illusions aren’t enough to explain these patterns, because there are plenty of other items on the AMBI that have an equally strong evaluative component–for instance, “don’t have much energy”, “cannot imagine lying or cheating”, “see myself as a good leader”, and “am easily annoyed”–yet have mean scores pretty close to the midpoint (in fact, the item ‘am easily annoyed’ is endorsed more highly than 107 of the 181 items!). So it isn’t just that we like to think and say nice things about ourselves; we’re willing to concede that we have some bad traits, but maybe not the ones that have to do with disliking cultural and intellectual experiences. I don’t have much of an idea as to why that might be, but it does introspectively feel to me like there’s more of a stigma attached to admitting that you don’t like, say, visiting new places or experiencing new things than to admitting that you’re kind of an irritable person. Or maybe it’s just that many of the openness items can be interpreted more broadly than the other evaluative items–e.g., there are lots of different art forms, so almost everyone can endorse a generic “I like art” statement. I don’t really know.
Anyway, there’s nothing the least bit profound about any of this; if anything, it’s just a nice reminder that most of us are not really very good at evaluating where we stand in relation to other people, at least for many traits (for more on that, go read Simine Vazire’s work). The nominal midpoint on most personality scales is usually quite far from the actual median in the general population. This is a pretty big challenge for personality psychology, and if we could figure out how to get people to rank themselves more accurately relative to other people on self-report measures, that would be a pretty huge advance. But it seems quite likely that you just can’t do it, because people simply may not have introspective access to that kind of information.
Fortunately for our ability to measure individual differences in personality, there are plenty of items that do show considerable variance across individuals (actually, in fairness, even items with relatively low variance like the ones above can be highly discriminative if used properly–that’s what item response theory is for). Just for kicks, here are the 10 AMBI items with the largest standard deviations (in parentheses):
- disliked math in school (1.56)
- wanted to run away from home when I was a child (1.56)
- believe in a universal power or god (1.53)
- have felt contact with a divine power (1.51)
- rarely cry during sad movies (1.46)
- am able to fix electrical-wiring problems (1.46)
- am devoted to religion (1.44)
- shout or scream when I’m angry (1.43)
- love large parties (1.42)
- felt close to my parents when I was a child (1.42)
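(For what it’s worth, the lists in this post just come from sorting item-level means and standard deviations; here’s a rough sketch of that, again using a hypothetical participants-by-items DataFrame rather than the actual analysis code.)

```python
# Rough sketch of the item-level summaries behind the lists above, assuming
# the same illustrative `responses` DataFrame as in the earlier snippet.
item_means = responses.mean(axis=0)
item_sds = responses.std(axis=0)

most_endorsed = item_means.sort_values(ascending=False).head(10)   # highest mean Likert scores
least_endorsed = item_means.sort_values().head(10)                 # lowest mean Likert scores
highest_sd = item_sds.sort_values(ascending=False).head(10)        # largest standard deviations
```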
So now finally we come to the real moral of this post… that which you’ve read all this long way for. And the moral is this, grasshopper: if you want to successfully pick a fight at a large party, all you need to do is angrily yell at everyone that God told you math sucks.
I’ve completed surveys on M. Turk. In this case, I clicked on the link in your post to altruistically contribute to science, but I absolutely hated the game, so I didn’t finish it.
Tal,
Great post, I really enjoyed it.
That being said, have you considered the effects of immediate feedback on your results? Perhaps people rate themselves poorly at first, but with immediate feedback they could learn to be better at it?
It’s probably worth looking at (by someone, possibly me).
Jude, I’m sorry you didn’t like it, but thanks for taking the time out to participate! If you have any suggestions for ways to make it more bearable, I’m happy to try to integrate them into the next revision. 🙂
disgruntled, it’s tricky, because you don’t necessarily want to bias subjects to respond more like the mean… after all, the point of personality measures is to determine what’s different about people. So it’s a fine line between throwing out people who clearly aren’t responding appropriately, and allowing for the fact that many people aren’t all that similar to everyone else. In practice, I don’t remove someone unless the correlation is right around zero and there are other indications of non-compliance with instructions (e.g., their face ratings also don’t align with normative ratings). I do give people feedback about their personality scores and how they stack up with everyone else who’s participated, but that’s provided at the very end, so as not to bias any other part of the study.
Tal,
Nice post on pervasive questions.
Maybe it’s getting a bit pernickety, but I was wondering if you could report on the kurtosis: it occurred to me that quite a few of the high-variance items were religion-related–a topic on which people tend to hold quite radical views, at least, I would think, more radical than on how close they felt to their parents. Therefore I was wondering if you could separate the items where the high variance comes from extreme ratings being endorsed by most people from those receiving equally variable but more evenly distributed ratings… (Another trick: compute the variance of the distance from the mean rating.)
Just asking…
Isn’t there a problem here with potentially removing interesting outliers from your sample? This could cause noise amplification, couldn’t it? What about using trap questions to measure consistency across items?
knd, it’s hard to say, because the items are rated on a 5-point Likert scale, so it’s not like you can really have extreme scores. For what it’s worth, the religion items are actually somewhat more leptokurtic (1.6 – 1.7 vs. 1.5 – 1.6 for the other top items), but I wouldn’t put much faith in those numbers given the limited set of possible values. All of the items in the high-SD list are characterized by more frequent endorsement of the extremes (1 and 5) than the midpoint (3).
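(If you want to poke at this yourself, here’s roughly how you could get item-level kurtosis with scipy; the `responses` DataFrame is just a stand-in for a participants-by-items table, and note that scipy defaults to excess (Fisher) kurtosis, so set fisher=False if you want Pearson’s values.)

```python
# Rough sketch: item-level (Pearson's) kurtosis; lower values indicate flatter
# or more bimodal response distributions (lots of 1s and 5s, few 3s).
# `responses` is the same hypothetical participants-by-items DataFrame as above.
from scipy.stats import kurtosis

item_kurt = responses.apply(lambda col: kurtosis(col, fisher=False, nan_policy='omit'), axis=0)
print(item_kurt.sort_values().head(10))
```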
CM, it’s a legitimate concern, but I don’t exclude participants based solely on this measure. They also have to fail other QC indices–e.g., completing the study too quickly, giving too consistent or too variable amounts in the decision-making task, not rating the faces normatively, and so on. If anything, I’m fairly liberal in what I allow in, since I’d rather let a few data points of noise in than refuse to pay a worker money they’ve legitimately earned. The odds of throwing out a legitimate but unusual participant are, in my estimation, exceedingly low. But the key element in any case is that I always exclude participants’ data prior to analysis, so the results remain unbiased even in an (unlikely) worst-case scenario where a couple of legitimate participants might be inadvertently excluded.
Adding catch questions to the questionnaire is possible in theory, and I think it’s a good approach, but because of the way the study is implemented, it would be a bit of a hassle. Plus it would likely require an IRB amendment, so it’s not worth the trouble at this point.
You leapt pretty quickly to trying to find some kind of social desirability explanation. Have you considered that this may be at least partially signal rather than noise/bias? I.e., that in your sample(s), most people actually do like to read and very few people hate museums?
Sanjay, I think that’s definitely the case. That’s basically what I meant when I suggested that the openness items seem broader and more amenable to interpretation. I suspect that if you picked any particular form of art, you wouldn’t get ringing endorsements from a majority of the population, but the fact that items like “I like to read” and “I do not like art” are so vague may mean that almost anyone can endorse them–and do so truthfully.
In any case, I think the broader point isn’t so much about signal vs. noise (in the sense that signal would be a ‘true’ score), it’s about discriminability. A good property for a personality measure to have is that the population be distributed across the entire nominal range of scores, and that’s clearly not the case for most broadband inventories; e.g., Openness scores are often substantially above the midpoint (as the items above illustrate). I don’t doubt that in some sense that may be because most people really are open to experience, as defined operationally. But I don’t know how interesting that is, because you could easily get the mass of the distribution to move up or down the scale as you like just by rewording the items slightly. So maybe my post should just be read as an endorsement of item response theory and nothing else… we could have openness items like “I can’t think of anything I like better than going to museums,” which would distribute scores more evenly–and render silly posts like this one moot!
Is some of the high correlation driven by the fact that lots of people who just don’t know what to say go for a “3” by default?
This is a bit off topic (and probably reveals my ignorance of this field), but I feel that a 4-point Likert scale has advantages over a 5-point one because it prevents that.
The Autism Quotient, for example, which is very widely used, has 4 points, but it actually treats 1 and 2 as equivalent, and likewise 3 and 4. So it’s essentially a binary scale, but when you’re filling it out it doesn’t feel like one, because you can say you’re only *slightly* more likely to want to go to a library than go to a party, or only slightly less likely, rather than having to decide in a black-or-white fashion.