Of CHI and Turk

In the last couple of years Mechanical Turk has gained more and more traction as a tool for doing HCI work, from Kittur et al.’s seminal paper in 2008 to papers this year on crowdsourcing visual perception studies and on assessing worker quality. In fact, so many studies now incorporate Mechanical Turk in some way that I almost certainly won’t cover them all here. Just a short round-up:

Julie S. Downs et al. Are Your Participants Gaming the System? Screening Mechanical Turk Workers.

Julie and her co-authors were interested in how to use qualifications to sort out the good from the not-so-good workers on MTurk. They did this by having each worker complete a qualification task consisting of two questions, an easy one and a harder one. Each question had a conscious distractor, but the hard question really required that you carefully read through some of the text in the question.

Interesting findings

  • Only 61% of participants answered both questions correctly.
  • Women tended to get the harder question correct more often than men (66% vs. 60%).
  • Older participants were more likely to qualify than younger ones (and young men were even less likely to qualify).
  • Non-qualifiers completed the task about 20 seconds faster than qualifiers.

Nick Diakopoulos and Ayman Shamma. Characterizing Debate Performance via Aggregate Twitter Sentiment.

Even though our paper was not about crowdsourcing with Mechanical Turk per se, we did use MTurk to get sentiment ratings of Twitter messages. We applied some filters to try to identify not only lousy workers, but also lousily completed tasks:

  • A temporal filter: if a task was completed too fast, it’s suspicious.
  • A sloppiness filter: if a task is missing some ratings, we suspect that the worker was being sloppy.
  • A control filter: if a control message is rated incorrectly with respect to some ground truth, we suspect that the worker is not deeply processing the message.
  • A worker bias filter: if the worker tends to favor one category more than is reasonable, that’s suspicious too.
  • An overall worker quality filter: we look at the ratio of ratings retained after the above filters to the number of ratings discarded by them, and if this ratio is below 0.5 we discard all the other ratings from this worker.
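To make the filter chain a bit more concrete, here is a minimal Python sketch of how it could be wired together. This is not the code from our paper: the `Task` record, the function names, the 20-second speed cutoff, and the 90% bias cutoff are all assumptions made up for illustration. Only the 0.5 retained-to-discarded ratio comes from the description above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

# Illustrative thresholds (assumptions), except MIN_KEEP_RATIO,
# which is the 0.5 cutoff described in the text.
MIN_SECONDS = 20.0         # hypothetical "too fast" cutoff
MAX_CATEGORY_SHARE = 0.9   # hypothetical "favors one category" cutoff
MIN_KEEP_RATIO = 0.5       # retained / discarded ratings per worker


@dataclass
class Task:
    """One completed task (hypothetical record layout)."""
    worker_id: str
    seconds: float
    ratings: Dict[str, Optional[str]]  # message id -> label, None if missing
    control_id: str                    # id of the control message in this task
    control_truth: str                 # ground-truth label for the control


def task_passes(task: Task) -> bool:
    """Temporal, sloppiness, and control filters for a single task."""
    if task.seconds < MIN_SECONDS:                         # temporal
        return False
    if any(label is None for label in task.ratings.values()):  # sloppiness
        return False
    if task.ratings.get(task.control_id) != task.control_truth:  # control
        return False
    return True


def biased_workers(tasks: List[Task]) -> Set[str]:
    """Workers whose most frequent label exceeds MAX_CATEGORY_SHARE."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for task in tasks:
        for label in task.ratings.values():
            if label is not None:
                counts[task.worker_id][label] += 1
    biased = set()
    for worker, labels in counts.items():
        total = sum(labels.values())
        if total and max(labels.values()) / total > MAX_CATEGORY_SHARE:
            biased.add(worker)
    return biased


def filter_tasks(tasks: List[Task]) -> List[Task]:
    """Apply all filters and return the tasks whose ratings are kept."""
    biased = biased_workers(tasks)
    passed = [t.worker_id not in biased and task_passes(t) for t in tasks]

    # Count ratings retained and discarded per worker (missing ratings are
    # counted as slots here, which is a simplifying assumption).
    retained: Counter = Counter()
    discarded: Counter = Counter()
    for task, ok in zip(tasks, passed):
        (retained if ok else discarded)[task.worker_id] += len(task.ratings)

    def worker_ok(worker: str) -> bool:
        d = discarded[worker]
        return d == 0 or retained[worker] / d >= MIN_KEEP_RATIO

    return [t for t, ok in zip(tasks, passed) if ok and worker_ok(t.worker_id)]
```

In this sketch a worker with one clean task and three rushed ones loses the clean task as well, because the ratio of retained to discarded ratings falls below 0.5.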

Even considering all of the filtering we did, our reliability only got up to about 0.66, which in academic circles is moderately good, but certainly not great. In future work we want to see how we can push this up a bit more, and to characterize more precisely which filters work best and how.

Jeff Heer and Michael Bostock. Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design.

Jeff and Michael did a great job of contributing to our understanding of the validity of experiments run on Mechanical Turk, in particular in relation to visual perception experiments. In contrast to almost every other Mechanical Turk study I’ve seen, they didn’t report any major issues with the quality of the results they were getting. Ultimately they used qualification tasks, but even without them they found about 90% of the results accurate and useful. This is surprising in relation to Downs et al.’s study, since we might expect there to be a lot more noise. In fact Heer and Bostock did report higher variances than might be expected in a lab environment, but it’s unclear whether this comes from a larger range of display configurations or perhaps just noisier participants. My hunch is that with some clever filtering and better qualification they might push that variance down.

But their study raises an even more important question, which relates to task variability and quality. In Heer and Bostock’s case the tasks were arguably pre-attentive, with the workers’ brains doing the perceptual tasks virtually effortlessly, whereas in my study and in Downs et al.’s the tasks required deeper levels of attention and information processing. In some follow-up work that we’re doing at Rutgers on this, we’re exploring how we can get better control and quality for the types of experiments that do require deeper attention.