
Systems Papers at CHI – Some Data

Back in 2009 James Landay wrote a thoughtful piece on some of the challenges associated with publishing systems research at a venue like CHI (or UIST). He concluded that the incentive structure just isn’t there to support the greater time and effort needed to build and evaluate systems, especially when compared to other types of research that require less time but still get you the line-item on the CV.

I wanted to try to back up some of this thinking with data, so I wrote a ScraperWiki script to go out and harvest a corpus of previous CHI proceedings (you can edit the script or access the data I collected here). I scraped all paper titles, authors, and abstracts going back to 1999 (the ACM DL changed its page format before then, which is why I didn’t go back further). The dataset ended up being 2,498 papers over 14 years (1999-2012).
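For the curious, the harvesting step boils down to something like this. This is a simplified sketch rather than the actual ScraperWiki script, and the CSS selectors and field names are just placeholders (the real ACM DL markup is messier and has changed over the years):

    import requests
    from bs4 import BeautifulSoup

    def scrape_proceedings(url, year):
        """Collect title, authors, and abstract for each paper listed on a proceedings page."""
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        papers = []
        for entry in soup.select("div.paper-entry"):  # placeholder selector, not the real ACM DL markup
            papers.append({
                "year": year,
                "title": entry.select_one(".title").get_text(strip=True),
                "authors": [a.get_text(strip=True) for a in entry.select(".author")],
                "abstract": entry.select_one(".abstract").get_text(strip=True),
            })
        return papers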

For the sake of the rest of the analysis I define “systems papers” as the subset of papers with an abstract that uses the word “system”. I know it’s not perfect (most likely some false positives in there), but it’s a reasonable proxy and I didn’t have time to go through all 2.5k papers by hand.
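In code the proxy is about as simple as it sounds. Here’s a minimal sketch, reusing the paper records from above; whether to match on a word boundary or a plain substring is a detail, and a substring match would also catch words like “systematic”:

    import re

    def is_systems_paper(abstract):
        """Proxy: does the abstract mention the word 'system' (or 'systems')?"""
        return re.search(r"\bsystems?\b", abstract, re.IGNORECASE) is not None

    # systems_papers = [p for p in papers if is_systems_paper(p["abstract"])]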

One question we might ask is: Do systems papers really require more effort than other papers at CHI? If they take too much effort, a rational researcher might choose to spend time on other types of contributions. In the following graph we can see that, in the last 5 years, systems papers have indeed averaged more authors per paper than other papers at CHI (the assumption being that more authors implies more overall work, though that of course doesn’t always hold). There have also been years in the past when non-systems papers had more authors on average (e.g. 2001 or 2002). Overall the number of authors for systems papers over the period (M=3.61, SD=0.37) is slightly higher than that for non-systems papers (M=3.43, SD=0.21), and the standard deviation is also a bit higher, indicating more variance in the number of authors on systems papers. The difference in means isn’t statistically significant (p=.15). So there is some (weak) evidence that systems papers do have more authors on average.
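For reference, here’s roughly what that comparison looks like in code, working from the yearly averages shown in the graph and reusing the records and proxy from above (the specific test shown is illustrative; treat this as a sketch rather than the exact analysis):

    from collections import defaultdict
    from statistics import mean, stdev
    from scipy import stats

    def yearly_mean_authors(papers, predicate):
        """Mean authors per paper for each year, over papers whose abstract matches `predicate`."""
        by_year = defaultdict(list)
        for p in papers:
            if predicate(p["abstract"]):
                by_year[p["year"]].append(len(p["authors"]))
        return [mean(counts) for _, counts in sorted(by_year.items())]

    # sys_means = yearly_mean_authors(papers, is_systems_paper)
    # other_means = yearly_mean_authors(papers, lambda a: not is_systems_paper(a))
    # print(mean(sys_means), stdev(sys_means))        # M and SD for systems papers
    # t, p = stats.ttest_ind(sys_means, other_means)  # two-sample t-test on the yearly means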

Another question we might ask is: Is the relative amount of systems work published at CHI declining? To see this we can look at the graph below, which shows the fraction of systems papers out of the total for each year. The average fraction of systems papers over the time period (1999-2012) is 0.36 (SD=0.07). There’s a fair bit of variance, with a low in 2007 and a high in 2003. In the last couple of years the fraction of systems papers has been a tad below the mean, but still within one standard deviation. There’s no correlation between fraction and year. From this I think we can conclude that there’s no clear trend in the fraction of systems papers being published at CHI. Moreover, the absolute number of systems papers has gone from 15 in 1999 to 60 in 2012, indicating fair growth in this segment of CHI papers. (It would be really interesting to analyze abstracts from all papers, both accepted and rejected, to see if there is a bias.)
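The per-year fraction and the year/fraction check boil down to something like this (again a sketch; Pearson’s r is just one reasonable way to test for a trend):

    from collections import defaultdict
    from scipy.stats import pearsonr

    def fraction_by_year(papers):
        """Fraction of each year's papers whose abstract mentions 'system'."""
        counts = defaultdict(lambda: [0, 0])   # year -> [systems papers, total papers]
        for p in papers:
            counts[p["year"]][1] += 1
            if is_systems_paper(p["abstract"]):
                counts[p["year"]][0] += 1
        years = sorted(counts)
        return years, [counts[y][0] / counts[y][1] for y in years]

    # years, fractions = fraction_by_year(papers)
    # r, p = pearsonr(years, fractions)   # near-zero r / large p suggests no clear trend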

While the cost of doing systems work in HCI may be higher (i.e. more co-authors needed), the fraction of systems work at CHI doesn’t seem to have been substantially affected over the course of the last 14 years. But it’s still easy to feel like all the action is happening in industry: new products are constantly hitting the market, and start-ups and entrepreneurship are heavily covered by the tech press. The reality is that systems publishing is trucking along and also growing, but, I think, over time it will represent a smaller and smaller fraction of the pie as prototyping becomes “mainstream” and knowledge of HCI continues to diffuse. That may be ok, as long as the research prototypes produced by the academy are sufficiently differentiated from what’s available and possible in the market.

Of CHI and Turk

In the last couple of years Mechanical Turk has gained more and more traction as a tool for doing HCI work, from Kittur et al.’s seminal paper in 2008, to papers this year looking at crowdsourcing visual perception studies and at assessing worker quality. In fact there seem to be so many studies incorporating the Turk in some way that I almost certainly won’t cover them all here. Just a short round-up:

Julie S. Downs et al. Are Your Participants Gaming the System? Screening Mechanical Turk Workers.

Julie and her co-authors were interested in how to use qualifications to sort out the good from the not-so-good workers on MTurk. They did this by having each worker complete a qualification task consisting of two questions, an easy one and a harder one. Each question had a deliberate distractor, but the hard question really required that you carefully read through some of the text in the question.

Interesting findings

  • Only 61% of participants answered both questions correctly
  • Women tended to get the harder question correct more often than men (66% vs. 60%)
  • Older participants were more likely to qualify than younger ones (and young men were especially unlikely to qualify).
  • Non-qualifiers completed the task about 20 seconds faster than qualifiers.

Nick Diakopoulos and Ayman Shamma. Characterizing Debate Performance via Aggregate Twitter Sentiment.

Even though our paper was not about crowdsourcing with Mechanical Turk per se, we did employ MTurk as a method for getting sentiment ratings of Twitter messages. We applied some filters to try to identify not only lousy workers, but also lousily completed tasks (sketched in code below). Specifically, we employed:

  • a temporal filter (if a task was completed too fast, it’s suspicious),
  • a sloppiness filter (if a task is missing some ratings, we suspect the worker was being sloppy),
  • a control filter (if a control message is rated incorrectly with respect to ground truth, we suspect the worker is not deeply processing the message),
  • a worker bias filter (if the worker tends to favor one category more than is reasonable), and
  • an overall worker quality filter, which looks at the ratio of ratings retained after the above filters to the number of ratings discarded by them; if this ratio is below .5 then we discard all of that worker’s other ratings.
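To make that pipeline concrete, here’s a rough sketch of the filtering logic. This isn’t our actual code; the record fields, the temporal threshold, and the bias test are placeholders, and only the 0.5 retained-to-discarded cutoff comes straight from the description above:

    MIN_SECONDS = 10   # hypothetical temporal threshold; the real one depends on the task

    def passes_task_filters(r, controls):
        """Task-level checks: temporal, sloppiness, and control filters."""
        if r["seconds"] < MIN_SECONDS:                        # temporal filter
            return False
        if any(v is None for v in r["ratings"].values()):     # sloppiness filter
            return False
        truth = controls.get(r["message_id"])
        if truth is not None and r["label"] != truth:         # control filter
            return False
        return True

    def filter_worker(rows, controls, bias_threshold=0.8):
        """Apply the task filters, then the worker-level bias and quality filters."""
        kept = [r for r in rows if passes_task_filters(r, controls)]
        discarded = len(rows) - len(kept)
        if kept:
            labels = [r["label"] for r in kept]
            top_share = max(labels.count(l) for l in set(labels)) / len(labels)
            if top_share > bias_threshold:                    # worker bias filter (threshold is illustrative)
                return []
        if discarded and len(kept) / discarded < 0.5:         # worker quality filter
            return []
        return kept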

Even considering all of the filtering we did, our reliability only got up to about 0.66, which in academic circles is moderately good, but certainly not great. In future work we want to see how we can push this up a bit, and more precisely characterize how and which filters work best.

Jeff Heer and Michael Bostock. Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design.

Jeff and Michael did a great job of contributing to our understanding of the validity of experiments run on Mechanical Turk, in particular in relation to visual perception experiments. In contrast to almost every other Mechanical Turk study I’ve seen, they didn’t report any major issues with the quality of the results they were getting. Ultimately they used qualification tasks, but even without them found about 90% of the results to be accurate / useful. This is surprising in relation to Downs et al.’s study, since we might expect there to be a lot more noise. In fact Heer and Bostock did report higher variances than might be expected in a lab environment, but it’s unclear whether this comes from a larger range of display configurations or perhaps just noisier participants. My hunch is that with some clever filtering and better qualification they might push that variance down.

But their study raises an even more important question, which relates to task variability and quality. In Heer and Bostock’s case the tasks were arguably pre-attentive, with the workers’ brains doing the perceptual tasks virtually effortlessly, whereas in our study and in Downs et al.’s the tasks demanded deeper levels of attention and information processing. In some follow-up work that we’re doing at Rutgers on this, we’re exploring how we can get better control and quality for the types of experiments that do require deeper attention.