Category Archives: Uncategorized

Data on the Growth of CitiBike

On May 27th New York City launched its city-wide bike sharing program, CitiBike. I tried it out last weekend; it was great, aside from a few glitches checking out and checking in the bikes. It made me curious about the launch of the program and how it's growing, especially since the agita between bikers and drivers is becoming quite palpable. Luckily, the folks over at the CitiBike blog have been posting daily stats: the number of rides each day, the average ride duration, and even the most popular stations for starting and ending a ride. If you're interested in hacking on the data, there's even a meetup happening next week.

Below is my simple line chart of the total number of daily riders (measured as of 5pm each day). Here's the data. You might look at the graph and wonder, "What happened on June 7th?" That was the monsoon we had. Yeah, turns out bikers don't like rain.

[Line chart: total daily CitiBike rides]
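If you want to recreate the chart from that data, here's a minimal sketch using pandas and matplotlib (the CSV file and column names are placeholders, not the actual data file):

```python
# Minimal sketch: plot daily CitiBike ride counts.
# Assumes a hypothetical "citibike_daily.csv" with "date" and "rides" columns.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("citibike_daily.csv", parse_dates=["date"]).sort_values("date")

plt.plot(df["date"], df["rides"], marker="o")
plt.xlabel("Date")
plt.ylabel("Total rides (as of 5pm)")
plt.title("Daily CitiBike rides")
plt.gcf().autofmt_xdate()  # angle the date labels so they don't overlap
plt.show()
```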

Mobile Gaming Summit 2012

I have recently been getting more into mobile design and development, so I was excited to attend the Mobile Gaming Summit in New York today. It was a well-attended event, with what seemed like dozens of presenters from top mobile studios sharing tips on everything from user acquisition to design, mobile analytics, cross-platform development, finance, and social. What I wanted to share here quickly are some of the resources mentioned at the summit, because I think they would be useful to any mobile studio / developer who's just starting out (noobs like me!). So, by topic, here are some services to check out:

  • Ad Platforms for user acquisition
  • Analytics
    • Flurry (free analytics platform to help you understand how users are using your app)
    • Bees and Pollen (analytics to help optimize the user experience based on the user)
    • Apsalar
  • Cross-Platform Technologies
    • Corona (uses a scripting language called Lua that I'd never heard of)
    • Marmalade (program in C++, deploy to iOS, Android, Xbox, etc.)
    • PhoneGap (program in JavaScript, HTML, and CSS)
    • Unity (geared toward 3D games)

In general I was impressed with the amount of data-driven design going on in the mobile apps / games space, and how the big studios are really optimizing for attention, retention, and monetization by constantly tweaking things.

Other tips and observations shared included: using Canada as a test market to work out kinks in your apps before launching in the larger U.S. market; concentrating marketing efforts / budget in a short period of time to attain the highest rank in the app store, since that drives more organic growth; and that the industry is moving heavily toward a free-to-play model, with monetization through in-app purchases or advertising.

In the next few weeks I'm excited to try out some of these services with my new app, Many Faces, which launched a couple of weeks ago. I think it's all about the user acquisition / marketing at this point …

Comment Readers Want Relevance!

A couple of years ago I wrote a paper about the quality of comments on online news stories. For the paper I surveyed a number of commenters on sacbee.com about their commenting experience on that site. One of the things users complained about was that comments were often off-topic: they weren't germane, or relevant, to the conversation or to the article they were attached to. This isn't surprising, right? If you've ever read through an online comment thread, you know people post a lot of irrelevant things.

It stands to reason, then, that if we can make news comments more relevant, people might come away more satisfied from the online commenting experience; they might be more apt to read and find and learn new things if the signal-to-noise ratio were a bit higher. The point of this post is to show that there's a straightforward, easy-to-implement way to provide this relevance, and that it coincides with both users' and editors' notions of "quality comments".

I collected data in July via the New York Times API, including 370 articles and 76,086 comments oriented around the topic of climate change. More specifically I searched for articles containing the phrase “climate change” and then collected all articles which had comments (since not all NYT articles have comments). For each comment I also had a number of pieces of metadata, including: (1) the number of times the comment was “recommended” by someone upvoting it, and (2) whether the comment was an “editor’s selection”. Both of these ratings indicate “quality”; one from the users’ point of view and the other from the editors’. And both of these ratings in fact correlate with a simple measure of relevance as I’ll describe next.

In the dataset I collected I also had the full text of both the comments and the articles. Using some basic IR ninjutsu I normalized the text, removed stop words (using NLTK), and stemmed the words with the Porter stemming algorithm. This leaves us with cleaner, less noisy text to work with. I then computed relevance between each comment and its parent article by taking the cosine similarity (normalized dot product) of unigram feature vectors of tf-idf scores. For the purposes of the tf-idf scores, each comment was treated as a document, and only unigrams that occurred at least 10 times in the dataset were included in the feature vectors (again to reduce noise). The outcome of this process is that for each comment-article pair I now had a score between 0 and 1 representing the similarity of the words used in the comment and in the article. A score of 1 would indicate that the comment and article use identical vocabulary, whereas a score of 0 would indicate they have no words in common.
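For the curious, here's a rough sketch of that pipeline, using NLTK for the stop words and stemming and scikit-learn for the tf-idf vectors and cosine similarity. This isn't my exact script: the comment and article lists are placeholders, and scikit-learn's min_df (a document-frequency cutoff) only approximates the "at least 10 occurrences" filter.

```python
# Sketch of the comment-article relevance pipeline: normalize, drop stop words,
# Porter-stem, then compute cosine similarity over unigram tf-idf vectors.
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def clean(text):
    tokens = re.findall(r"[a-z]+", text.lower())            # normalize
    return " ".join(stemmer.stem(t) for t in tokens if t not in stops)

# Placeholders: parallel lists, one comment and its parent article's full text.
comments = ["..."]
articles = ["..."]

docs = [clean(t) for t in comments] + [clean(t) for t in articles]
vec = TfidfVectorizer(min_df=10)   # drop rare unigrams (use min_df=1 for toy data)
X = vec.fit_transform(docs)

n = len(comments)
# similarity between each comment and its own parent article
scores = [cosine_similarity(X[i], X[n + i])[0, 0] for i in range(n)]
```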

So, what's interesting is that this simple-to-compute relevance metric is highly correlated with the recommendation scores and editors' selection ratings mentioned above. The following graph shows the average comment-to-article similarity score for each recommendation score up to 50 (red dots), and a moving-average trend line (blue).

As you get into the higher recommendation scores there's more variance because fewer values are being averaged. But you can see a clear trend: as the number of recommendations increases, so does the average comment-to-article similarity. In statistical terms, Pearson's correlation is r=0.58 (p < .001). There's actually a fair amount of variance around each of those means, though, and the next graph shows the distribution of similarity values for each recommendation score. If you turn your head sideways, each column is a histogram of the similarity values.

We can also look at the relationship between comment-to-article similarity and editors' selections, the comments that editors elevate in the user interface. The average similarity for comments that are not editors' selections is 0.091 (N=73,723), whereas for editors' selections it is 0.118 (N=2,363). A t-test between these distributions indicates that the difference in means is statistically significant (p < .0001). So editors' criteria for selecting comments also correlate with the similarity in language between the comment and the article.
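Both of those checks are easy to run with SciPy. Here's a minimal sketch with toy placeholder data standing in for the per-comment similarity scores, recommendation counts, and editors' selection flags (not the real dataset):

```python
# Sketch: correlation and t-test for the comment relevance analysis.
from scipy import stats

# Toy placeholders for the real per-comment values.
similarity = [0.05, 0.12, 0.09, 0.20, 0.11, 0.07]
recommend_counts = [1, 8, 3, 15, 6, 2]
is_editor_pick = [False, True, False, True, False, False]

# Pearson correlation between similarity and recommendation score
r, p = stats.pearsonr(recommend_counts, similarity)

# t-test: editors' selections vs. all other comments
picks = [s for s, e in zip(similarity, is_editor_pick) if e]
others = [s for s, e in zip(similarity, is_editor_pick) if not e]
t, p_t = stats.ttest_ind(picks, others)

print(r, p, t, p_t)
```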

The implications of these findings are relatively straightforward. A simple metric of similarity (or relevance) correlates well with notions of "recommendation" and editorial selection. This metric could be surfaced in a commenting system's user interface to let users rank comments by how similar they are to the article, without having to wait for recommendation scores or editorial selections. In the future I'd like to look into how predictive such metrics are of recommendation scores, and to try out different similarity metrics, like KL divergence.

Does Local Journalism Need to Be Locally Sustainable?

The last couple of weeks have seen the rallying cries of journalists echo online as they call for support of the Homicide Watch Kickstarter campaign. The tweets “hit the fan” so to speak, Clay Shirky implored us to not let the project die, and David Carr may have finally tipped the campaign with his editorial questioning foundations’ support for Big News at the expense of funding more nimble start-ups like Homicide Watch.

It seems like a good idea too – providing more coverage of a civically important issue – and one that’s underserved to boot. But is it sustainable? As Jeff Sonderman at Poynter wrote about the successful Kickstarter campaign, “The $40,000 is not a sustainable endowment, just a stopgap to fund intern staffing for one year.”

For Homicide Watch to be successful at franchising to other cities (i.e., by selling its platform), each of those franchises itself needs to be sustainable. This implies that, at the local level, enough advertising buy-in, local media support, or crowdfunding (a la Kickstarter) would need to be generated to pay those pesky labor costs, the biggest expense in most content businesses.

Here's the thing. Even though Homicide Watch was funded, it struggled to get there, mostly surviving on the good-natured altruism of the media elite. I doubt that local franchises will be able to repeat that trick. Here's why: most of the donors who gave to Homicide Watch were from elsewhere in the U.S. (68%) or from other countries (10%). Only 22% of donors were from DC, Virginia, or Maryland (see below for details on where the numbers come from). This means that people local to Washington, DC, those who ostensibly would have the most to gain from a project like this, made up barely more than a fifth of the donors. Other local franchises probably couldn't count on the kind of national attention that the media elite brought to the Homicide Watch funding campaign, nor could they count on the national interest afforded to the nation's capital.

You might argue that for something like this to flourish it needs local support, from the people who would get the real utility of the innovation. At least Homicide Watch got a chance to prove itself out, but we’ll have to wait to see if it can make a sustainable business and provide real information utility at a local level. The numbers at this stage would seem to suggest it’s got an uphill battle ahead of it.

Stats
Here's how I got the stats quoted above. I wrote a ScraperWiki script to collect all of the donors on the Homicide Watch Kickstarter page (there were 1,102 as of about noon on 9/12). Of those 1,102, 270 donors had geographic information (city, state, country). The stats quoted above are based on those 270 geotagged donors. That's only about 25% of all donors, so an assumption I make above is that the other 75%, the non-geotagged donors, follow a similar geographic distribution (and donation-size distribution) as the geotagged ones. I can't think of a reason that assumption wouldn't hold. For kicks I put the data up on Google Fusion Tables (it's so awful, please, someone fix that!) so here's a map of which states donors come from.
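In case it's useful, the breakdown itself is simple once the donors are scraped. A minimal sketch, assuming a hypothetical CSV of the geotagged donors with a state column (the file and column names are made up):

```python
# Sketch: geographic breakdown of geotagged Kickstarter donors.
# Assumes a hypothetical "homicide_watch_donors.csv" with one row per geotagged
# donor and a "state" column (US state/district abbreviation, or "INTL").
import pandas as pd

donors = pd.read_csv("homicide_watch_donors.csv")
local = donors["state"].isin(["DC", "VA", "MD"])
intl = donors["state"].eq("INTL")
total = len(donors)

print("Local (DC/VA/MD): %.1f%%" % (100.0 * local.sum() / total))
print("International:    %.1f%%" % (100.0 * intl.sum() / total))
print("Rest of U.S.:     %.1f%%" % (100.0 * (~local & ~intl).sum() / total))
```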

Visualization Performance in the Browser

I've recently embarked on a new project that involves visualizing and animating some potentially large networks as part of a browser-based information tool. So I wanted to compare some of the different JavaScript visualization libraries out there to see how their performance scales. There are tons of options for doing advanced graphics in the browser nowadays, including SVG-based solutions like D3 and Raphael, as well as HTML5 canvas solutions like processing.js, the JavaScript InfoVis Toolkit, sigma.js, and fabric.js.

There are trade-offs between SVG and canvas. Canvas performance scales with the size of the image area, while SVG performance scales with the complexity and size of the scenegraph. SVG also allows control of elements via the DOM and CSS and has much better support for interactivity (i.e., every visual object can have event listeners). This sketch from D3 creator Mike Bostock shows D3 rendering 500 animated circles in SVG at a resolution of 960×500 at roughly 40 FPS in Chrome, whereas rendering the same scene to a canvas element was closer to 30 FPS. Knowing how canvas scales, if the image area were smaller than 960×500 then canvas performance would improve, whereas SVG performance would not change. Of course, your mileage may vary depending on your browser and system – for instance this post found that processing.js (using canvas) outperformed D3 (using SVG) by 20-1000%.

To get a better feel for some of the performance trade-offs (and to take some of the different libraries for a test spin) I developed a quick comparison tool which lets you see performance for D3 (SVG), sigma.js, processing.js, and D3 (rendering to canvas) for different graph sizes (500-5,000 nodes and 1,000-10,000 edges) on an image area of 600×600 pixels. On my system (MBP 2.4GHz, Chrome v.18), D3 (SVG) choked down to about 7 FPS with 1,000 nodes and 2,000 edges when 20% of the nodes' colors were gradually animated. On the same rig, sigma.js managed 19 FPS and processing.js 11 FPS. Using D3 but rendering to canvas did the best, though: 23 FPS.

D3 seems like a great option given the rich set of utilities and functions available, as well as the option to efficiently render directly to canvas if you really need to scale up the number of objects in your scene. Of course this does undo some of the nice interactivity and manipulability features of using SVG …


News Headlines and Retweets

How do you maximize the reach and engagement of your tweets? This is a hugely important question for companies that want to maximize the value of their content. There are even start-ups, like Social Flow, that specialize in optimizing the "engagement" of tweets by helping to time them appropriately. A growing body of research is also looking at what factors, both of the social network and of the content of tweets, affect how often tweets get retweeted. For instance, some of this research has indicated that tweets are retweeted more when they contain URLs and hashtags, when they contain negative or exciting and intense sentiments, and when the user has more followers. Clearly timing matters too: different times of day or days of the week can affect how much attention people are paying to social media (and hence the likelihood that something will get retweeted).

But aside from the obvious step of growing their follower base, what can content creators like news organizations do to increase the retweetability of their tweets? Most news organizations basically tweet out headlines and links to their stories. And the delicate choice of words in writing a headline has always been a bit of a skill and an art. With lots of data, though, we can start being a bit more scientific by looking at which textual and linguistic features of headlines tend to be associated with higher levels of retweets. In the rest of this post I'll present some data that starts to scratch the surface of this.

I collected all tweets from the @nytimes Twitter account between July 1st, 2011 and Sept. 30th, 2011 using the Topsy API. I wanted to analyze somewhat older tweets to make sure that retweeting had run its natural course and that I wasn't truncating the retweeting behavior. Using data from only one news account has the advantage of controlling for the network and audience, letting me focus purely on textual features. In all I collected 5,101 tweets, including how many times each tweet was retweeted (1) using the built-in retweet button and (2) using the old syntax of "RT @username". Of these tweets, 93.7% contained links to NYT content, 1.0% contained links to other content (e.g. yfrog, Instagram, or government information), and 0.7% were retweets themselves. The remaining 4.6% of tweets in my sample had no link.

The first thing I looked at was what the average number of retweets was for the tweets in each group (links to NYT content, links to other content, and no links).

  • Average # of RTs for tweets with links to NYT content: 48.0
  • Average # of RTs for tweets with links to other content: 48.1
  • Average # of RTs for tweets with no links: 83.8

This is interesting because some of the best research out there suggests that tweets WITH links get more RTs. But I found just the opposite: tweets with NO LINKS got more RTs (1.74 times as many on average). I read through the tweets with no links (there are only 234) and they were mostly breaking news alerts like "Qaddafi Son Arrested…", "Dow drops more than 400 points…", or "Obama and Boehner Close to Major Budget Deal…". So from the prior research we know that, for tweet sources in general, URLs are a signal correlated with RTs, but for news organizations the most "newsy" or retweetable information comes in a brief snippet, without a link. The implication is not that news organizations should stop linking their content to get more RTs, but rather that the kind of information shared without links by news organizations (the NYT in particular) is highly retweetable.

To really get into the textual analysis, though, I wanted to look just at tweets with links back to NYT content. So the rest of the analysis was done on the 4,780 tweets with links back to NYT content. These tweets basically take the form: <story headline> + <link>. I broke the dataset up into the top and bottom 10% of tweets (deciles) as ranked by their total number of RTs, which includes RTs using the built-in RT button as well as old-style RTs. The overall average # of RTs was 48.3, but in the top 10% of tweets it was 173 and in the bottom 10% it was 7.4. Here's part of the distribution:


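Incidentally, the decile split itself is straightforward to reproduce. A minimal sketch in pandas, with placeholder rows standing in for the real tweets and their combined RT counts:

```python
# Sketch: split tweets into top and bottom deciles by total RT count.
import pandas as pd

# Placeholder rows; in the real dataset "total_rts" sums built-in and old-style RTs.
tweets = pd.DataFrame({
    "text": ["headline one", "headline two", "headline three"],
    "total_rts": [7, 48, 173],
})

low_cut = tweets["total_rts"].quantile(0.10)
high_cut = tweets["total_rts"].quantile(0.90)

bottom_decile = tweets[tweets["total_rts"] <= low_cut]
top_decile = tweets[tweets["total_rts"] >= high_cut]

print(tweets["total_rts"].mean(),
      top_decile["total_rts"].mean(),
      bottom_decile["total_rts"].mean())
```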
Is length of a tweet related to how often it gets retweeted? I looked at the average length of the tweets (in characters) in the top and bottom 10%.

  • Top 10%: 75.8 characters
  • Bottom 10%: 82.8 characters

This difference is statistically significant using a t-test (t=5.23, p < .0001). So tweets that are in the top decile of RTs are shorter, on average, by about 7 characters. This isn’t prescriptive, but it does suggest an interesting correlation that headline / tweet writers for news organizations might consider exploring.

I also wanted to get a feel for which words were used more frequently in the top or bottom decile. To do this I computed the frequency distribution of words for each dataset (i.e., how many times each unique word was used across all the tweets in that decile). Then for each word I computed a ratio indicating how frequent it was in one decile versus the other. If the ratio is above 1, the word is more likely to occur in one decile than in the other. I've embedded the data at the end of this post in case you want to see the top 50 words ranked by their ratio for both the top and bottom deciles.
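Here's a rough sketch of that ratio computation. The placeholder tweets and the add-one smoothing are simplifications for illustration, not the exact script I ran:

```python
# Sketch: which words are over-represented in top-decile vs. bottom-decile tweets?
import re
from collections import Counter

def word_counts(tweets):
    counts = Counter()
    for t in tweets:
        counts.update(re.findall(r"[a-z']+", t.lower()))
    return counts

# Placeholders: tweet texts from the top and bottom RT deciles.
top_tweets = ["police arrest suspect after hurricane irene", "dow drops 400 points"]
bottom_tweets = ["what i learned from my summer reading list"]

top = word_counts(top_tweets)
bottom = word_counts(bottom_tweets)
top_total = sum(top.values())
bottom_total = sum(bottom.values())

# Ratio of relative frequencies; add-one smoothing avoids dividing by zero.
ratios = {
    w: ((top[w] + 1) / top_total) / ((bottom[w] + 1) / bottom_total)
    for w in set(top) | set(bottom)
}

for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:50]:
    print(word, round(ratio, 2))
```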

From scanning the word lists you can see that pronouns (e.g. “I, you, my, her, his, he” etc.) are used more frequently in tweets from the bottom decile of RTs. Tweets that were in the top decile of RTs were more likely to use words relating to crime (e.g. “police”, “dead”, “arrest”), natural hazards (“irene”, “hurricane”, “earthquake”), sports (“soccer”, “sox”), or politically contentious issues (e.g. “marriage” likely referring to the legalization of gay marriage in NY). I thought it was particularly interesting that “China” was much more frequent in highly RTed tweets. To be clear, this is just scratching the surface and I think there’s a lot more interesting research to do around this, especially relating to theories of attention and newsworthiness.

The last bit of data analysis I did was to look at whether certain parts of speech (e.g. nouns, verbs, adjectives) were used differently in the top and bottom RT deciles. More specifically I wanted to know: are different parts of speech used more frequently in one group than the other? To do this, I used a natural language processing toolkit (NLTK) to compute the parts of speech (POS) of all of the words in the tweets. This isn't a perfect procedure and sometimes the POS tagger makes mistakes, so I consider this analysis preliminary. I then ran a Chi-Square test to see if there was a statistically significant difference in the frequency of nouns, adverbs, conjunctions (e.g. "and", "but", etc.), determiners (e.g. "a", "some", "the", etc.), pronouns, and verbs between the top and bottom 10% of RTs.

What I found is that there is a strong statistically significant difference for adverbs (p < .02), determiners (p < .001), and verbs (p < .003), and somewhat of a difference for conjunctions (p = .06). There was no difference in usage for adjectives, nouns, or pronouns. Basically, in tweets that get lots of RTs, adverbs and determiners (and conjunctions, somewhat) are used substantially less, while verbs are used substantially more. Perhaps it's the less frequent use of determiners and adverbs that (as described above) makes these tweets shorter on average. Again, this isn't prescriptive, but there may be something here in terms of how headlines are written. More use of verbs, and less use of "empty" determiners and conjunctions, is correlated with higher levels of retweeting. Could it be that action words (i.e. verbs) somehow spur people to retweet the headline? Pinning down the causality is something I'll be working on next!
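For anyone who wants to try this, here's the rough shape of that analysis: NLTK for the POS tagging and SciPy for the Chi-Square tests. The coarse two-letter tag buckets and the placeholder tweets are my simplifications for illustration, not the exact procedure:

```python
# Sketch: compare part-of-speech usage between top- and bottom-decile tweets.
# Requires NLTK's "punkt" and "averaged_perceptron_tagger" data packages.
from collections import Counter
import nltk
from scipy.stats import chi2_contingency

def pos_counts(tweets):
    counts = Counter()
    for t in tweets:
        for _, tag in nltk.pos_tag(nltk.word_tokenize(t)):
            counts[tag[:2]] += 1      # coarse buckets: NN, VB, RB, DT, CC, PR, JJ
    return counts

# Placeholders: tweet texts from the top and bottom RT deciles.
top_tweets = ["Police arrest suspect after Hurricane Irene hits the city"]
bottom_tweets = ["I slowly learned that my very long summer reading list was ambitious"]

top = pos_counts(top_tweets)
bottom = pos_counts(bottom_tweets)

# One 2x2 test per POS bucket: this POS vs. all other words, top vs. bottom decile.
for tag in ["NN", "VB", "RB", "DT", "CC", "PR", "JJ"]:
    if top[tag] + bottom[tag] == 0:
        continue  # bucket never appears (only an issue with toy data)
    table = [
        [top[tag], sum(top.values()) - top[tag]],
        [bottom[tag], sum(bottom.values()) - bottom[tag]],
    ]
    chi2, p, _, _ = chi2_contingency(table)
    print(tag, round(chi2, 2), round(p, 4))
```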

Here are the lists of words I promised. If you find anything else notable, please leave a comment!

Tweaking Your Credibility on Twitter

You want to be credible on social media, right? Well, a paper to be published at the Conference on Computer Supported Cooperative Work (CSCW) in early 2012, from researchers at Microsoft and Carnegie Mellon, suggests at least a few actionable methods to help you do so. The basic motivation for the research is that when people see your tweet via a search (rather than by following you) they have fewer cues for assessing credibility. With a better understanding of what factors influence tweet credibility, new search interfaces can be designed to highlight the most relevant credibility cues (now you see why Microsoft is interested).

First off, the researchers interviewed five people to collect a range of issues that might be relevant to credibility perception. They came up with a list of 26 possible credibility cues and then ran a survey with 256 respondents asking how much each feature impacted credibility perception. See the paper for the full results, but, for instance, things like keeping your tweets on a similar topic, using a personal photo, having a username related to the topic, having a location near the topic, having a bio that suggests relevant topical expertise, and tweeting frequently were all perceived by participants to positively impact credibility to some extent. Things like using non-standard grammar and punctuation, or using the default user image, were seen to detract from credibility.

Based on their first survey, the researchers then focused on three specific credibility cues for a follow-on study: (1) topic of tweets (politics, science, or entertainment), (2) user name style (first_last, internet-style – "tenacious27", and topical – "AllPolitics"), and (3) user image (male / female photo, topical icon, generic icon, or default). For the study, each participant (there were 266) saw some combination of the above cues for a tweet and rated both tweet credibility and author credibility. Unsurprisingly, tweets about the science topic were rated as more credible than those on politics or entertainment. The most surprising result to me was that topically relevant user names were rated more credible than traditional names (or internet-style names, though that's not surprising). In a final follow-up experiment the researchers found that the user image doesn't impact credibility perceptions, except when the image is the default, in which case it significantly (in the statistical sense) lowers perceptions of tweet credibility.

So here are the main actionable take-aways:

  • Don't use non-standard grammar and punctuation (no "lol speak")
  • Don’t use the default image.
  • Tweet about topics like science, which seem to carry an aura of credibility.
  • Find a user name that is topically aligned with those you want to reach.

That last point, finding a topically aligned user name, might be an excellent strategy for large news organizations to build a more credible presence across a range of topics. For instance, right now the NY Times has a mix of accounts with topical user names as well as reporters using their real names. In addition to each reporter having their own "real name" account, individual tweets of theirs that are topically relevant could be routed to the appropriate topically named account. So, for instance, let's say Andy Revkin tweets something about the environment. That tweet should also show up via the Environment account, since the tweet may be perceived as more credible coming from a topically related user name. People who search and find that tweet and already know who Andy Revkin is will find it quite credible, since he's known for that topical expertise. But for someone who doesn't know who Andy Revkin is, the study's results suggest that the same content would seem more credible coming from the topically related Environment account. Maybe the Times or others are already doing this. But if not, it seems like there's an opportunity to systematically increase credibility by adopting such an approach.

Modeling Computing and Journalism (Part I)

Recently I've been thinking more about modeling the intersection of computing and journalism, and in particular about ways that aspects of computing might enable or drive innovation in journalism. It struck me that I needed a more precise definition of computing and its purview (I'll come back to the journalism side of the equation in a later post). What, exactly, is computing? I'll try to answer that in this post…

Definitions of computing and computer science abound online, but the most canonical comes perhaps from Peter Denning, an elder in the field of Computer Science. In a CACM article from 2005 he writes, “Computing is the systematic study of algorithmic processes that describe and transform information”. Two key words there: “algorithmic” and “information”. Computing is about information, about describing and transforming it, but also about acquiring, representing, structuring, storing, accessing, managing, processing, manipulating, communicating, and presenting it. And computing is about algorithms: their theory, feasibility, analysis, structure, expression, and implementation. The fundamental question of computing concerns what information processes can be effectively automated.

In modern CS there is a huge body of knowledge that stems from this core notion of computing. For instance, the 2008 Computer Science Curriculum defines 14 different knowledge areas (see list below). The Georgia Tech College of Computing delineates some of these areas as belonging to core computer science, and others as belonging to interactive computing. Roughly, core computer science deals with the conceptual (i.e. mathematical) and operational (i.e. nuts-and-bolts of how a modern computer works) aspects of computing. Interactive computing, on the other hand, mostly deals with information input, modeling, and output. There are aspects of professional practice, engineering, and design that apply to both.

Core Computer Science

  • Discrete Structures, Programming Fundamentals, Software Engineering, Algorithms and Complexity, Architecture and Organization, Operating Systems, Programming Languages, Net Centric Computing, Information Management, Computational Science

Interactive Computing

  • Human Computer Interaction, Graphics and Visual Computing, Intelligent Systems

In terms of modeling the intersection of computing and journalism it’s the interactive side of things that’s most interesting. How information is moved around inside a computer is less important for journalists to understand than the interactive capabilities of information input, modeling, and output afforded by computing.  That is, how does computing interface with the rest of the world? Of course many of the capabilities of computers studied in interactive computing rest on solid foundations of core computer science (e.g. you couldn’t get much done without an operating system to schedule processes and manage data). Core areas with particular relevance to interactive computing are technologies in networking/communications, information management, and to a lesser extent computational science. Below I list more detailed sub-areas for each of the interactive computing and related core areas.

  • Human Computer Interaction (HCI) includes sub-areas such as interaction design, user-centered design, multimedia systems, collaboration, online communities, human-robot interaction, natural interaction, tangible interaction, mobile and ubiquitous computing, wearable computing, and information visualization
  • Graphics and Visual Computing includes sub-areas such as geometric modeling, materials modeling and simulation, rendering, image synthesis, non-photorealistic rendering, volumetric rendering, animation, motion capture, scientific visualization, virtual environments, computer vision, image processing and editing, game engines, and computational photography
  • Intelligent Systems includes sub-areas such as general AI including search and planning, cognitive science, knowledge-based reasoning, agents, autonomous robotics, computational perception, machine learning, natural language processing and understanding, machine translation, speech recognition, and activity recognition
  • Net Centric Computing includes aspects of networking, web architecture, compression, and mobile computing.
  • Information Management includes aspects of database systems, information architecture, query languages, distributed data, data mining, information storage and retrieval, hypermedia, and multimedia databases.
  • Computational Science includes aspects of modeling, simulation, optimization, and parallel computing often oriented towards big data sets.

So what can we do with this detailed typology of interactive computing technology?

In a 2004 CACM article Paul Rosenbloom developed a notation for describing how computing interacts with other fields. In his typology, he articulated ways in which computing could implement, interact with, and embed within other disciplines, namely the physical, life, and social sciences. These different relationships between fields lead to different kinds of ideas for technology (e.g., an embedding relationship of computing in the life sciences suggests the notion of cyborgs; an interaction between computing and the physical sciences suggests robotics). In this spirit, later in this blog series I'll look more specifically at how some of the computing technologies articulated above map to aspects of journalism practice, with an eye toward innovation in journalism by applying computing in new or under-explored ways.

Is data.gov creating jobs?

The recent announcement that Data.gov might shut down due to budget cut-backs got me thinking about whether open data is really worth it. Clive Thompson had written an essay in Wired just a few days earlier, in late March, arguing the economic merits of opening government data. The argument was largely repeated by RWW a few days later. Put the data out there and companies will add value and resell services – so the argument goes.

But what are some of the stumbling blocks to realizing this utopian vision of open data translating into new information jobs?

I found a recent report out of Europe that compares open data strategies across five European countries. What was the main take-away from that report? "Many policy makers also recognize that the precise economic impact of open data for their country, and specific sectors or organizations, remains largely unclear." Bummer, so what do we do now? The major barriers cited in the report include a closed government culture, limited data quality, and, yes, uncertainty about the economic impact. Sorry Mr. Thompson, but your anecdotal evidence just doesn't seem to be convincing the policy-makers yet. And from some of my own anecdotal conversations with journalists, there's a recurring complaint that not much data of value is actually put on data.gov. Why not?

Maybe the data that needs to be on data.gov to create those new jobs we all want really isn't there at all. Janet Vertesi's new article on The Value of Data may offer some clues. In it, she argues that how data is collected influences how that data is shared downstream. As some data becomes commodified we sometimes forget that data (including government data) is often produced through a set of social processes (e.g. sampling strategies like selecting when to record and what to focus on). How that data is collected, and the culture and norms of its recording, shape how it gets shared. It's quite possible that the most valuable government data assets just aren't being shared, either because (1) the culture of their production does not easily allow for it, or (2) there's money to be made by the government itself reselling those assets.

We need to think more about these modes of production when designing future open-data portals: the value of that sharing, and its ultimate impact, may depend on it.

Game-y Information Graphics: Salubrious Nation

Inspired by the recent Design for America contest, I've been advancing the notion of Game-y Information Graphics with Rutgers Ph.D. student Funda Kivran-Swaine. Using data published by the HHS Community Health Data Initiative we designed a game-y info graphic called Salubrious Nation. The idea is pretty simple really: we're exploring the application of aspects of game design such as goals, scores, and advancement to news and information graphics. How do users understand an info graphic differently when it's presented as a game? Do they have different insights? Do they explore more of the underlying data which drives the graphic?

"Salubrious," as we call it, asks players to guess the community health of counties across the nation, using hints such as demographic data and map-based visual feedback to inform their guesses. It's heavily data-driven: the closer the player's guess is to the actual value, the more points they get. The fun is in trying to use the statistics that are revealed (things like poverty rate, life expectancy, and unemployment rate) to guess the hidden data (such as obesity, smoking, or air pollution). At the end of a series of increasingly difficult "levels" the player can see how they stack up against other people who have completed the game. Try it out here.
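Purely as an illustration (this is not Salubrious Nation's actual scoring formula), a guess-scoring function along those lines might look something like this:

```python
# Illustrative only: award more points the closer a guess is to the true value.
# The max_points value and the linear falloff are arbitrary choices for this
# sketch, not the game's real scoring rules.
def score_guess(guess, actual, value_range, max_points=100):
    error = abs(guess - actual) / float(value_range)   # normalized error, 0..1
    return int(round(max_points * max(0.0, 1.0 - error)))

# e.g. guessing an obesity rate of 28% when the county's actual rate is 31%,
# on a 0-100 scale:
print(score_guess(28, 31, value_range=100))  # -> 97
```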

One nice aspect of this approach is that as we develop more game mechanics that fit the genre, different data sources can be plugged in very easily. Another hope is that such game-y presentations of data will encourage users to engage more deeply with the content. We'll be assessing these properties of the medium more formally this summer and hope to report the results soon!