Category Archives: visualization

OpenVis is for Journalists!

Note: A version of the following also appears on the Tow Center blog.

Last week I attended the OpenVis Conference in Boston, a smorgasbord of learning dedicated to exploring the use and application of data visualization on the open web (that is, built with open rather than proprietary standards). It was hard not to get excited, with a headline keynote from Mike Bostock, the original creator of the popular D3 library for data visualization and now a graphics editor at the New York Times.

Given that news organizations are leading the way with online data storytelling, it was perhaps unsurprising to find a number of journalists presenting at the conference. Kennedy Elliott of the Washington Post talked about coding for the news, imploring attendees to think more like journalists. And we also heard from Lisa Strausfeld and Christopher Cannon, who run the new Bloomberg Visual Data lab, and from Lena Groeger at ProPublica, who spoke about “thinking small” in visualization.

But even the less overtly journalistic talks somehow seemed to have strong ties and implications for journalism, on everything from storytelling and authoring tools to analytic methodologies. Let me pick on just a few talks that exposed some particularly relevant implications for data journalism.

First up, David Mimno, a professor at Cornell, gave a tour of his work visualizing machine learning algorithms online to help students learn how those algorithms work. He demonstrated old classics like k-means and linear regression, but the algorithms became palpable as they came to life through animated visualizations. Another example of this comes from the machine learning demos page, which animates and presents an even greater number of algorithms. Where I think this gets really important for journalists is the whole idea of algorithmic accountability: the ability to use visualization as a way for journalists to be transparent about the algorithms they use in their reporting.
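To make this concrete, here is a minimal sketch (my own illustration, not Mimno’s code) of the single k-means iteration that those animated demos replay over and over:

```javascript
// One iteration of k-means, the kind of step the demos animate:
// assign each point to its nearest centroid, then move each centroid
// to the mean of its assigned points. Points and centroids are [x, y].
function kmeansStep(points, centroids) {
  const dist2 = (a, b) => (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2;
  // Assignment step: index of the nearest centroid for each point.
  const assignment = points.map(p => {
    let best = 0;
    centroids.forEach((c, i) => {
      if (dist2(p, c) < dist2(p, centroids[best])) best = i;
    });
    return best;
  });
  // Update step: each centroid moves to the mean of its cluster.
  const updated = centroids.map((c, i) => {
    const members = points.filter((_, j) => assignment[j] === i);
    if (members.length === 0) return c; // keep empty clusters in place
    const mx = members.reduce((s, p) => s + p[0], 0) / members.length;
    const my = members.reduce((s, p) => s + p[1], 0) / members.length;
    return [mx, my];
  });
  return { assignment, centroids: updated };
}
```

In an animated version, each call to kmeansStep yields the next frame, and a library like D3 can interpolate the centroid positions between frames so the viewer sees the clusters settle into place.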

A good example of where this is already happening is the explanation of the NYT4thDownBot, where authors Brian Burke and Kevin Quealy use a visualization of a football field (shown below) to explain how their predictive model differs from what actual football coaches tend to do. To the extent that algorithms are deserving of our scrutiny, visualization methods that communicate what they are doing and somehow make them legible to the public seem incredibly powerful and important for us to work more on.

Alexander Howard recently wrote about “the difficult, complicated process of reporting on data as a source” while being as open and transparent as possible. If there’s one thing the recent launch of 538 has taught us, it’s that there’s a need (and demand) to make the data, and even the code or models, available for data journalism projects.

People are already developing workflows and tools to make this possible online. Another great talk at OpenVis was by Dr. Jake Vanderplas, an astrophysicist working at the University of Washington, who has developed some really amazing open source technology that lets you create interactive D3 visualizations in the browser directly from IPython notebooks. Jake’s work on visualization takes us one step closer to enabling a complete end-to-end workflow for data journalists: data, analysis, and code can sit in the browser and directly render interactive visualizations for the end user. The whole stack is transparent and could potentially even enable the user to tweak, tune, or test variations. To the extent that reproducibility of data journalism projects becomes important for maintaining the trust of the audience, these sorts of platforms are certainly worth learning more about.

Given its emphasis on openness, and the clear ties between openness, transparency, and creating news content online, expect OpenVis to continue to develop next year as a key destination for journalists looking to learn more about visualization.

Making Data More Familiar with Concrete Scales

Note: A version of the following also appears on the Tow Center blog.


As part of their coverage of the Snowden leaks, last month the Guardian published an interactive to help explain what the NSA data collection activities mean for the public. Above is a screenshot of part of the piece. It allows the user to input the number of friends they have on Facebook and see a typical number of 1st, 2nd (friends-of-friends), and 3rd (friends-of-friends-of-friends) degree connections as compared to places where you typically find different numbers of people. So 250 friends is more than the capacity of a subway car, 40,850 friends-of-friends is more than would fit in Fenway Park, and 6.7 million 3rd degree connections is bigger than the population of Massachusetts.

When we tell stories with data it can be hard for readers to grasp units or measures that are outside normal human experience, or outside their own personal experience. How much *is* 1 trillion dollars, or 200 calories, really? Unless you’re an economist or a nutritionist, respectively, it might be hard to say. Abstract measures and units benefit from being made more concrete. The idea behind the Guardian interactive was to take something abstract, like a big number of people, and compare it to something more spatially familiar and tangible to help drive it home and make it real.

Researchers Fanny Chevalier, Romain Vuillemot, and Guia Gali have been studying the use of such concrete scales in visualization and recently published a paper detailing some of the challenges and practical steps we can use to more effectively employ these kinds of scales in data journalism and data visualization.

In the paper they describe a few different strategies for making concrete scales, including unitization, anchoring, and analogies. Shown in the figure below: (a) unitization is the idea of re-expressing one object in terms of a collection of more familiar objects (e.g. the mass of Saturn is 97 times that of Earth); (b) anchoring uses a familiar object, like the size of a match head, to make the size of an unfamiliar object (e.g. a tick in this case) more concrete; and (c) analogies make parallel comparisons to familiar objects (e.g. an atom is to a marble as a human head is to the Earth).
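In code, unitization amounts to a division plus some deliberately loose rounding, since concrete scales trade precision for familiarity. A sketch of my own, not from the paper:

```javascript
// Unitization: re-express a quantity as a multiple of a more familiar one,
// e.g. a planet's mass in Earth masses, or a height in people stacked up.
// Rounding is intentionally coarse; concrete scales are approximations.
function unitize(value, familiarUnit, label) {
  const times = value / familiarUnit;
  // Whole numbers for big multiples, one decimal for small ones.
  const rounded = times >= 10 ? Math.round(times) : Math.round(times * 10) / 10;
  return `${rounded} ${label}`;
}
```

For example, unitize(381, 1.7, 'people') re-expresses the Empire State Building’s roughly 381-meter roof height as about 224 average-height people stacked up (both figures are my assumptions for illustration).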

All of these techniques are really about visual comparison to the more familiar. But the familiar isn’t necessarily exact. For instance, if I were to compare the height of the Empire State Building to a number of people stacked up, I would need to use the average height of a person, which is really an idealized approximation. So it’s important to think about the precision of the visual comparisons you might be setting up with concrete scales.

Another strategy often used with concrete scales is containment, which can be useful to communicate impalpable volumes or collections of material. For example you might want to make visible the amount of sugar in different sizes of soda bottles by filling plastic bags with different amounts of granular sugar. Again, this is an approximate comparison but also makes it more familiar and material.

So, how can you design data visualizations to effectively use concrete scales? First, ask whether the unit is unfamiliar, or whether its magnitude is so extreme that it would be difficult to comprehend. Then find a good comparison unit that is more familiar to people. Does it make sense to unitize, anchor, or use an analogy? And if you use an anchor or container, which one should you choose? The answers to these questions will depend on your particular design situation as well as the semantics of the data you’re working with. A number of examples that the researchers have tagged are available online.

The individual nature of “what is familiar” raises the question of personalizing concrete scales too. Michael Keller’s work for Al Jazeera lets you compare the number of refugees from the Syrian conflict to a geographic extent in the US, essentially letting the user’s own familiarity with geography guide what area they want to compare as an anchor. What if this type of personalization could also be automated? Consider logging into Facebook or Twitter and having the visualization adapt its concrete scales to the places, objects, or organizations you’re most familiar with, based on your profile information. This type of automated adaptation could make such visual depictions of data much more personally relevant and interesting.

Even though concrete scales are often used in data visualizations in the media, it’s worth realizing that some open questions remain. How do we define whether an anchor or unit is “familiar,” and what makes one concrete unit better than another? Perhaps some scales make people feel like they understand the visualization better, or help the reader remember it better. There are still many open questions for empirical research.

Storytelling with Data Visualization: Context is King

Note: A version of the following also appears on the Tow Center blog.

Data is like a freeze-dried version of reality, abstracted sometimes to the point where it can be hard to recognize and understand. It needs some rehydration before it becomes tasty (or even just palatable) storytelling material again — something that visualization can often help with. But to fully breathe life back into your data, you need to crack your knuckles and add a dose of written explanation to your visualizations as well. Text provides that vital bit of context layered over the data that helps the audience come to a valid interpretation of what it really means.

So how can you use text and visualization together to provide that context and layer a story over your data? Some recently published research by myself and collaborators at the University of Michigan offers some insights.

In most journalistic visualization, context is added to data visualization through the use of labels, captions, and other annotations — texts — of various kinds. Indeed, on the Economist Graphic Detail blog, visualizations not only have integrated textual annotations, but an entire 1-2 paragraph introductory article associated with them. In addition to adding an angle and story to the piece, such contextual journalism helps flesh out what the data means and guides the reader’s interpretation towards valid inferences from the data. Textual annotations integrated directly with a visualization can further guide the users’ interactions, emphasizing certain points, prioritizing particular interpretations of data, or pre-empting the user’s curiosity on seeing a salient outlier, aberration, or trend.

To answer the question of how textual annotations function as story contextualizers in online news visualization, we analyzed 136 professionally made news visualizations produced by the New York Times and the Guardian between 2000 and July 2012. Of course we found text used for everything from axis labels, author information, sources, and data provenance, to instructions, definitions, and legends, but we were less interested in studying these kinds of uses than in annotations more directly related to data storytelling.

Based on our analysis we recognized two underlying functions for annotations: (1) observational, and (2) additive. Observational annotations provide context by supporting reflection on a data value or group of values that are depicted in the visualization. These annotations facilitate comparisons and often highlight or emphasize extreme values or other outliers. For interactive graphics they are sometimes revealed when hovering over a visual element.

A basic form of observational messaging is apparent in the following example from the New York Times, showing the population pyramid in the U.S. On the right of the graphic, text clearly indicates observations of the total number and fraction of the population expected to be over age 65 by 2015. This is information that can be observed in the graph but is reinforced through the use of text.

Another example from the Times shows how observational annotations can be used to highlight and label extremes on a graph. In the chart below, the U.S. budget forecast is depicted, and the low point of 2010 is highlighted with a yellow circle together with an annotation. The value and year of that point are already visible in the graph, which is what makes this kind of annotation observational. Consider using observational annotations when you want to underscore something that’s visible in the visualization, but which you really want to make sure the user sees, or when there is an interesting comparison that you would like to draw the user’s attention towards.

On the other hand, additive annotation provides context that is external to the visual representation and not clearly depicted via the data. These are things that are relevant to the topic or to understanding the data, like background or contemporaneous events or actions. It’s up to you to decide which dimensions of who, what, where, when, why, and how are relevant. If you think the viewer needs to be aware of something in order to interpret the data correctly, then an additive annotation might be appropriate.

The following example from The Minneapolis Star Tribune shows changes in home prices across counties in Minnesota with reference to the peak of the housing bubble, a key bit of additive annotation attached to the year 2007. At the same time, the graphic also uses observational annotation (on the right side) by labeling the median home price and percent change since 2007 for the selected county.

Use of these types of annotation is very prevalent; in our study of 136 examples we found 120 (88.2%) used at least one of these forms of annotation. We also looked at the relative use of each, shown in the next figure. Observational annotations were used in just shy of half of the cases, whereas additive were used in 73%.

Another dimension to annotation is what scope of the visualization is being referenced: an individual datum, a group of data, or the entire view (e.g. a caption-like element). We tabulated the prevalence of these annotation anchors and found that single datum annotations are the most frequently used (74%). The relative usage frequencies are shown in the next figure. Your choice of what scope of the visualization to annotate will often depend on the story you want to tell, or on what kinds of visual features are most visually salient, such as outliers, trends, or peaks. For instance, trends that happen over longer time-frames in a line-graph might benefit from a group annotation to indicate how a collection of data points is trending, whereas a peak in a time-series would most obviously benefit from an annotation calling out that specific data point.

The two types of annotation, and three types of annotation anchoring are summarized in the following chart depicting stock price data for Apple. Annotations A1 and A2 show additive annotations attached to the whole view, and to a specific date in the view, whereas O1 and O2 show observational annotations attached to a single datum and a group of data respectively.
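One way to operationalize this taxonomy is as a small annotation spec that a rendering layer can dispatch on. Everything below is a hypothetical structure of my own devising, not something from the paper; the ids and text are made-up examples:

```javascript
// Hypothetical annotation specs following the taxonomy above: each
// annotation has a type (observational | additive) and an anchor
// (datum | group | view), plus its text and, where relevant, a target.
const annotations = [
  { id: 'A1', type: 'additive', anchor: 'view',
    text: 'Context for the whole chart (caption-like)' },
  { id: 'O1', type: 'observational', anchor: 'datum',
    target: { date: '2008-01-15' }, text: 'Highest close of the period' },
];

// A renderer can then dispatch on the anchor to decide placement.
function placement(a) {
  return { view: 'caption', group: 'bracket', datum: 'callout' }[a.anchor];
}
```

Separating the annotation data from its rendering this way also makes it easy to count types and anchors across a corpus, which is essentially what our tabulations above did by hand.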

As we come to better understand how to tell stories with text and visualization together, new possibilities also open up for how to integrate text computationally or automatically with visualization.

In our research we used the above insights about how annotations are used by professionals to build a system that analyzes a stock time series (together with its trade volume data) to look for salient points and automatically annotate the series with key bits of additive context drawn from a corpus of news articles. By ranking relevant news headlines and then deriving graph annotations, we were able to automatically generate contextualized stock charts and create a user experience in which users felt they had a better grasp of the trends and oscillations of the stock.
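The core of the idea can be sketched in a few lines: find a salient point in the series, then attach the best-matching headline. The real system ranks headlines on textual relevance as well; this toy version, an assumption for illustration only, uses date proximity alone and treats the largest day-over-day change as the salient point:

```javascript
// Toy version of the pipeline: pick the most salient point (largest
// day-over-day change) and attach the nearest-dated headline to it.
// series: [{date, price}] sorted by date; headlines: [{date, text}].
function annotateSeries(series, headlines) {
  let salient = 1;
  for (let i = 2; i < series.length; i++) {
    const change = Math.abs(series[i].price - series[i - 1].price);
    const best = Math.abs(series[salient].price - series[salient - 1].price);
    if (change > best) salient = i;
  }
  const day = d => new Date(d).getTime();
  const nearest = headlines.reduce((a, b) =>
    Math.abs(day(b.date) - day(series[salient].date)) <
    Math.abs(day(a.date) - day(series[salient].date)) ? b : a);
  return { date: series[salient].date, headline: nearest.text };
}
```

The output of a step like this is exactly an additive annotation anchored to a single datum, in the terms of the taxonomy above.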

On one hand we have the fully automated scenario, but in the future, more intelligent graph authoring tools for journalists might also incorporate such automation to suggest possible annotations for a graph, which an editor could then tweak or re-write before publication. So not only can the study of news visualizations help us understand the medium better and communicate more effectively, but it can also enable new forms of computational journalism to emerge. For all the details please see our research paper, “Contextifier: Automatic Generation of Annotated Stock Visualizations.”

Storytelling with Data: What Are the Impacts on the Audience?

Storytelling with data visualization is still very much in its “Wild West” phase, with journalism outlets blazing new paths in exploring the burgeoning craft of integrating the testimony of data together with compelling narrative. Leaders such as The New York Times create impressive data-driven presentations like 512 Paths to the White House (seen above) that weave complex information into a palatable presentation. But as I look out at the kinds of meetings where data visualizers converge, like Eyeo, Tapestry, OpenVis, and the infographics summit Malofiej, I realize there’s a whole lot of inspiration out there, and some damn fine examples of great work, but I still find it hard to get a sense of direction — which way is West, which way to the promised land?

And it occurred to me: We need a science of data-visualization storytelling. We need some direction. We need to know what makes a data story “work”. And what does a data story that “works” even mean?

Examples abound, and while we have theories for color use, visual salience and perception, and graph design that suggest how to depict data efficiently, we still don’t know, with any particular scientific rigor, which are better stories. At the Tapestry conference, which I attended, journalists such as Jonathan Corum, Hannah Fairfield, and Cheryl Phillips whipped out a staggering variety of examples in their presentations. Jonathan, in his keynote, talked about “A History of the Detainee Population,” an interactive NYT graphic (partially excerpted below) depicting how Guantanamo prisoners have, over time, slowly been moved back to their countries of origin. I would say that the presentation is effective. I “got” the message. But I also realize that, because the visualization is animated, it’s difficult to see the overall trend over time — to compare one year to the next. There are different ways to tell this story, some of which may be more effective than others for a range of storytelling goals.


Critical blogs such as The Why Axis and Graphic Sociology have arisen to try to fill the gap of understanding what works and what doesn’t. And research on visualization rhetoric has tried to situate narrative data visualization in terms of the rhetorical techniques authors may use to convey their story. Useful as these efforts are in their thick description and critical analysis, and for increasing visual literacy, they don’t go far enough toward building predictive theories of how data-visualization stories are “read” by the audience at large.

Corum, a graphics editor at the NYT, has a descriptive framework to explain his design process and decisions. It describes the tensions between interactivity and story, between oversimplification and overwhelming detail, and between exploration and decoration. Other axes of design include elements such as focus versus depth and the author versus the audience. Author and educator Alberto Cairo exhibits similar sets of design dimensions in his book, “The Functional Art,” which start to trace the features along which data-visualization stories can vary (recreated below).

(Figure: Cairo’s visualization wheel, recreated from “The Functional Art.”)

Such descriptions are a great starting point, but to make further progress on interactive data storytelling we need to know which of the many experiments happening out in the wild are having their desired effect on readers. Design decisions like how and where annotations are placed on a visualization, how the story is structured across the canvas and over time, the graphical style including things like visual embellishments and novelties, as well as data mapping and aggregation can all have consequences on how the audience perceives the story. How does the effect on the audience change when modulating these various design dimensions? A science of data-visualization storytelling should seek to answer that question.

But still the question looms: What does a data story that “works” even mean? While efficiency and parsimony of visual representation may still be important in some contexts, I believe online storytelling demands something else. What effects on the audience should we measure? As data visualization researcher Robert Kosara writes in his forthcoming IEEE Computer article on the subject, “there are no clearly defined metrics or evaluation methods … Developing these will require the definition of, and agreement on, goals: what do we expect stories to achieve, and how do we measure it?”

There are some hints in recent research in information visualization for how we might evaluate visualizations that communicate or present information. We might for instance ask questions about how effectively a message is acquired by the audience: Did they learn it faster or better? Was it memorable, or did they forget it 5 minutes, 5 hours, or 5 weeks later? We might ask whether the data story spurred any personal insights or questions, and to what degree users were “engaged” with the presentation. Engaged here could mean clicks and hovers of the mouse on the visualization, how often widgets and filters for the presentation were touched, or even whether users shared or conversed around the visualization. We might ask if users felt they understood the context of the data and if they felt confident in their interpretation of the story: Did they feel they could make an informed decision on some issue based on the presentation? Credibility being an important attribute for news outlets, we might wonder whether some data story presentations are more trustworthy than others. In some contexts a presentation that is persuasive is the most important factor. Finally, since some of the best stories are those that evoke emotional responses, we might ask how to do the same with data stories.

Measuring some of these factors is as straightforward as instrumenting the presentations themselves to know where users moved their mouse, clicked, or shared. There are a variety of remote usability testing services that can already help with that. Measuring other factors might require writing and attaching survey questions to ask users about their perceptions of the experience. While the best graphics departments do a fair bit of internal iteration and testing it would be interesting to see what they could learn by setting up experiments that varied their designs minutely to see how that affected the audience along any of the dimensions delineated above. More collaboration between industry and academia could accelerate this process of building knowledge of the impact of data stories on the audience.
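Instrumenting a presentation can start as simply as an in-page event log. Here is a minimal sketch of my own; the browser wiring shown in the comment is illustrative of how you would attach it to a visualization:

```javascript
// A minimal interaction logger: record what the audience does with a
// visualization so engagement can be measured afterwards.
function makeLogger() {
  const events = [];
  return {
    // Record one interaction event with a kind and arbitrary detail.
    record(kind, detail) {
      events.push({ kind, detail, t: Date.now() });
    },
    // Tally events by kind, e.g. { click: 12, hover: 140 }.
    counts() {
      return events.reduce((c, e) => {
        c[e.kind] = (c[e.kind] || 0) + 1;
        return c;
      }, {});
    },
    events,
  };
}
// In the browser, you would wire it up roughly like:
//   el.addEventListener('click',
//     e => log.record('click', { x: e.clientX, y: e.clientY }));
// and periodically send log.events to your analytics endpoint.
```

Counts like these map directly onto the engagement measures above: clicks and hovers on the visualization, touches of widgets and filters, and shares.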

I’m not arguing that the creativity and boundary-pushing in data-visualization storytelling should cease. It’s inspiring looking at the range of visual stories that artists and illustrators produce. And sometimes all you really want is an amuse yeux — a little bit of visual amusement. Let’s not get rid of that. But I do think we’re at an inflection point where we know enough of the design dimensions to start building models of how to reliably know what story designs achieve certain goals for different kinds of story, audience, data, and context. We stand only to be able to further amplify the impact of such stories by studying them more systematically.

Review: The Functional Art

I don’t often write reviews of books. But I can’t resist offering some thoughts on The Functional Art, a new book by Alberto Cairo aimed at teaching the basics of information graphics and visualization, mostly because I think it’s fantastic, but also because I think there are a few areas where I’d like to see a future edition expound.

Basically I see this as the new default book for teaching journalists how to do infographics and visualization. If you’re a student of journalism, or just interested in developing better visual communication skills, I think this book has a ton to offer and is very accessible. But what’s really amazing is that the book also offers a lot to people already in the field (e.g. designers or computer scientists) who want to learn more about the journalistic perspective on visual storytelling. There are nuggets of wisdom sprinkled throughout, informed by Cairo’s years of journalism experience. And the diagrams and models for thinking about things like the designer-user relationship, or the dimensions along which graphics vary, add some much-needed structure, forming a framework for thinking about and characterizing information graphics.

Probably the most interesting aspect of the book for someone already doing or studying visualization is the last set of chapters, which detail, through a series of interviews with practitioners, how “the sausage is made.” Exposing process in this way is extremely valuable for learning how these things get put together. The exposition continues on the included DVD, where additional production artifacts, sketches, and mockups form a show-and-tell. And it’s not just about artifacts; the interviews also explore things like how teams are composed to facilitate collaborative production.

One of the things I appreciated most about the book is that, in light of its predominant focus on practice, Cairo fearlessly digs into research results and translates them into practical advice, offering an evidence-based rationale for design decisions. We need more of that kind of thinking, for all sorts of practices.

I have only a few critiques of the book. The first is straightforward: I wish the book were printed in a larger format, because some of the examples shown are screaming for more breathing space. I would also have liked to see the computer science perspective represented a bit more thoroughly; this could, for instance, enhance and add depth to the discussion of interactivity with visualizations. My only other critique of the book is about critique itself. What I mean is that the idea of critique is sprinkled throughout the book, but I’d almost like to see it elevated to the status of having its own chapter. Learning the skills of critique, and the thought process involved, is an essential part of becoming a thoughtful graphics communication practitioner. And it can and should be taught in a way that gives students a systematic approach to analyzing benefits and tradeoffs. Cairo has the raw material to do this in the book, but I wish it were formalized in some way that lent it the attention it deserves. Such a method could even be illustrated using some of the interviewees’ many examples.


Visualization Performance in the Browser

I’ve recently embarked on a new project that involves visualizing and animating some potentially large networks as part of a browser-based information tool. So, I wanted to compare some of the different javascript visualization libraries out there to see how their performance scales. There are tons of options for doing advanced graphics in the browser nowadays including SVG-based solutions like D3, and Raphael, as well as HTML5 canvas solutions like processing.js, the javascript infovis toolkit, sigma.js and fabric.js.

There are certain benefits and trade-offs between SVG and canvas. For instance, canvas performance scales with the size of the image area, whereas SVG performance scales with the complexity and size of the scenegraph. SVG also allows control of elements via the DOM and CSS and has much better support for interactivity (i.e. every visual object can have event listeners). This sketch from D3 creator Mike Bostock shows D3 rendering 500 animated circles in SVG at a resolution of 960×500 at roughly 40 FPS in Chrome, whereas rendering the same via the canvas element was closer to 30 FPS. Knowing how canvas scales, if the image area were smaller than 960×500, canvas performance would increase whereas SVG performance would not change. Of course, your mileage may vary depending on your browser and system; for instance, this post found that processing.js (using canvas) outperformed D3 (using SVG) by 20-1000%.
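Frame-rate figures like these come from a simple frame counter. Here is a minimal sketch of my own of an FPS meter over a sliding one-second window; in the browser you would feed it timestamps from a requestAnimationFrame loop:

```javascript
// A minimal FPS meter: push frame timestamps (in ms) and compute frames
// per second over a sliding window. In the browser you'd call
// meter.tick(t) from a requestAnimationFrame callback.
function makeFpsMeter(windowMs = 1000) {
  const stamps = [];
  return {
    tick(t) {
      stamps.push(t);
      // Drop timestamps that have fallen out of the window.
      while (stamps.length && stamps[0] <= t - windowMs) stamps.shift();
    },
    fps() {
      if (stamps.length < 2) return 0;
      const span = stamps[stamps.length - 1] - stamps[0];
      return span > 0 ? ((stamps.length - 1) * 1000) / span : 0;
    },
  };
}
// Browser wiring (illustrative):
//   const meter = makeFpsMeter();
//   (function loop(t) { meter.tick(t); requestAnimationFrame(loop); })(0);
```

Dropping a meter like this into each rendering backend is all it takes to run the same animation across SVG and canvas and compare the numbers.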

To get a better feel for some of the performance trade-offs (and to take some of the different libraries for a test spin) I developed a quick comparison tool which lets you see performance for D3 (SVG), Sigma.js, Processing.js, and D3 (rendering to canvas) for different graph sizes (500-5,000 nodes, and 1,000-10,000 edges) on an image area of 600×600 pixels. On my system (MBP 2.4GHz, Chrome v.18) D3 (SVG) choked down to about 7 FPS with 1000 nodes and 2000 edges when 20% of nodes’ colors were gradually animated. For the same rig sigma.js could do 19 FPS and processing.js could do 11 FPS. Using D3 but then rendering to canvas did the best though: 23 FPS.

D3 seems like a great option given its rich set of utilities and functions, as well as the option to render directly to canvas if you really need to scale up the number of objects in your scene. Of course, rendering to canvas does forgo some of the nice interactivity and manipulability features of using SVG …


Authoring Data-Driven Documents

Over the last few months I’ve been learning D3 (Data-Driven Documents), a really powerful data visualization library built for javascript. The InfoVis paper gets into the gritty details of how it supports data transformations, immediate evaluation of attributes, and a native SVG representation. These features can be more or less helpful depending on what kind of visualization you’re working on. For instance, transformations don’t really matter if you’re just building static graphs. But being able to inspect the SVG representation of your visualization (and edit it in the console) is really quite helpful and powerful.

But for all the power that D3 affords, is programming really how we should be (want to be?) authoring visualizations?

Here’s something that I recently made with D3. It’s a story about U.S. manufacturing productivity, employment, and automation told across a series of panels programmed using D3.

Now, of course, the exploratory data analysis, storyboarding, and research needed to tell this story were time-consuming. But after all that, using D3 to render the graphs I wanted was substantially more tedious and time-consuming than I would have liked. I think this was because (1) my knowledge of SVG is not fantastic and I’m still learning it, but more importantly (2) D3 supports very low-level operations that make high-level activities for basic data storytelling time-consuming to implement. And yes, D3 does provide a number of helper modules and layouts, but these aren’t documented with clear examples using concrete data that would make it obvious how to utilize them easily. Having support for the library on jsFiddle, together with some very simple examples, would go a long way towards helping noobs (like me!) ramp up.

But, really, where’s the flash-like authoring tool of data visualization? Such a tool could be used to interactively manipulate a D3 visualization and, when you’re done, output HTML + CSS + D3 code to generate your graphs (including animation, transitions, etc.). The tool would also include basic graph templates that could be populated with your data and customized. Basic storytelling functions for highlighting important aspects or comparisons of the data (e.g. through animation, color, juxtaposition, etc.), or using text to annotate and explain the data could also be supported. D3 suffers from a bit of a usability problem right now, and powerful as it is, authoring stories with visualization doesn’t need to be, nor should it be, bound up in programming.

Unpacking Visualization Rhetoric

Note: An edited version of the following also appears on the blog. 

Visualization can be useful for both more exploratory purposes (e.g. generating analyses and insights based on data) as well as more communicative ends (e.g. helping other people understand and be persuaded or informed by the insights that you’ve uncovered). Oftentimes more general visualization techniques are used in the exploratory phase, whereas more specific, tailored, and hand-crafted techniques (like infographics) tend to be preferred for maximal persuasive potential in the communicative phase.

In the middle ground is a class of visualizations termed “narrative visualization” – often used in journalism contexts – which tend to include aspects of both exploratory and communicative visualization. This blending of techniques makes for an interesting domain of study and it’s here where Jessica Hullman and I began investigating how different rhetorical (persuasive) techniques are employed in visualization. We were particularly interested in how different rhetorical techniques can be used to affect the interpretation of a visualization – valuable knowledge for visualization designers hoping to influence and mold the interpretation of their audience. (Here we defer the sticky ethical question of whether someone should use these techniques since in general they can be used for both good and ill).

We carefully analyzed 51 narrative visualizations and constructed a taxonomy of rhetorical techniques we found being used. We observed rhetorical techniques being employed at four different editorial layers of a visualization: data, visual representation, annotations, and interactivity. Choices at any of these layers can have important implications for the ultimate interpretation of a visualization (e.g. the design of available interactivity can direct or divert attention). The five main classes of rhetoric we found being used include: information access (e.g. how data is omitted or aggregated), provenance (e.g. how data sources are explained and how uncertainty is shown), mapping (e.g. the use of visual metaphor), linguistic techniques (e.g. irony or apostrophe), and procedural rhetoric (e.g. how default views anchor interpretation).

The maxim “know thy audience” points to another dimension by which a visualization creator can influence the interpretation of a visualization. While most visualizations concentrate on the denotative level of communication, the most effective visualization communicators also make use of the connotative level of communication to unlock a whole other plane of interpretation. For instance, various cultural codes (e.g. what colors mean), or conventions (e.g. line graphs suggest you’re looking at temporal data even if you’re not) can suggest alternate or preferred interpretations.

While the full explanation of the taxonomy and the use of codes and connotation for communication in visualization is beyond the scope of this blog post, you can see a more complete discussion in a pre-print of our forthcoming InfoVis paper. At the very least, though, I’ll leave you with an example which illustrates some of these concepts.

Take the following recent example from the New York Times where various aspects of the visualization rhetoric framework apply.

The choice of labeling on the dimensions of the chart, “reduce spending” vs. “don’t reduce spending”, leaves out another option: “increase spending”. The choice of the color green for “willing to compromise” connotes a certain value judgement (i.e. “go, or move ahead”) as read from an American perspective. The way individual squares are aggregated to arrive at an overall color is unclear, leading to questions that could be clarified through better use of provenance rhetoric. Moreover, squares cannot be disaggregated or understood as individual data points, making it difficult for users to interpret either the magnitude of the response or the specific data reported in any one square. The graphic is compelling, but applying the visualization rhetoric framework during its design could have suggested ways to make its interpretation clearer.

Ultimately visualization rhetoric is a framework that can be useful for designers hoping to maximize the communicative potential of a visualization. Exploratory visualization platforms (like Tableau) could also be enhanced with an awareness of visualization rhetoric by, for instance, allowing users to make salient use of certain rhetorical techniques when the time comes to share a visualization.

Those particularly interested in this space should consider participating in an upcoming workshop I am co-organizing on “Telling Stories with Data” at InfoVis 2011 in Providence, RI in late October.

Visualization, Data, and Social Media Response

I’ve been looking into how people comment on data and visualization recently and one aspect of that has been studying the Guardian’s Datablog. The Datablog publishes stories of and about data, oftentimes including visualizations such as charts, graphs, or maps. It also has a fairly vibrant commenting community.

So I set out to gather some of my own data. I scraped 803 articles from the Datablog including all of their comments. Of this data I wanted to know if articles which contained embedded data tables or embedded visualizations produced more of a social media response. That is, do people talk more about the article if it contains data and/or visualization? The answer is yes, and the details are below.

While the number of comments could be scraped off of the Datablog site itself, I turned to Mechanical Turk to crowdsource some other elements of metadata collection: (1) the number of tweets per article, (2) whether the article has an embedded data table, and (3) whether the article has an embedded visualization. I did a spot check on 3% of the results from Turk in order to assess the Turkers’ accuracy in collecting these pieces of metadata: it was about 96% overall, which I thought was clean enough to start doing some further analysis.

So next I wanted to look at how the “has visualization” and “has table” features affect (1) tweet volume and (2) comment volume. There are four possibilities: the article has (1) a visualization and a table, (2) a visualization and no table, (3) no visualization and a table, or (4) no visualization and no table. Since neither tweet volume nor comment volume is normally distributed, I log-transformed them to make them approximately normal (an assumption of the statistical tests below). Moreover, there were a few outliers in the data, so anything beyond 3 standard deviations from the mean of the log-transformed variables was not considered.
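As a rough sketch of that preprocessing step (the post doesn’t give exact details; here I assume a log(1 + x) transform so that zero-count articles are handled, and a 3-sigma cutoff on the transformed values):

```python
import numpy as np

def normalize_counts(counts):
    """Log-transform skewed count data (tweets or comments per article)
    and drop outliers beyond 3 standard deviations of the transformed values."""
    logged = np.log1p(np.asarray(counts, dtype=float))  # log(1 + x) handles zeros
    mu, sigma = logged.mean(), logged.std()
    mask = np.abs(logged - mu) <= 3 * sigma
    return logged[mask]
```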

For number of tweets per article:

  1. Articles with both a visualization and a table produced the largest response with an average of 46 tweets per article (N=212, SD=103.24);
  2. Articles with a visualization and no table produced an average of 23.6 tweets per article (N=143, SD=85.05);
  3. Articles with no visualization and a table produced an average of 13.82 tweets per article (N=213, SD=42.7);
  4. And finally articles with neither visualization nor table produced an average of 19.56 tweets per article (N=117, SD=86.19).

I ran an ANOVA with post-hoc Bonferroni tests to see if the differences between these means were statistically significant. Articles with both a visualization and a table (case 1) have a significantly higher number of tweets than cases 3 (p < .01) and 4 (p < .05). Articles with just a visualization and no data table have a higher average number of tweets per article, but the difference was not statistically significant. The take-away is that the combination of a visualization and a data table seems to drive a significantly higher Twitter response.
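For readers who want to replicate this kind of analysis, here’s a hedged sketch using SciPy: a one-way ANOVA followed by pairwise t-tests against a Bonferroni-corrected threshold (the exact test configuration used for the post, e.g. equal-variance assumptions, is my guess, not stated in the original):

```python
from itertools import combinations
from scipy import stats

def anova_with_bonferroni(groups, alpha=0.05):
    """One-way ANOVA across named groups, followed by pairwise t-tests
    judged against a Bonferroni-corrected significance threshold."""
    _, p_overall = stats.f_oneway(*groups.values())
    pairs = list(combinations(groups, 2))
    corrected_alpha = alpha / len(pairs)  # Bonferroni correction
    results = {}
    for a, b in pairs:
        _, p = stats.ttest_ind(groups[a], groups[b])
        results[(a, b)] = (p, p < corrected_alpha)
    return p_overall, results
```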

Results for number of comments per article are similar:

  1. Articles with both a visualization and a table produced the largest response with an average of 17.40 comments per article (SD=24.10);
  2. Articles with a visualization and no table produced an average of 12.58 comments per article (SD=17.08);
  3. Articles with no visualization and a table produced an average of 13.78 comments per article (SD=26.15);
  4. And finally articles with neither visualization nor table produced an average of 11.62 comments per article (SD=17.52)

Again I ran an ANOVA with post-hoc Bonferroni tests to assess statistically significant differences between means. This time there was only one statistically significant difference: articles with both a visualization and a table (case 1) have a higher number of comments than articles with neither a visualization nor a table (case 4). The p value was 0.04. Again, the combination of visualization and data table drove more of an audience response in terms of commenting behavior.

The overall take-away here is that people like to talk about articles (at least in the context of the audience of the Guardian Datablog) when both data and visualization are used to tell the story. Articles which used both had more than twice the number of tweets and about 1.5 times the number of comments versus articles which had neither. If getting people talking about your reporting is your goal, use more data and visualization, which, in retrospect, I probably also should have done for this blog post.

As a final thought I should note there are potential confounds in these results. For one, articles with data in them may stay “green” for longer, thus slowly accreting a larger and larger social media response. One area to look at would be the acceleration of commenting in addition to its volume. Another thing that I had no control over is whether some stories are promoted more than others: if the editors at the Guardian had a bias towards promoting articles with both visualizations and data, then this would drive the audience response numbers up on those stories too. In any case, it’s worthwhile to consider various explanations for these results.

Balance and Challenge in Playable Data

Note: A version of this will appear at the CHI2011 Gamification Workshop.


Work published this year at CHI has introduced the notion of game-y information graphics, which take raw public datasets and create playable visualizations by adding elements of goals, rules, rewards, and mechanics of play. One example is Salubrious Nation, which uses geographically tagged public health data, such as smoking and obesity rates, to create a guessing game. The goal of the game is to accurately guess the magnitude of the given health parameter for a randomly selected target county. A player’s guess can be informed by looking at the map (see screenshot below) for visual clues as a slider is changed, or by using hover-over information on correlated variables (e.g. poverty rate or elderly population rate).
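The game’s actual reward formula isn’t given here, but a guessing game like this needs some way to score a guess by its closeness to the true value. A purely hypothetical sketch, assuming a linear decay from a perfect guess down to zero points at 100% relative error:

```python
def guess_score(guess, actual, max_points=100):
    """Score a guess by relative error: a perfect guess earns max_points,
    decaying linearly to zero at 100% error (hypothetical scoring rule)."""
    if actual == 0:
        return max_points if guess == 0 else 0
    rel_error = abs(guess - actual) / abs(actual)
    return max(0, round(max_points * (1 - rel_error)))
```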

In addition to allowing players to use the map-based graphic to arrive at insights about the data, and redistributing players’ attention to different aspects of the data, such an approach also offers the promise of reducing the amount of effort needed to repurpose that data into new playable experiences. Interested readers can see the paper for all of the details.

In the remainder of this post, however, I would like to expound on and explore the design difficulty associated with creating a challenging and balanced game experience when drawing on raw datasets as input for the construction of a game. Ordinarily when designing games, substantial effort is directed to level design. In fact, many games employ dedicated level designers who work with the game designer in order to provide the right amount of challenge, reward, and balance to the game experience (See Game Design Workshop for more details).

In contrast to such heavily authored experiences, gamified data experiences (whether based on infographics, as in Salubrious Nation, or not) may draw on data that is incomplete, inconsistent, or dynamic. For instance, if a dataset is missing values, those missing values must be handled so that they do not break the game outright, or at least do not substantially reduce the engagement of the experience. Salubrious Nation relies on correlations between health variables and demographic variables such as poverty rate to help players predict the public health variable (e.g. smoking rate). If the data were updated in such a way that a relationship (e.g. a correlation) was diminished or removed, this would affect the playability of the game.

Dealing with data that is updated, refreshed, or otherwise dynamic represents a design challenge. Another example, the California Stimulus Map Game was a game-y infographic created for the Sacramento Bee newspaper website. In this trivia game players had to answer a series of trivia questions about stimulus funds by interacting with a visual map of the state of California. Two weeks after the initial publication the data for the map in the game had already been updated by the government. Not only did this affect the visual representation of the map, but it also impacted the answers to some of the trivia questions, thus forcing the designers to update the game in order to accommodate the new data. One approach to dealing with this issue would be to devise better automatic authoring routines so that trivia answers could be extracted directly from the data without human intervention (e.g. “What is the county with the largest (or smallest) amount of stimulus money”). More research needs to be done to determine the best way for dealing with changes to data which can impact a play experience. Methods developed should be robust to incomplete, inconsistent, or dynamic data and should provide for a playable experience regardless of reasonable changes to such data.
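A minimal sketch of that automatic-authoring idea, using a hypothetical county-to-funds mapping (the names and numbers are made up): because the answers are computed from the data itself, a government data refresh regenerates the answers with no human intervention.

```python
def stimulus_trivia(county_funds):
    """Generate trivia questions whose answers are derived directly from
    the dataset, so updated data automatically yields updated answers."""
    top = max(county_funds, key=county_funds.get)
    bottom = min(county_funds, key=county_funds.get)
    return [
        ("Which county received the most stimulus money?", top),
        ("Which county received the least stimulus money?", bottom),
    ]
```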

A more general issue with raw data is that the challenge or difficulty of the experience produced in the game is hard to control. With one set of data as input a game may be too easy, but with another it could become too hard. For instance, in Salubrious Nation there were 8 levels, each using a different public health parameter. For each of the levels we measured the average accuracy of the guesses produced by the 41 players in our experiment. This is shown in the figure below (with error bars showing the standard deviation of accuracy). As can be seen in the graph, some levels were more difficult than others, even accounting for some potential learning and improvement by players in the later levels. This is in contrast to the typical game design pattern of increasing difficulty across levels. Indeed, based on the collected data it may be advisable to re-order the levels in Salubrious Nation so that easier levels come first and more difficult ones later.
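That re-ordering is a one-liner once per-level accuracy has been measured from play logs. A tiny sketch (the level names and accuracy figures here are illustrative, not the experiment’s actual numbers), treating higher average guess accuracy as easier:

```python
def reorder_levels(accuracy_by_level):
    """Order levels easiest-first, using average guess accuracy measured
    from play logs as a proxy for difficulty (higher accuracy = easier)."""
    return sorted(accuracy_by_level, key=accuracy_by_level.get, reverse=True)
```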

In the absence of carefully authored levels, we can still collect log data from players in order to infer difficulty and challenge. While this is relatively straightforward for a puzzle where there is a correct answer and a simple metric can be used to infer difficulty, there remain open questions for research. How can log data be used to infer other measures of difficulty (frustration even)? How can playable data games rapidly, and perhaps automatically, assess difficulty and re-adjust so that in the short period when a game is first being played it can evolve to provide an appropriately balanced and challenging experience?

These questions apply generally to the gamification of any data-based resource. When gamifying a dynamic, perhaps arbitrarily defined data source, how can we arrive at estimates for the challenge, balance, and playability of those experiences? Properly instrumented, such games could perhaps automatically adapt their levels and difficulty to compensate for differences in the input data. I believe that answering these questions will be essential to being able to more rapidly create compelling gamified data experiences in the future.