Note: A version of the following also appears on the Tow Center blog.
Data is like a freeze-dried version of reality, abstracted sometimes to the point where it can be hard to recognize and understand. It needs some rehydration before it becomes tasty (or even just palatable) storytelling material again — something that visualization can often help with. But to fully breathe life back into your data, you need to crack your knuckles and add a dose of written explanation to your visualizations as well. Text provides that vital bit of context layered over the data that helps the audience come to a valid interpretation of what it really means.
So how can you use text and visualization together to provide that context and layer a story over your data? Some recently published research by myself and collaborators at the University of Michigan offers some insights.
In most journalistic visualization, context is added to data visualization through the use of labels, captions, and other annotations — texts — of various kinds. Indeed, on the Economist Graphic Detail blog, visualizations not only have integrated textual annotations, but an entire 1-2 paragraph introductory article associated with them. In addition to adding an angle and story to the piece, such contextual journalism helps flesh out what the data means and guides the reader’s interpretation towards valid inferences from the data. Textual annotations integrated directly with a visualization can further guide the users’ interactions, emphasizing certain points, prioritizing particular interpretations of data, or pre-empting the user’s curiosity on seeing a salient outlier, aberration, or trend.
To answer the question of how textual annotations function as story contextualizers in online news visualization we analyzed 136 professionally made news visualizations produced by the New York Times and the Guardian between 2000 and July 2012. Of course we found text used for everything from axes labels, author information, sources, and data provenance, to instructions, definitions, and legends, but we were were less interested in studying these kinds of uses than in annotations that were more related to data storytelling.
Based on our analysis we recognized two underlying functions for annotations: (1) observational, and (2) additive. Observational annotations provide context by supporting reflection on a data value or group of values that are depicted in the visualization. These annotations facilitate comparisons and often highlight or emphasize extreme values or other outliers. For interactive graphics they are sometimes revealed when hovering over a visual element.
A basic form of observational messaging is apparent in the following example from the New York Times, showing the population pyramid in the U.S. On the right of the graphic text clearly indicates observations of the total number and fraction of the population expected to be over age 65 by 2015. This is information that can be observed in the graph but is being reinforced through the use of text.
Another example from the Times shows how observational annotations can be used to highlight and label extremes on a graph. In the chart below, the U.S. budget forecast is depicted, and the low point of 2010 is highlighted with a yellow circle together with an annotation. The value and year of that point are already visible in the graph, which is what makes this kind of annotation observational. Consider using observational annotations when you want to underscore something that’s visible in the visualization, but which you really want to make sure the user sees, or when there is an interesting comparison that you would like to draw the user’s attention towards.
On the other hand, additive annotation provides context that is external to the visual representation and not clearly depicted via the data. These are things that are relevant to the topic or to understanding the data, like background or contemporaneous events or actions. It’s up to you to decide which dimensions of who, what, where, when, why, and how are relevant. If you think the viewer needs to be aware of something in order to interpret the data correctly, then an additive annotation might be appropriate.
The following example from The Minneapolis Star Tribune shows changes in home prices across counties in Minnesota with reference to the peak of the housing bubble, a key bit of additive annotation attached to the year 2007. At the same time, the graphic also uses observational annotation (on the right side) by labeling the median home price and percent change since 2007 for the selected county.
Use of these types of annotation is very prevalent; in our study of 136 examples we found 120 (88.2%) used at least one of these forms of annotation. We also looked at the relative use of each, shown in the next figure. Observational annotations were used in just shy of half of the cases, whereas additive were used in 73%.
Another dimension to annotation is what scope of the visualization is being referenced: an individual datum, a group of data, or the entire view (e.g. a caption-like element). We tabulated the prevalence of these annotation anchors and found that single datum annotations are the most frequently used (74%). The relative usage frequencies are shown in the next figure. Your choice of what scope of the visualization to annotate will often depend on the story you want to tell, or on what kinds of visual features are most visually salient, such as outliers, trends, or peaks. For instance, trends that happen over longer time-frames in a line-graph might benefit from a group annotation to indicate how a collection of data points is trending, whereas a peak in a time-series would most obviously benefit from an annotation calling out that specific data point.
The two types of annotation, and three types of annotation anchoring are summarized in the following chart depicting stock price data for Apple. Annotations A1 and A2 show additive annotations attached to the whole view, and to a specific date in the view, whereas O1 and O2 show observational annotations attached to a single datum and a group of data respectively.
As we come to better understand how to tell stories with text and visualization together, new possibilities also open up for how to integrate text computationally or automatically with visualization.
In our research we used the above insights about how annotations are used by professionals to build a system that analyzes a stock time series (together with its trade volume data) to look for salient points and automatically annotate the series with key bits of additive context drawn from a corpus of news articles. By ranking relevant news headlines and then deriving graph annotations we were able to automatically generate contextualized stock charts and create a user-experience where users felt they had a better grasp of the trends and oscillations of the stock.
On one hand we have the fully automated scenario, but in the future, more intelligent graph authoring tools for journalists might also incorporate such automation to suggest possible annotations for a graph, which an editor could then tweak or re-write before publication. So not only can the study of news visualizations help us understand the medium better and communicate more effectively, but it can also enable new forms of computational journalism to emerge. For all the details please see our research paper, “Contextifier: Automatic Generation of Annotated Stock Visualizations.”
Visualization, Data, and Social Media Response
I’ve been looking into how people comment on data and visualization recently and one aspect of that has been studying the Guardian’s Datablog. The Datablog publishes stories of and about data, oftentimes including visualizations such as charts, graphs, or maps. It also has a fairly vibrant commenting community.
So I set out to gather some of my own data. I scraped 803 articles from the Datablog including all of their comments. Of this data I wanted to know if articles which contained embedded data tables or embedded visualizations produced more of a social media response. That is, do people talk more about the article if it contains data and/or visualization? The answer is yes, and the details are below.
While the number of comments could be scraped off of the Datablog site itself I turned to Mechanical Turk to crowdsource some other elements of metadata collection: (1) the number of tweets per article, (2) whether the article has an embedded data table, and (3) whether the article has an embedded visualization. I did a spot check on 3% of the results from Turk in order to assess the Turkers’ accuracy on collecting these other pieces of metadata: it was about 96% overall, which I thought was clean enough to start doing some further analysis.
So next I wanted to look at how the “has visualization” and “has table” features affect (1) tweet volume, and (2) comment volume. There are four possibilities: the article has (1) a visualization and a table, (2) a visualization and no table, (3) no visualization and a table, (4) no visualization and no table. Since both the tweet volume and comment volume are not normally distributed variables I log transformed them to get them to be normal (this is an assumption of the following statistical tests). Moreover, there were a few outliers in the data and so anything beyond 3 standard deviations from the mean of the log transformed variables was not considered.
For number of tweets per article:
I ran an ANOVA with post-hoc Bonferroni tests to see if these means were significant. Articles with both a visualization and a table (case 1) have a significantly higher number of tweets than cases 3 (p < .01) and 4 (p < .05). Articles with just the visualization and no data table have a higher number of average tweets per article, but this was not statistically significant. The take away is that it seems that the combination of a visualization and a data table drives a significantly higher twitter response.
Results for number of comments per article are similar:
Again with the ANOVA and post-hoc Bonferroni tests to assess statistically significant differences between means. This time there was only one statistically significant difference: Articles with both a visualization and a table (case 1) have a higher number of comments than articles with neither a visualization nor a table (case 4). The p value was 0.04. Again, the combination of visualization and data table drove more of an audience response in terms of commenting behavior.
The overall take-away here is that people like to talk about articles (at least in the context of the audience of the Guardian Datablog) when both data and visualization are used to tell the story. Articles which used both had more than twice the number of tweets and about 1.5 times the number of comments versus articles which had neither. If getting people talking about your reporting is your goal, use more data and visualization, which, in retrospect, I probably also should have done for this blog post.
As a final thought I should note there are potential confounds in these results. For one, articles with data in them may stay “green” for longer thus slowly accreting a larger and larger social media response. One area to look at would be the acceleration of commenting in addition to volume. Another thing that I had no control over is whether some stories are promoted more than others: if the editors at the Guardian had a bias to promote articles with both visualizations and data then this would drive the audience response numbers up on those stories too. In other words, it’s still interesting and worthwhile to consider various explanations for these results.