Category Archives: data

Storytelling with Data Visualization: Context is King

Note: A version of the following also appears on the Tow Center blog.

Data is like a freeze-dried version of reality, abstracted sometimes to the point where it can be hard to recognize and understand. It needs some rehydration before it becomes tasty (or even just palatable) storytelling material again — something that visualization can often help with. But to fully breathe life back into your data, you need to crack your knuckles and add a dose of written explanation to your visualizations as well. Text provides that vital bit of context layered over the data that helps the audience come to a valid interpretation of what it really means.

So how can you use text and visualization together to provide that context and layer a story over your data? Some research recently published by my collaborators at the University of Michigan and me offers some insights.

In most journalistic visualization, context is added to data visualization through labels, captions, and other textual annotations of various kinds. Indeed, on the Economist Graphic Detail blog, visualizations not only have integrated textual annotations but also an entire one- to two-paragraph introductory article associated with them. In addition to adding an angle and story to the piece, such contextual journalism helps flesh out what the data means and guides the reader’s interpretation towards valid inferences from the data. Textual annotations integrated directly with a visualization can further guide the user’s interactions, emphasizing certain points, prioritizing particular interpretations of data, or pre-empting the user’s curiosity on seeing a salient outlier, aberration, or trend.

To answer the question of how textual annotations function as story contextualizers in online news visualization, we analyzed 136 professionally made news visualizations produced by the New York Times and the Guardian between 2000 and July 2012. Of course we found text used for everything from axis labels, author information, sources, and data provenance, to instructions, definitions, and legends, but we were less interested in studying these kinds of uses than in annotations that were more related to data storytelling.

Based on our analysis we recognized two underlying functions for annotations: (1) observational, and (2) additive. Observational annotations provide context by supporting reflection on a data value or group of values that are depicted in the visualization. These annotations facilitate comparisons and often highlight or emphasize extreme values or other outliers. For interactive graphics they are sometimes revealed when hovering over a visual element.

A basic form of observational messaging is apparent in the following example from the New York Times, showing the population pyramid of the U.S. On the right of the graphic, text clearly indicates observations of the total number and fraction of the population expected to be over age 65 by 2015. This is information that can be observed in the graph but is being reinforced through the use of text.

Another example from the Times shows how observational annotations can be used to highlight and label extremes on a graph. In the chart below, the U.S. budget forecast is depicted, and the low point of 2010 is highlighted with a yellow circle together with an annotation. The value and year of that point are already visible in the graph, which is what makes this kind of annotation observational. Consider using observational annotations when you want to underscore something that’s visible in the visualization, but which you really want to make sure the user sees, or when there is an interesting comparison that you would like to draw the user’s attention towards.

On the other hand, additive annotation provides context that is external to the visual representation and not clearly depicted via the data. These are things that are relevant to the topic or to understanding the data, like background or contemporaneous events or actions. It’s up to you to decide which dimensions of who, what, where, when, why, and how are relevant. If you think the viewer needs to be aware of something in order to interpret the data correctly, then an additive annotation might be appropriate.

The following example from The Minneapolis Star Tribune shows changes in home prices across counties in Minnesota with reference to the peak of the housing bubble, a key bit of additive annotation attached to the year 2007. At the same time, the graphic also uses observational annotation (on the right side) by labeling the median home price and percent change since 2007 for the selected county.

Use of these types of annotation is very prevalent; in our study of 136 examples we found that 120 (88.2%) used at least one of these forms of annotation. We also looked at the relative use of each, shown in the next figure. Observational annotations were used in just shy of half of the cases, whereas additive annotations were used in 73%.

Another dimension to annotation is what scope of the visualization is being referenced: an individual datum, a group of data, or the entire view (e.g. a caption-like element). We tabulated the prevalence of these annotation anchors and found that single datum annotations are the most frequently used (74%). The relative usage frequencies are shown in the next figure. Your choice of what scope of the visualization to annotate will often depend on the story you want to tell, or on what kinds of visual features are most visually salient, such as outliers, trends, or peaks. For instance, trends that happen over longer time-frames in a line-graph might benefit from a group annotation to indicate how a collection of data points is trending, whereas a peak in a time-series would most obviously benefit from an annotation calling out that specific data point.

The two types of annotation, and three types of annotation anchoring are summarized in the following chart depicting stock price data for Apple. Annotations A1 and A2 show additive annotations attached to the whole view, and to a specific date in the view, whereas O1 and O2 show observational annotations attached to a single datum and a group of data respectively.

As we come to better understand how to tell stories with text and visualization together, new possibilities also open up for how to integrate text computationally or automatically with visualization.

In our research we used the above insights about how annotations are used by professionals to build a system that analyzes a stock time series (together with its trade volume data) to look for salient points and automatically annotate the series with key bits of additive context drawn from a corpus of news articles. By ranking relevant news headlines and then deriving graph annotations we were able to automatically generate contextualized stock charts and create a user-experience where users felt they had a better grasp of the trends and oscillations of the stock.
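The core of that pipeline is easy to sketch. Here’s a minimal, hypothetical illustration (not Contextifier’s actual code) in Python: it ranks days by the size of their price move and attaches the closest-dated headline from a small news corpus as an additive annotation anchored to that single datum. The toy data and field names are made up.

```python
from datetime import date

# Toy inputs; the real system works over full price, volume, and article corpora.
prices = {date(2013, 1, 2): 550.0, date(2013, 1, 3): 542.0,
          date(2013, 1, 4): 527.0, date(2013, 1, 7): 523.9}
headlines = [(date(2013, 1, 4), "Supplier report stokes demand worries")]

def salient_days(series, top_n=1):
    """Rank days by absolute percent change from the previous close."""
    days = sorted(series)
    changes = []
    for prev, cur in zip(days, days[1:]):
        pct = (series[cur] - series[prev]) / series[prev] * 100
        changes.append((abs(pct), cur, pct))
    return sorted(changes, reverse=True)[:top_n]

def annotate(series, news):
    """Attach the closest-dated headline to each salient point
    (an additive annotation anchored to a single datum)."""
    annotations = []
    for _, day, pct in salient_days(series):
        headline = min(news, key=lambda item: abs((item[0] - day).days))[1]
        annotations.append({"type": "additive", "anchor": "datum",
                            "date": day, "change_pct": round(pct, 1),
                            "text": headline})
    return annotations

print(annotate(prices, headlines))
```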

On one hand we have the fully automated scenario, but in the future, more intelligent graph authoring tools for journalists might also incorporate such automation to suggest possible annotations for a graph, which an editor could then tweak or re-write before publication. So not only can the study of news visualizations help us understand the medium better and communicate more effectively, but it can also enable new forms of computational journalism to emerge. For all the details please see our research paper, “Contextifier: Automatic Generation of Annotated Stock Visualizations.”

Sex, Violence, and Autocomplete Algorithms: Methods and Context

In my Slate article “Sex, Violence, and Autocomplete Algorithms,” I use a reverse-engineering methodology to better understand what kinds of queries get blocked by Google and Bing’s autocomplete algorithms. In this post I want to pull back the curtains a bit to talk about my process as well as add some context to the data that I gathered for the project.

To measure what kinds of sex terms get blocked I first found a set of sex-related words that are part of a larger dictionary called LIWC (Linguistic Inquiry and Word Count) which includes painstakingly created lists of words for many different concepts like perception, causality, and sex among others. It doesn’t include a lot of slang though, so for that I augmented my sex-word list with some more gems pulled from the Urban Dictionary, resulting in a list of 110 words. The queries I tested included the word by itself, as well as in the phrase “child X” in an attempt to identify suggestions related to child pornography.

For the violence-related words that I tested, I used a set of 348 words from the Random House “violent actions” list, which includes everything from the relatively innocuous “bop” to the more ruthless “strangle.” To construct queries I put the violent words into two phrases: “How to X” and “How can I X.”

Obviously there are many other words and permutations of query templates that I might have used. One of the challenges with this type of project is how to sample data and where to draw the line on what to collect.

With lists of words in hand, the next step was to prod the APIs of Google and Bing to see what kinds of autocompletions were returned (or not) when queried. The Google API for autocomplete is undocumented, though I found and used some open-source code that had already reverse-engineered it. The Bing API is similarly undocumented, but a developer thread on the Bing blog mentions how to access it. I combined each of my query words with the templates and, using these APIs, recorded what suggestions were returned.
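For the curious, a collection script along these lines can be quite short. The sketch below assumes the commonly cited, undocumented suggestqueries.google.com endpoint and its [query, [suggestions…]] JSON response shape; since none of this is an official API, treat the URL and format as assumptions that could change without notice.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed undocumented endpoint; not an official or stable API.
GOOGLE_SUGGEST = "https://suggestqueries.google.com/complete/search?client=firefox&q="

def google_suggestions(query):
    """Return the list of autocomplete suggestions for a query (empty if blocked)."""
    url = GOOGLE_SUGGEST + urllib.parse.quote(query)
    with urllib.request.urlopen(url) as resp:
        # Assumed response shape: [query, [suggestion, suggestion, ...]]
        payload = json.loads(resp.read().decode("utf-8", errors="replace"))
    return payload[1]

def build_queries(words, templates):
    """Expand each word into the query templates used in the study."""
    return [template.format(word) for word in words for template in templates]

if __name__ == "__main__":
    violence_words = ["bop", "strangle"]  # stand-ins for the full 348-word list
    queries = build_queries(violence_words, ["how to {}", "how can i {}"])
    results = {}
    for q in queries:
        results[q] = google_suggestions(q)
        time.sleep(1)  # be polite; don't hammer the endpoint
    print(json.dumps(results, indent=2))
```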

An interesting nuance to the data I collected is that both APIs return more responses than actually show up in either user interface. The Google API returns 20 results, but only shows 4 or 10 in the UI depending on how preferences are set. The Bing API returns 12 results but only shows 8 in the UI. Data returned from the API that never appears in the UI is less interesting since users will never encounter it in their daily usage. But, I should mention that it’s not entirely clear what happens with the API results that aren’t shown. It’s possible some of them could be shown during the personalization step of the algorithm (which I didn’t test).

The queries were run and data collected on July 2nd, 2013, which is important to mention since these services can change without notice. Indeed, Google claims to change its search algorithm hundreds of times per year. Autocomplete suggestions can also vary by geography or according to who’s logged in. Since the APIs were accessed programmatically, and no one was logged in, none of the results collected reflect any personalization that the algorithm performs. However, the results may still reflect geography since figuring out where your computer is doesn’t require a log in. The server I used to collect data is located in Delaware. It’s unclear how Google’s “safe search” settings might have affected the data I collected via their API. The Bing spokesperson I was in touch with wrote, “Autosuggest adheres to a ‘strict’ filter policy for all suggestions and therefore applies filtering to all search suggestions, regardless of the SafeSearch settings for the search results page.”

In the spirit of full transparency, here is a .csv with all of the queries and responses that I collected.

The Rhetoric of Data

Note: A version of the following also appears on the Tow Center blog.

In the 1830s, abolitionists discovered the rhetorical potential of re-conceptualizing southern newspaper advertisements as data. They “took an undifferentiated pile of ads for runaway slaves, wherein dates and places were of primary importance … and transformed them into data about the routine and accepted torture of enslaved people,” writes Ellen Gruber Garvey in the book Raw Data is an Oxymoron. By creating topical dossiers of ads, abolitionists catalogued the horrors of slavery and made them accessible for writing abolitionist speeches and novels. The South’s own media had been re-contextualized into a persuasive weapon against itself, a rhetorical tool to bolster the abolitionists’ arguments.

The Latin root of “data” means “something given,” and though we’ve largely forgotten that original sense, it’s helpful to think about data not as facts per se, but as “givens” that can be used to construct a variety of different arguments and conclusions; they act as a rhetorical basis, a premise. Data does not intrinsically imply truth. Yes, we can find truth in data, through a process of honest inference. But we can also find and argue multiple truths or even outright falsehoods from data.

Take for instance the New York Times interactive “One Report, Diverging Perspectives,” which wittingly highlights this issue. Shown below, the piece visualizes jobs and unemployment data from two perspectives, emphasizing the differences in how a Democrat or a Republican might see and interpret the statistics. A rising tide of “data PR,” often manifesting as slick and pointed infographics, won’t be so upfront about the perspectives being argued, though. Advocacy organizations can now collect their own data, or just develop their own arguments from existing data, to support their cause. What should you be looking out for as a journalist when assessing a piece of data PR? And how can you improve your own data journalism by ensuring the argument you develop is a sound one?

[Figure: “One Report, Diverging Perspectives” (New York Times)]

Contextual journalism—adding interpretation or explanation to a story—can and should be applied to data as much as to other forms of reporting. It’s important because the audience may need to know the context of a dataset in order to fully understand and evaluate the larger story in perspective. For instance, context might include explaining how the data was collected, defined, and aggregated, and what human decision processes contributed to its creation. Increasingly news outlets are providing sidebars or blog posts that fully describe the methodology and context of the data they use in a data-driven story. That way the context doesn’t get in the way of the main narrative but can still be accessed by the inquisitive reader.

In your process it can be useful to ask a series of contextualizing questions about a dataset, whether just critiquing the data, or producing your own story.

Who produced the data and what was their intent? Did it come from a reputable source, like a government or inter-governmental agency such as the UN, or was it produced by a third party corporation with an uncertain source of funding? Consider the possible political or advocacy motives of a data provider as you make inferences from that data, and do some reporting if those motives are unclear.

When was the data collected? Sometimes there can be temporal drift in what data means, how it’s measured, or how it should be interpreted. Is the age of your data relevant to your interpretation? For example, in 2010 the Bureau of Labor Statistics changed the definition of long-term unemployment, which can make it important to recognize that shift when comparing data from before and after the change.

Most importantly it’s necessary to ask what is measured in the data, how was it sampled, and what is ultimately depicted? Are data measurements defined accurately and in a way that they can be consistently measured? How was the data sampled from the world? Is the dataset comprehensive or is it missing pieces? If the data wasn’t randomly sampled how might that reflect a bias in your interpretation? Or have other errors been introduced into the data, for instance through typos or mistaken OCR technology? Is there uncertainty in the data that should be communicated to the reader? Has the data been cropped or filtered in a way that you have lost a potentially important piece of context that would change its interpretation? And what about aggregation or transformation? If a dataset is offered to you with only averages or medians (i.e. aggregations) you’re necessarily missing information about how the data might be distributed, or about outliers that might make interesting stories. For data that’s been transformed through some algorithmic process, such as classification, it can be helpful to know the error rates of that transformation as this can lead to additional uncertainty in the data.
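Before getting to a fuller example, the point about averages and medians is easy to see with a toy illustration: the two made-up sets of values below share the same mean, but an averages-only release would hide how differently they are distributed.

```python
from statistics import mean, median, pstdev

# Two hypothetical sets of values with identical means
steady = [48, 49, 50, 51, 52]
skewed = [10, 12, 14, 16, 198]  # one extreme outlier props up the average

for name, values in [("steady", steady), ("skewed", skewed)]:
    print(name, "mean:", mean(values), "median:", median(values),
          "std dev:", round(pstdev(values), 1))
# Both means are 50, but the medians and spreads tell very different stories.
```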

Let’s consider an example that illustrates the importance of measurement definition and aggregation. The Economist graphic below shows the historic and forecast vehicle sales for different geographies. The story the graph tells is pretty clear: Sales in China are rocketing up while they’re declining or stagnant in North America and Europe. But look more closely. The data for Western Europe and North America is defined as an aggregation of light vehicle sales, according to the note in the lower-right corner. How would the story change if the North American data included truck, SUV, and minivan sales? The story you get from these kinds of data graphics can depend entirely on what’s aggregated (or not aggregated) together in the measure. Aggregations can serve as a tool of obfuscation, whether intentional or not.

[Figure: Historic and forecast vehicle sales by region (The Economist)]

It’s important to recognize and remember that data does not equal truth. It’s rhetorical by definition and can be used for truth finding or truth hiding. Being vigilant in how you develop arguments from data and showing the context that leads to the interpretation you make can only help raise the credibility of your data-driven story.


Data on the Growth of CitiBike

On May 27th New York City launched its city-wide bike sharing program, CitiBike. I tried it out last weekend; it was great, aside from a few glitches checking out and checking in the bikes. It made me curious about the launch of the program and how it’s growing, especially since the agita between bikers and drivers is becoming quite palpable. Luckily, the folks over at the CitiBike blog have been posting daily stats: the number of rides each day, the average duration of rides, and even the most popular stations for starting and stopping a ride. If you’re interested in hacking more on the data there’s even a meetup happening next week.

Below is my simple line chart of the total number of daily riders (they measure that as of 5pm each day). Here’s the data. You might look at the graph and wonder, “What happened on June 7th?” That was the monsoon we had. Yeah, turns out bikers don’t like rain.

[Figure: CitiBike total daily rides]
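If you’d like to redraw the chart yourself, a minimal plotting sketch is below; the file name and column headers are stand-ins for however you save the blog’s daily stats, not an actual published format.

```python
import csv
from datetime import datetime

import matplotlib.pyplot as plt

# Assumed CSV layout: date,total_rides (one row per day, rides counted as of 5pm)
dates, rides = [], []
with open("citibike_daily.csv", newline="") as f:
    for row in csv.DictReader(f):
        dates.append(datetime.strptime(row["date"], "%Y-%m-%d"))
        rides.append(int(row["total_rides"]))

plt.plot(dates, rides)
plt.title("CitiBike: total daily rides (as of 5pm)")
plt.ylabel("Rides")
plt.gcf().autofmt_xdate()  # tilt the date labels so they don't overlap
plt.show()
```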

51% Foreign: Algorithms and the Surveillance State

In New York City there’s a “geek squad” of analysts that gathers all kinds of data, from restaurant inspection grades and utility usage to neighborhood complaints, and uses it to predict how to improve the city. The idea behind the team is that with more and more data available about how the city is running—even if it’s messy, unstructured, and massive—the government can optimize its resources by keeping an eye out for what needs its attention most. It’s really about city surveillance, and of course acting on the intelligence produced by that surveillance.

One story about the success of the geek squad comes to us from Viktor Mayer-Schonberger and Kenneth Cukier in their book “Big Data”. They describe the issue of illegal real-estate conversions, which involves sub-dividing an apartment into smaller and smaller units so that it can accommodate many more people than it should. With the density of people in such close quarters, illegally converted units are more prone to accidents, like fire. So it’s in the city’s—and the public’s—best interest to make sure apartment buildings aren’t sub-divided like that. Unfortunately there aren’t very many inspectors to do the job. But by collecting and analyzing data about each apartment building the geek squad can predict which units are more likely to pose a danger, and thus determine where the limited number of inspectors should focus their attention. Seventy percent of inspections now lead to eviction orders from unsafe dwellings, up from 13% without using all that data—a clear improvement in helping inspectors focus on the most troubling cases.

Consider a different, albeit hypothetical, use of big data surveillance in society: detecting drunk drivers. Since there are already a variety of road cameras and other traffic sensors available on our roads, it’s not implausible to think that all of this data could feed into an algorithm that says, with some confidence, that a car is exhibiting signs of erratic, possibly drunk driving. Let’s say, similar to the fire-risk inspections, that this method also increases the efficiency of the police department in getting drunk drivers off the road—a win for public safety.

But there’s a different framing at work here. In the fire-risk inspections the city is targeting buildings, whereas in the drunk driving example it’s really targeting the drivers themselves. This shift in framing—targeting the individual as opposed to the inanimate—crosses the line into invasive, even creepy, civil surveillance.

So given the degree to which the recently exposed government surveillance programs target individual communications, it’s not as surprising that, according to Gallup, more Americans disapprove (53%) than approve (37%) of the federal government’s program to “compile telephone call logs and Internet communications.” This is despite the fact that such surveillance could in a very real way contribute to public safety, just as with the fire-risk or drunk driving inspections.

At the heart of the public’s psychological response is the fear and risk of surveillance uncovering personal communication, of violating our privacy. But this risk is not a foregone conclusion. There’s some uncertainty and probability around it, which makes it that much harder to understand the real risk. In the Prism program, the government surveillance program that targets internet communications like email, chats, and file transfers, the Washington Post describes how analysts use the system to “produce at least 51 percent confidence in a target’s ‘foreignness’”. This test of foreignness is tied to the idea that it’s okay (legally) to spy on foreign communications, but that it would breach FISA (the Foreign Intelligence Surveillance Act), as well as 4th amendment rights for the government to do the same to American citizens.

Platforms used by Prism, such as Google and Facebook, have denied that they give the government direct access to their servers. The New York Times reported that the system in place is more like having a locked mailbox where the platform can deposit specific data requested pursuant to a court order from the Foreign Intelligence Surveillance Court. But even if such requests are legally targeted at foreigners and have been faithfully vetted by the court, there’s still a chance that ancillary data on American citizens will be swept up by the government. “To collect on a suspected spy or foreign terrorist means, at minimum, that everyone in the suspect’s inbox or outbox is swept in,” as the Washington Post writes. And typically data is collected not just of direct contacts, but also contacts of contacts. This all means that there’s a greater risk that the government is indeed collecting data on many Americans’ personal communications.

Algorithms, and a bit of transparency on those algorithms, could go a long way to mitigating the uneasiness over domestic surveillance of personal communications that American citizens may be feeling. The basic idea is this: when collecting information on a legally identified foreign target, for every possible contact that might be swept up with the target’s data, an automated classification algorithm can be used to determine whether that contact is more likely to be “foreign” or “American”. Although the algorithm would have access to all the data, it would only output one bit of metadata for each contact: is the contact foreign or not? Only if the contact was deemed highly likely to be foreign would the details of that data be passed on to the NSA. In other words, the algorithm would automatically read your personal communications and then signal whether or not it was legal to report your data to intelligence agencies, much in the same way that Google’s algorithms monitor your email contents to determine which ads to show you without making those emails available for people at Google to read.

The FISA court implements a “minimization procedure” in order to curtail incidental data collection from people not covered in the order, though the exact process remains classified. Marc Ambinder suggests that “the NSA automates the minimization procedures as much as it can” using a continuously updated score that assesses the likelihood that a contact is foreign. Indeed, it seems at least plausible that the algorithm I suggest above could already be a part of the actual minimization procedure used by the NSA.

The minimization process reduces the creepiness of unfettered government access to personal communications, but at the same time we still need to know how often such a procedure makes mistakes. In general there are two kinds of mistakes that such an algorithm could make, often referred to as false positives and false negatives. A false negative in this scenario would indicate that a foreign contact was categorized by the algorithm as an American. Obviously the NSA would like to avoid this type of mistake since it would lose the opportunity to snoop on a foreign terrorist. The other type of mistake, false positive, corresponds to the algorithm designating a contact as foreign even though in reality it’s American. The public would want to avoid this type of mistake because it’s an invasion of privacy and a violation of the 4th amendment. Both of these types of errors are shown in the conceptual diagram below, with the foreign target marked with an “x” at the center and ancillary targets shown as connected circles (orange is foreign, blue is American citizen).

[Figure: Conceptual diagram of false positives and false negatives, with the foreign target marked “x” at the center and ancillary contacts as connected circles (orange for foreign, blue for American citizen)]

It would be a shame to disregard such a potentially valuable tool simply because it might make mistakes from time to time. To make such a scheme work we first need to accept that the algorithm will indeed make mistakes. Luckily, such an algorithm can be tuned to make more or less of either of those mistakes. As false positives are tuned down false negatives will often increase, and vice versa. The advantage for the public would be that it could have a real debate with the government about what magnitude of mistakes is reasonable. How many Americans being labeled as foreigners and thus subject to unwarranted search and seizure is acceptable to us? None? Some? And what’s the trade-off in terms of how many would-be terrorists might slip through if we tuned the false positives down?
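To make that trade-off concrete, here’s a small, entirely synthetic sketch (it reflects nothing about the actual NSA procedure): the scores stand in for a classifier’s confidence that a contact is foreign, and sweeping the decision threshold trades false positives (Americans flagged as foreign) against false negatives (foreign contacts labeled American).

```python
import random

random.seed(0)

# Synthetic "foreignness" scores: higher means the classifier is more confident
# the contact is foreign. Labels: True = actually foreign, False = American.
contacts = [(random.gauss(0.7, 0.15), True) for _ in range(500)] + \
           [(random.gauss(0.3, 0.15), False) for _ in range(500)]

def error_counts(threshold):
    """Classify as foreign when score >= threshold; count both kinds of mistakes."""
    false_pos = sum(1 for score, foreign in contacts if score >= threshold and not foreign)
    false_neg = sum(1 for score, foreign in contacts if score < threshold and foreign)
    return false_pos, false_neg

for t in (0.3, 0.5, 0.7):
    fp, fn = error_counts(t)
    print(f"threshold={t}: Americans wrongly flagged={fp}, foreign targets missed={fn}")
# Raising the threshold protects Americans (fewer false positives) but lets
# more foreign targets slip through (more false negatives), and vice versa.
```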

To begin a debate like this the government just needs to tell us how many of each type of mistake its minimization procedure makes; just two numbers. In this case, minimal transparency of an algorithm could allow for a robust public debate without betraying any particular details or secrets about individuals. In other words, we don’t particularly need to know the gory details of how such an algorithm works. We simply need to know where the government has placed the fulcrum in the tradeoff between these different types of errors. And by implementing smartly transparent surveillance maybe we can even move more towards the world of the geek squad, where big data is still ballyhooed for furthering public safety.

Storytelling with Data: What Are the Impacts on the Audience?

Storytelling with data visualization is still very much in its “Wild West” phase, with journalism outlets blazing new paths in exploring the burgeoning craft of integrating the testimony of data together with compelling narrative. Leaders such as The New York Times create impressive data-driven presentations like 512 Paths to the White House (seen above) that weave complex information into a palatable presentation. But as I look out at the kinds of meetings where data visualizers converge, like Eyeo, Tapestry, OpenVis, and the infographics summit Malofiej, I realize there’s a whole lot of inspiration out there, and some damn fine examples of great work, but I still find it hard to get a sense of direction — which way is West, which way to the promised land?

And it occurred to me: We need a science of data-visualization storytelling. We need some direction. We need to know what makes a data story “work”. And what does a data story that “works” even mean?

Examples abound, and while we have theories for color use, visual salience and perception, and graph design that suggest how to depict data efficiently, we still don’t know, with any particular scientific rigor, which are better stories. At the Tapestry conference, which I attended, journalists such as Jonathan Corum, Hannah Fairfield, and Cheryl Phillips whipped out a staggering variety of examples in their presentations. Jonathan, in his keynote, talked about “A History of the Detainee Population,” an interactive NYT graphic (partially excerpted below) depicting how Guantanamo prisoners have, over time, slowly been moved back to their country of origin. I would say that the presentation is effective. I “got” the message. But I also realize that, because the visualization is animated, it’s difficult to see the overall trend over time — to compare one year to the next. There are different ways to tell this story, some of which may be more effective than others for a range of storytelling goals.

[Figure: Excerpt from “A History of the Detainee Population” (New York Times)]

Critical blogs such as The Why Axis and Graphic Sociology have arisen to try to fill the gap of understanding what works and what doesn’t. And research on visualization rhetoric has tried to situate narrative data visualization in terms of the rhetorical techniques authors may use to convey their story. Useful as these efforts are in their thick description and critical analysis, and for increasing visual literacy, they don’t go far enough toward building predictive theories of how data-visualization stories are “read” by the audience at large.

Corum, a graphics editor at the NYT, has a descriptive framework to explain his design process and decisions. It describes the tensions between interactivity and story, between oversimplification and overwhelming detail, and between exploration and decoration. Other axes of design include elements such as focus versus depth and the author versus the audience. Author and educator Alberto Cairo presents a similar set of design dimensions in his book “The Functional Art,” which starts to trace the features along which data-visualization stories can vary (recreated below).

[Figure: Visualization design dimensions, recreated from Alberto Cairo’s “The Functional Art”]

Such descriptions are a great starting point, but to make further progress on interactive data storytelling we need to know which of the many experiments happening out in the wild are having their desired effect on readers. Design decisions like how and where annotations are placed on a visualization, how the story is structured across the canvas and over time, the graphical style including things like visual embellishments and novelties, as well as data mapping and aggregation can all have consequences on how the audience perceives the story. How does the effect on the audience change when modulating these various design dimensions? A science of data-visualization storytelling should seek to answer that question.

But still the question looms: What does a data story that “works” even mean? While efficiency and parsimony of visual representation may still be important in some contexts, I believe online storytelling demands something else. What effects on the audience should we measure? As data visualization researcher Robert Kosara writes in his forthcoming IEEE Computer article on the subject, “there are no clearly defined metrics or evaluation methods … Developing these will require the definition of, and agreement on, goals: what do we expect stories to achieve, and how do we measure it?”

There are some hints in recent research in information visualization for how we might evaluate visualizations that communicate or present information. We might for instance ask questions about how effectively a message is acquired by the audience: Did they learn it faster or better? Was it memorable, or did they forget it 5 minutes, 5 hours, or 5 weeks later? We might ask whether the data story spurred any personal insights or questions, and to what degree users were “engaged” with the presentation. Engaged here could mean clicks and hovers of the mouse on the visualization, how often widgets and filters for the presentation were touched, or even whether users shared or conversed around the visualization. We might ask if users felt they understood the context of the data and if they felt confident in their interpretation of the story: Did they feel they could make an informed decision on some issue based on the presentation? Credibility being an important attribute for news outlets, we might wonder whether some data story presentations are more trustworthy than others. In some contexts a presentation that is persuasive is the most important factor. Finally, since some of the best stories are those that evoke emotional responses, we might ask how to do the same with data stories.

Measuring some of these factors is as straightforward as instrumenting the presentations themselves to know where users moved their mouse, clicked, or shared. There are a variety of remote usability testing services that can already help with that. Measuring other factors might require writing and attaching survey questions to ask users about their perceptions of the experience. While the best graphics departments do a fair bit of internal iteration and testing it would be interesting to see what they could learn by setting up experiments that varied their designs minutely to see how that affected the audience along any of the dimensions delineated above. More collaboration between industry and academia could accelerate this process of building knowledge of the impact of data stories on the audience.

I’m not arguing that the creativity and boundary-pushing in data-visualization storytelling should cease. It’s inspiring looking at the range of visual stories that artists and illustrators produce. And sometimes all you really want is an amuse yeux — a little bit of visual amusement. Let’s not get rid of that. But I do think we’re at an inflection point where we know enough of the design dimensions to start building models of how to reliably know what story designs achieve certain goals for different kinds of story, audience, data, and context. We stand only to be able to further amplify the impact of such stories by studying them more systematically.

How does newspaper circulation relate to Twitter following?

I was recently looking at circulation numbers from the Audit Bureau of Circulation for the top twenty-five newspapers in the U.S. and wondered: How does circulation relate to Twitter following? So for each newspaper I found the Twitter account and recorded the number of followers (link to data). The graph below shows the ratio of Twitter followers to total circulation; you could say it’s some kind of measure of how well the newspaper has converted its circulation into a social media following.
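The computation behind the graph is just a ratio for each paper. Here’s a sketch; the USA Today figures match the numbers discussed below, while the other row is invented purely for comparison.

```python
# Only two illustrative rows; the full dataset covers the top 25 papers.
papers = [
    {"name": "USA Today", "circulation": 1_700_000, "twitter_followers": 514_000},
    {"name": "Hypothetical Paper", "circulation": 500_000, "twitter_followers": 1_500_000},
]

for paper in sorted(papers, key=lambda p: p["twitter_followers"] / p["circulation"],
                    reverse=True):
    ratio = paper["twitter_followers"] / paper["circulation"]
    print(f"{paper['name']}: {ratio:.2f} Twitter followers per copy in circulation")
```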

You can clearly see national papers like the NYT and Washington Post rise above the rest, but for others like USA Today it’s surprising that with a circulation of about 1.7M, they have comparatively few — only 514k — Twitter followers. This may say something about the audience of that paper and whether that audience is online and using social media. For instance, Pew has reported stats that suggest that people over the age of 50 use Twitter at a much lower than average rate. Another possible explanation is that a lot of the USA Today circulation is vapor; I can’t remember how many times I’ve stayed at a hotel where USA Today was left for me by default, only to be left behind unread. Finally, maybe USA Today is just not leading an effective social strategy and they need to get better about reaching, and appealing to, the social media audience.

There are some metro papers like NY Post and LA Times that also have decent ratios, indicating they’re addressing a fairly broad national or regional audience with respect to their circulation. But the real winners in the social world are NYT and WashPost, and maybe WSJ to some extent. And in this game of web scale audiences, the big will only get bigger as they figure out how to transcend their own limited geographies and expand into the social landscape.

[Figure: Ratio of Twitter followers to circulation for the top 25 U.S. newspapers]

The Future of Automated Story Production

Note: this is cross-posted on the CUNY Tow-Knight Center for Entrepreneurial Journalism site. 

Recently there’s been a surge of interest in automatically generating news stories. The poster child is a start-up called Narrative Science which has earned coverage by the likes of the New York Times, Wired, and numerous blogs for its ability to automatically produce actual, readable stories of things like sports games or companies’ financial reports based on nothing more than numeric data. It’s impressive stuff, but it doesn’t stop me from thinking: What’s next? In the rest of this post I’ll talk about some challenges, such as story schema and modality, data context, and text transparency, that could improve future story generation engines.

Without inside information we can’t say for sure exactly how Narrative Science (NS) works, though there are some academic systems out there that provide a suitable analogue for description. There are two main phases that have to be automated in order to produce a story this way: the analysis phase and the generative phase. In the analysis phase, numeric data is statistically analyzed for things like trends, clusters, patterns, and outliers or exceptions. The analysis phase also includes the challenging aspect of condensing or selecting the most interesting things to include in the story (see Ramesh Jain’s “Extreme Stories” for more on this).
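In miniature, the two phases might look like the sketch below, which assumes nothing about Narrative Science’s actual pipeline: derive candidate facts from toy game data, rank them by a crude newsworthiness score, and realize the top-ranked facts with sentence templates, keeping a pointer back to the source values.

```python
# Toy game data; all field names are invented for illustration.
box_score = {"home": "Lions", "away": "Bears", "home_pts": 31, "away_pts": 17,
             "players": [{"name": "Smith", "pts": 21}, {"name": "Jones", "pts": 6}]}

def analyze(game):
    """Analysis phase: derive candidate facts and rank them by a crude salience score."""
    margin = game["home_pts"] - game["away_pts"]  # positive: home team won (true here)
    top = max(game["players"], key=lambda p: p["pts"])
    facts = [
        {"salience": abs(margin),
         "template": "{home} beat {away} by {margin} points.",
         "values": {"home": game["home"], "away": game["away"], "margin": abs(margin)}},
        {"salience": top["pts"],
         "template": "{name} led all scorers with {pts} points.",
         "values": top},
    ]
    return sorted(facts, key=lambda f: f["salience"], reverse=True)

def generate(game, max_sentences=2):
    """Generative phase: realize the top facts as sentences, keeping provenance."""
    return [(f["template"].format(**f["values"]), f["values"])
            for f in analyze(game)[:max_sentences]]

for sentence, source in generate(box_score):
    print(sentence, " <-- derived from:", source)
```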

After analysis and selection comes the task of figuring out an interesting structure in which to order the information in the story, a schema. Narrative Science differentiates itself primarily, I think, by paying close attention to the structure of the stories it generates. Many of the precursors to NS were stuck in the mode of presenting generated text in a chronological schema, which, as we know, is quite boring for most stories. Storytelling is really all about structure: providing the connections between aspects of the story, its actors and setting, using some rhetorical ordering that makes sense for and engages the reader. There are whole books written on how to effectively structure stories to explore different dramatic arcs or genres. Many of these different story structures have yet to be encoded in algorithms that generate text from data, so there’s lots of room for future story generation engines to explore diverse text styles, genres, and dramatic arcs.

It’s also important to remember that text has limitations on the structures and the schema it supports well. A textual narrative schema might draw readers in, but, depending on the data, a network schema or a temporal schema might expose different aspects of a story that aren’t apparent, easy, or engaging to represent in text. This leads us to another opportunity for advancement in media synthesis: better integration of textual schema with visualization schemas (e.g. temporal, hierarchical, network). For instance, there may be complementary stories (e.g. change over time, comparison of entities) that are more effectively conveyed through dynamic visualizations than through text. Combining these two modalities has been explored in some research but there is much work to do in thinking about how best to combine textual schema with different visual schema to effectively convey a story.

There has also been recent work looking into how data can be used to generate stories in the medium of video. This brings with it a whole slew of challenges different than text generation, such as the role of audio, and how to crop and edit existing video into a coherent presentation. So, in addition to better incorporating visualization into data-driven stories I think there are opportunities to think about automatically composing stories from such varied modalities as video, photos, 3D, games, or even data-based simulations. If you have the necessary data for it, why not include an automatically produced simulation to help communicate the story?

It may be surprising to know that text generation from data has actually been around for some time now. The earliest reference that I found goes back 26 years to a paper that describes how to automatically create written weather reports based on data. And then ten years ago, in 2002, we saw the launch of Newsblaster, a complex news summarization engine developed at Columbia University that took articles as a data source and produced new text-based summaries using articles clustered around news events. It worked all right, though starting from text as the data has its own challenges (e.g. text understanding) that you don’t run into if you’re just using numeric data. The downside of using just numeric data is that it is largely bereft of context. One way to enhance future story generation engines could be to better integrate text generated by numeric data together with text (collected from clusters of human-written articles) that provides additional context.

The last opportunity I’d like to touch on here relates to the journalistic ideal of transparency. I think we have a chance to embed this ideal into algorithms that produce news stories, which often articulate a communicative intent combined with rules or templates that help achieve that intent. It is largely feasible to link any bit of generated text back to the data that gave rise to that statement – in fact it’s already done by Narrative Science in order to debug their algorithms. But this linking of data to statement should be exposed publicly. In much the same way that journalists often label their graphics and visualizations with the source of their data, text generated from data should source each statement. Another dimension of transparency practiced by journalists is to be up-front about the journalist’s relationship to the story (e.g. if they’re reporting on a company that they’re involved with). This raises an interesting and challenging question of self-awareness for algorithms that produce stories. Take for instance this Forbes article produced by Narrative Science about New York Times Co. earnings. The article contains a section on “competitors”, but the NS algorithm isn’t smart enough or self-aware enough to know that it itself is an obvious competitor. How can algorithms be taught to be transparent about their own relationships to stories?

There are tons of exciting opportunities in the space of media synthesis. Challenges like exploring different story structures and schemas, providing and integrating context, and embedding journalistic ideals such as transparency will keep us more than busy in the years and, likely, decades to come.

Authoring Data-Driven Documents

Over the last few months I’ve been learning D3 (Data-Driven Documents), which is a really powerful data visualization library built for javascript. The InfoVis paper gets to the gritty details of how it supports data transformations, immediate evaluation of attributes, and a native SVG representation. These features can be more or less helpful depending on what kind of visualization you’re working on. For instance, transformations don’t really matter if you’re just building static graphs. But being able to inspect the SVG representation of your visualization (and edit it in the console) is really quite helpful and powerful.

But for all the power that D3 affords, is programming really how we should be (want to be?) authoring visualizations?

Here’s something that I recently made with D3. It’s a story about U.S. manufacturing productivity, employment, and automation told across a series of panels programmed using D3.

Now, of course, the exploratory data analysis, storyboarding, and research needed to tell this story were time-consuming. But after all that, using D3 to render the graphs I wanted was substantially more tedious and time-consuming than I would have liked. I think this was because (1) my knowledge of SVG is not fantastic and I’m still learning that, but more importantly (2) D3 supports very low-level operations that make high level activities for basic data storytelling time-consuming to implement. And yes, D3 does provide a number of helper modules and layouts, but these aren’t documented with clear examples using concrete data that would make it obvious how to easily utilize them. Having support for the library on jsFiddle, together with some very simple examples would go a long way towards helping noobs (like me!) ramp up.

But, really, where’s the Flash-like authoring tool for data visualization? Such a tool could be used to interactively manipulate a D3 visualization and, when you’re done, output HTML + CSS + D3 code to generate your graphs (including animation, transitions, etc.). The tool would also include basic graph templates that could be populated with your data and customized. Basic storytelling functions for highlighting important aspects or comparisons of the data (e.g. through animation, color, juxtaposition, etc.), or using text to annotate and explain the data could also be supported. D3 suffers from a bit of a usability problem right now, and powerful as it is, authoring stories with visualization doesn’t need to be, nor should it be, bound up in programming.

Visualization, Data, and Social Media Response

I’ve been looking into how people comment on data and visualization recently and one aspect of that has been studying the Guardian’s Datablog. The Datablog publishes stories of and about data, oftentimes including visualizations such as charts, graphs, or maps. It also has a fairly vibrant commenting community.

So I set out to gather some of my own data. I scraped 803 articles from the Datablog including all of their comments. Of this data I wanted to know if articles which contained embedded data tables or embedded visualizations produced more of a social media response. That is, do people talk more about the article if it contains data and/or visualization? The answer is yes, and the details are below.

While the number of comments could be scraped off of the Datablog site itself I turned to Mechanical Turk to crowdsource some other elements of metadata collection: (1) the number of tweets per article, (2) whether the article has an embedded data table, and (3) whether the article has an embedded visualization. I did a spot check on 3% of the results from Turk in order to assess the Turkers’ accuracy on collecting these other pieces of metadata: it was about 96% overall, which I thought was clean enough to start doing some further analysis.

So next I wanted to look at how the “has visualization” and “has table” features affect (1) tweet volume, and (2) comment volume. There are four possibilities: the article has (1) a visualization and a table, (2) a visualization and no table, (3) no visualization and a table, or (4) no visualization and no table. Since neither the tweet volume nor the comment volume is normally distributed, I log-transformed them to make them approximately normal (normality is an assumption of the following statistical tests). Moreover, there were a few outliers in the data, so anything beyond 3 standard deviations from the mean of the log-transformed variables was not considered.

For number of tweets per article:

  1. Articles with both a visualization and a table produced the largest response with an average of 46 tweets per article (N=212, SD=103.24);
  2. Articles with a visualization and no table produced an average of 23.6 tweets per article (N=143, SD=85.05);
  3. Articles with no visualization and a table produced an average of 13.82 tweets per article (N=213, SD=42.7);
  4. And finally articles with neither visualization nor table produced an average of 19.56 tweets per article (N=117, SD=86.19).

I ran an ANOVA with post-hoc Bonferroni tests to see whether the differences between these means were significant. Articles with both a visualization and a table (case 1) have a significantly higher number of tweets than cases 3 (p < .01) and 4 (p < .05). Articles with just a visualization and no data table have a higher average number of tweets per article, but the difference was not statistically significant. The takeaway is that the combination of a visualization and a data table drives a significantly higher Twitter response.
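For anyone who wants to run the same kind of analysis, here’s a sketch using pandas and scipy; the CSV layout and column names are stand-ins for however the scraped data is stored, not the actual files used here.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Assumed layout: one row per article with a tweet count and the two boolean features.
df = pd.read_csv("datablog_articles.csv")  # columns: tweets, has_vis, has_table
df["log_tweets"] = np.log1p(df["tweets"])  # log transform to approximate normality

# Drop outliers beyond 3 standard deviations of the transformed variable
z = (df["log_tweets"] - df["log_tweets"].mean()) / df["log_tweets"].std()
df = df[z.abs() <= 3]

groups = {name: g["log_tweets"].values
          for name, g in df.groupby(["has_vis", "has_table"])}

# One-way ANOVA across the four visualization/table conditions
print(stats.f_oneway(*groups.values()))

# Post-hoc pairwise t-tests with a Bonferroni correction
pairs = list(combinations(groups, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    print(a, "vs", b, "corrected p =", min(p * len(pairs), 1.0))
```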

Results for number of comments per article are similar:

  1. Articles with both a visualization and a table produced the largest response with an average of 17.40 comments per article (SD=24.10);
  2. Articles with a visualization and no table produced an average of 12.58 comments per article (SD=17.08);
  3. Articles with no visualization and a table produced an average of 13.78 comments per article (SD=26.15);
  4. And finally articles with neither visualization nor table produced an average of 11.62 comments per article (SD=17.52)

Again I ran an ANOVA with post-hoc Bonferroni tests to assess statistically significant differences between the means. This time there was only one statistically significant difference: articles with both a visualization and a table (case 1) have a higher number of comments than articles with neither a visualization nor a table (case 4); the p value was 0.04. Again, the combination of visualization and data table drove more of an audience response in terms of commenting behavior.

The overall take-away here is that people like to talk about articles (at least in the context of the audience of the Guardian Datablog) when both data and visualization are used to tell the story. Articles which used both had more than twice the number of tweets and about 1.5 times the number of comments versus articles which had neither. If getting people talking about your reporting is your goal, use more data and visualization, which, in retrospect, I probably also should have done for this blog post.

As a final thought I should note there are potential confounds in these results. For one, articles with data in them may stay “green” for longer thus slowly accreting a larger and larger social media response. One area to look at would be the acceleration of commenting in addition to volume. Another thing that I had no control over is whether some stories are promoted more than others: if the editors at the Guardian had a bias to promote articles with both visualizations and data then this would drive the audience response numbers up on those stories too. In other words, it’s still interesting and worthwhile to consider various explanations for these results.