Computational Journalism and The Reporting of Algorithms

Note: A version of the following also appears on the Tow Center blog.

Software and algorithms have come to adjudicate an ever broader swath of our lives, including everything from search engine personalization and advertising systems, to teacher evaluation, banking and finance, political campaigns, and police surveillance.  But these algorithms can make mistakes. They have biases. Yet they sit in opaque black boxes, their inner workings, their inner “thoughts” hidden behind layers of complexity.

We need to get inside that black box, to understand how they may be exerting power on us, and to understand where they might be making unjust mistakes. Traditionally, investigative journalists have helped hold powerful actors in business or government accountable. But today, algorithms, driven by vast troves of data, have become the new power brokers in society. And the automated decisions of algorithms deserve every bit as much scrutiny as other powerful and influential actors.

Today the Tow Center publishes a new Tow/Knight Brief, “Algorithmic Accountability Reporting: On the Investigation of Black Boxes” to start tackling this issue. The Tow/Knight Brief presents motivating questions for why algorithms are worthy of our investigations, and develops a theory and method based on the idea of reverse engineering that can help parse how algorithms work. While reverse engineering shows promise as a method, it will also require the dedicated investigative talents of journalists interviewing algorithms’ creators as well. Algorithms are, after all, manifestations of human design.

If you’re in NYC next week, folks from the New York Times R&D lab are pushing the idea forward in their Impulse Response Workshop. And if you’re at IRE and NICAR’s 2014 CAR Conference in Baltimore on Feb 28th, I’ll be joined by Chase Davis, Frank Pasquale, and Jeremy Singer-Vine for an in-depth discussion on holding algorithms accountable. In the mean time, have a read of the paper, and let me know your thoughts, comments, and critiques.

Making Data More Familiar with Concrete Scales

Note: A version of the following also appears on the Tow Center blog.


As part of their coverage of the Snowden leaks, last month the Guardian published an interactive to help explain what the NSA data collection activities mean for the public. Above is a screenshot of part of the piece. It allows the user to input the number of friends they have on Facebook and see a typical number of 1st, 2nd (friends-of-friends), and 3rd (friends-of-friends-of-friends) degree connections as compared to places where you typically find different numbers of people. So 250 friends is more than the capacity of a subway car, 40,850 friends-of-friends is more than would fit in Fenway Park, and 6.7 million 3rd degree connections is bigger than the population of Massachusetts.

When we tell stories with data, it can be hard for readers to grasp units or measures that are outside of normal human experience or outside of their own personal experience. How much *is* 1 trillion dollars, or 200 calories, really? Unless you’re an economist or a nutritionist, respectively, it might be hard to say. Abstract measures and units benefit from being made more concrete. The idea behind the Guardian interactive was to take something abstract, like a big number of people, and compare it to something more spatially familiar and tangible to help drive it home and make it real.

Researchers Fanny Chevalier, Romain Vuillemot, and Guia Gali have been studying the use of such concrete scales in visualization and recently published a paper detailing some of the challenges and practical steps we can use to more effectively employ these kinds of scales in data journalism and data visualization.

In the paper they describe a few different strategies for making concrete scales, including unitization, anchoring, and analogies. Shown in the figure below, (a) unitization is the idea of re-expressing one object in terms of a collection of objects that may be more familiar (e.g. the mass of Saturn is 97 times that of Earth); (b) anchoring uses a familiar object, like the size of a match head, to make the size of another, unfamiliar object (e.g. a tick in this case) more concrete; and (c) analogies make parallel comparisons to familiar objects (e.g. atom is to marble as human head is to earth).

All of these techniques are really about visual comparison to the more familiar. But the familiar isn’t necessarily exact. For instance, if I were to compare the height of the Empire State Building to a number of people stacked up, I would need to use the average height of a person, which is really an idealized approximation. So it’s important to think about the precision of the visual comparisons you might be setting up with concrete scales.

Another strategy often used with concrete scales is containment, which can be useful to communicate impalpable volumes or collections of material. For example you might want to make visible the amount of sugar in different sizes of soda bottles by filling plastic bags with different amounts of granular sugar. Again, this is an approximate comparison but also makes it more familiar and material.

So, how can you design data visualizations to effectively use concrete scales? First ask whether the unit is unfamiliar, or whether its magnitude is so extreme that it’s difficult to comprehend. Then find a comparison unit that is more familiar to people. Does it make sense to unitize, anchor, or use an analogy? And if you use an anchor or container, which one should you choose? The answers to these questions will depend on your particular design situation as well as the semantics of the data you’re working with. A number of examples that the researchers have tagged are available online.
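To make the unitization and anchoring ideas concrete in code, here is a minimal sketch in Python. The anchor list and capacity figures are rough, illustrative values of my own choosing (not numbers from the Guardian piece or the paper); the function simply expresses an abstract count of people as a multiple of the largest familiar anchor it exceeds.

```python
# Sketch: express an abstract count of people as a multiple of a familiar anchor.
# The anchor capacities below are rough, illustrative figures, not exact values.
FAMILIAR_ANCHORS = [
    ("a subway car", 200),                           # approximate crush capacity
    ("Fenway Park", 37_000),                         # approximate seating capacity
    ("the population of Massachusetts", 6_600_000),
]

def concrete_scale(count):
    """Return a sentence comparing `count` people to the largest anchor it exceeds."""
    best_name, best_size = None, None
    for name, size in FAMILIAR_ANCHORS:
        if count >= size:
            best_name, best_size = name, size
    if best_name is None:
        return f"{count:,} people"
    return f"{count:,} people, about {count / best_size:.1f} times {best_name}"

print(concrete_scale(40_850))     # compares to Fenway Park
print(concrete_scale(6_700_000))  # compares to Massachusetts
```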

The individual nature of “what is familiar” also raises the question of personalizing concrete scales. Michael Keller’s work for Al Jazeera lets you compare the number of refugees from the Syrian conflict to a geographic extent in the US, essentially letting the user’s own familiarity with geography guide what area they want to compare as an anchor. What if this type of personalization could also be automated? Consider logging into Facebook or Twitter and having the visualization adapt its concrete scales to the places, objects, or organizations you’re most familiar with based on your profile information. This type of automated adaptation could help make such visual depictions of data much more personally relevant and interesting.

Even though concrete scales are often used in data visualizations in the media, some open questions remain. How do we decide whether an anchor or unit is “familiar,” and what makes one concrete unit better than another? Perhaps some scales make people feel they understand a visualization better, or help them remember it longer. These are still questions for empirical research.

Storytelling with Data Visualization: Context is King

Note: A version of the following also appears on the Tow Center blog.

Data is like a freeze-dried version of reality, abstracted sometimes to the point where it can be hard to recognize and understand. It needs some rehydration before it becomes tasty (or even just palatable) storytelling material again — something that visualization can often help with. But to fully breathe life back into your data, you need to crack your knuckles and add a dose of written explanation to your visualizations as well. Text provides that vital bit of context layered over the data that helps the audience come to a valid interpretation of what it really means.

So how can you use text and visualization together to provide that context and layer a story over your data? Some research recently published by my collaborators at the University of Michigan and me offers some insights.

In most journalistic visualization, context is added to data visualization through the use of labels, captions, and other annotations — texts — of various kinds. Indeed, on the Economist Graphic Detail blog, visualizations not only have integrated textual annotations but also an entire one-to-two-paragraph introductory article associated with them. In addition to adding an angle and story to the piece, such contextual journalism helps flesh out what the data means and guides the reader’s interpretation towards valid inferences from the data. Textual annotations integrated directly with a visualization can further guide the user’s interactions, emphasizing certain points, prioritizing particular interpretations of the data, or pre-empting the user’s curiosity upon seeing a salient outlier, aberration, or trend.

To answer the question of how textual annotations function as story contextualizers in online news visualization, we analyzed 136 professionally made news visualizations produced by the New York Times and the Guardian between 2000 and July 2012. Of course we found text used for everything from axis labels, author information, sources, and data provenance, to instructions, definitions, and legends, but we were less interested in studying these kinds of uses than in annotations that were more related to data storytelling.

Based on our analysis we recognized two underlying functions for annotations: (1) observational, and (2) additive. Observational annotations provide context by supporting reflection on a data value or group of values that are depicted in the visualization. These annotations facilitate comparisons and often highlight or emphasize extreme values or other outliers. For interactive graphics they are sometimes revealed when hovering over a visual element.

A basic form of observational messaging is apparent in the following example from the New York Times, showing the population pyramid of the U.S. On the right of the graphic, text clearly indicates observations of the total number and fraction of the population expected to be over age 65 by 2015. This is information that can be observed in the graph but is being reinforced through the use of text.

Another example from the Times shows how observational annotations can be used to highlight and label extremes on a graph. In the chart below, the U.S. budget forecast is depicted, and the low point of 2010 is highlighted with a yellow circle together with an annotation. The value and year of that point are already visible in the graph, which is what makes this kind of annotation observational. Consider using observational annotations when you want to underscore something that’s visible in the visualization, but which you really want to make sure the user sees, or when there is an interesting comparison that you would like to draw the user’s attention towards.

On the other hand, additive annotation provides context that is external to the visual representation and not clearly depicted via the data. These are things that are relevant to the topic or to understanding the data, like background or contemporaneous events or actions. It’s up to you to decide which dimensions of who, what, where, when, why, and how are relevant. If you think the viewer needs to be aware of something in order to interpret the data correctly, then an additive annotation might be appropriate.

The following example from The Minneapolis Star Tribune shows changes in home prices across counties in Minnesota with reference to the peak of the housing bubble, a key bit of additive annotation attached to the year 2007. At the same time, the graphic also uses observational annotation (on the right side) by labeling the median home price and percent change since 2007 for the selected county.

Use of these types of annotation is very prevalent; in our study of 136 examples we found that 120 (88.2%) used at least one of these forms of annotation. We also looked at the relative use of each, shown in the next figure. Observational annotations were used in just shy of half of the cases, whereas additive annotations were used in 73%.

Another dimension to annotation is what scope of the visualization is being referenced: an individual datum, a group of data, or the entire view (e.g. a caption-like element). We tabulated the prevalence of these annotation anchors and found that single datum annotations are the most frequently used (74%). The relative usage frequencies are shown in the next figure. Your choice of what scope of the visualization to annotate will often depend on the story you want to tell, or on what kinds of visual features are most visually salient, such as outliers, trends, or peaks. For instance, trends that happen over longer time-frames in a line-graph might benefit from a group annotation to indicate how a collection of data points is trending, whereas a peak in a time-series would most obviously benefit from an annotation calling out that specific data point.

The two types of annotation, and three types of annotation anchoring are summarized in the following chart depicting stock price data for Apple. Annotations A1 and A2 show additive annotations attached to the whole view, and to a specific date in the view, whereas O1 and O2 show observational annotations attached to a single datum and a group of data respectively.

As we come to better understand how to tell stories with text and visualization together, new possibilities also open up for how to integrate text computationally or automatically with visualization.

In our research we used the above insights about how annotations are used by professionals to build a system that analyzes a stock time series (together with its trade volume data) to look for salient points and automatically annotate the series with key bits of additive context drawn from a corpus of news articles. By ranking relevant news headlines and then deriving graph annotations, we were able to automatically generate contextualized stock charts and create a user experience where users felt they had a better grasp of the trends and oscillations of the stock.
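For readers who want a feel for the general approach, here is a rough sketch, not the Contextifier system itself; the data structures, scoring, and thresholds are assumptions of my own for illustration. It finds salient points in a price series by day-over-day change and attaches the nearby headline that shares the most terms with the company.

```python
# Rough sketch (not the Contextifier system itself): find salient points in a
# stock price series and attach the nearby news headline that best matches the
# company's terms. Data structures here are assumptions for illustration.
from datetime import date

def salient_points(series, k=3):
    """Return the k dates with the largest day-over-day percent change."""
    days = sorted(series)
    changes = [(abs(series[b] - series[a]) / series[a], b)
               for a, b in zip(days, days[1:])]
    return [d for _, d in sorted(changes, reverse=True)[:k]]

def annotate(series, headlines, company_terms, window_days=3):
    """Pair each salient date with the nearby headline sharing the most terms."""
    annotations = {}
    for day in salient_points(series):
        candidates = [(len(company_terms & set(text.lower().split())), text)
                      for d, text in headlines
                      if abs((d - day).days) <= window_days]
        if candidates:
            annotations[day] = max(candidates)[1]
    return annotations

# Hypothetical usage:
prices = {date(2013, 1, 2): 100.0, date(2013, 1, 3): 101.0,
          date(2013, 1, 4): 92.0, date(2013, 1, 7): 93.5}
news = [(date(2013, 1, 4), "Apple shares slide after supplier report")]
print(annotate(prices, news, {"apple", "iphone", "shares"}))
```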

On one hand we have the fully automated scenario, but in the future, more intelligent graph authoring tools for journalists might also incorporate such automation to suggest possible annotations for a graph, which an editor could then tweak or re-write before publication. So not only can the study of news visualizations help us understand the medium better and communicate more effectively, but it can also enable new forms of computational journalism to emerge. For all the details please see our research paper, “Contextifier: Automatic Generation of Annotated Stock Visualizations.”

Algorithmic Defamation: The Case of the Shameless Autocomplete

Note: A version of the following also appears on the Tow Center blog.

In Germany, a man recently won a legal battle with Google over the fact that when you searched for his name, the autocomplete suggestions connected him to “scientology” and “fraud,” two things that he felt had defamatory insinuations. As a result of losing the case, Google is now compelled to remove defamatory suggestions from autocomplete results when notified, in Germany at least.

Court cases arising from autocomplete defamation aren’t just happening in Germany though. In other European countries, like Italy, France, and Ireland, and as far afield as Japan and Australia, people (and corporations) have brought suit alleging these algorithms defamed them by linking their names to everything from crime and fraud to bankruptcy or sexual conduct. In some cases such insinuations can have real consequences for finding a job or doing business. New services, such as brand.com’s “Google Suggest Plan,” have even arisen to help people manipulate, and thus avoid, negative connotations in search autocompletions.

The Berkman Center’s Digital Media Law Project (DMLP) defines a defamatory statement generally as, “a false statement of fact that exposes a person to hatred, ridicule or contempt, lowers him in the esteem of his peers, causes him to be shunned, or injures him in his business or trade.” By associating a person’s name with some unsavory behavior it would seem indisputable that autocomplete algorithms can indeed defame people.

So if algorithms like autocomplete can defame people or businesses, our next logical question might be to ask how to hold those algorithms accountable for their actions. Considering the scale and difficulty of monitoring such algorithms, one approach would be to use more algorithms to keep tabs on them and try to find instances of defamation hidden within their millions (or billions) of suggestions.

To try out this approach I automatically collected data on both Google and Bing autocompletions for a number of different queries relating to public companies and politicians. I then filtered these results against keyword lists relating to crime and sex in order to narrow in on potential cases of defamation. I used a list of the corporations on the S&P 500 to query the autocomplete APIs with the following templates, where “X” is the company name: “X,” “X company,” “X is,” “X has,” “X company is,” and “X company has.” And I used a list of U.S. congresspeople from the Sunlight Foundation to query for each person’s first and last name, as well as adding either “representative” or “senator” before their name. The data was then filtered using a list of sex-related keywords, and words related to crime collected from the Cambridge US dictionary in order to focus on a smaller subset of the almost 80,000 autosuggestions retrieved.
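As a simplified sketch of the filtering step described above, one might do something like the following. The file names, two-column layout, and keyword files are my own assumptions for illustration, not the exact pipeline used.

```python
# Sketch of the filtering step: keep only autocomplete suggestions that contain
# a crime- or sex-related keyword. File names and column layout are hypothetical.
import csv

def load_keywords(path):
    """Read one keyword per line into a lowercase set."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_suggestions(rows, keywords):
    """rows: iterable of (query, suggestion) pairs. Keep rows that hit a keyword."""
    hits = []
    for query, suggestion in rows:
        if set(suggestion.lower().split()) & keywords:
            hits.append((query, suggestion))
    return hits

keywords = load_keywords("crime_keywords.txt") | load_keywords("sex_keywords.txt")
with open("autocomplete_results.csv") as f:
    flagged = filter_suggestions(csv.reader(f), keywords)
```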

Among the corporate autocompletions that I filtered and reviewed, there were twenty-four instances that could be read as statements or assertions implicating the company in everything from corruption and scams to fraud and theft. For instance, querying Bing for “Torchmark” returns as the second suggestion, “torchmark corporation job scam.” Without really digging deeply it’s hard to tell if Torchmark corporation is really involved in some form of scam, or if there’s just some rumors about scam-like emails floating around. If those rumors are false, this could indeed be a case of defamation against the company. But this is a dicey situation for Bing, since if they filtered out a rumor that turned out to be true it might appear they were trying to sweep a company’s unsavory activities under the rug. People would ask: Is Bing trying to protect this company? At the same time they would be doing a disservice to their users by not steering them clear of a scam.

While looking through the autocompletions returned from querying for congresspeople, it became clear that a significant issue here relates to name collisions. For relatively generic congressperson names like “Gerald Connolly” or “Joe Barton” there are many other people on the internet with the same names. And some of those people did bad things. So when you Google “Gerald Connolly” one suggestion that comes up is “gerald connolly armed robbery,” not because Congressman Gerald Connolly robbed anyone but because someone else in Canada with the same name did. If you instead query for “representative Gerald Connolly” the association goes away; adding “representative” successfully disambiguates the two Connollys. The search engine has it tough though: without a disambiguating term, how should it know whether you’re looking for the congressman or a robber? Other cases may be more clear-cut instances of defamation, such as Bing suggesting “joe barton scam” for “Joe Barton,” which was not corrected when adding the title “representative” to the front of the query. That seems to be a more legitimate instance of defamation, since even with the disambiguation it’s still suggesting the representative is associated with a scam. And with a bit more searching around it’s also clear there is a scam related to a Joe Barton, just not the congressman.

Some of the unsavory things that might hurt someone’s reputation in autocomplete suggestions could be true though. For instance, the autocompletion of Representative “Darrell Issa” to “Darrell Issa car theft” is a correct association arising from his involvement in three separate car theft cases (for which his brother ultimately took the rap). To be considered defamation, the statement must actually be false, which makes it that much harder to write an algorithm that can find instances of real defamation. Unless algorithms can be developed to detect rumor and falsehood, you’ll always need a person assessing whether an instance of potential defamation is really valid. Still, such tips on what might be defamatory can help filter and focus attention.

Understanding defamation from a legal standpoint brings in even more complexity. Even something that seems, from a moral point of view, defamatory might not be considered so by a court of law. Each state in the U.S. is a bit different in how it governs defamation. A few key nuances relevant to the court’s understanding of defamation relate to perception and intent.

First of all, a statement must be perceived as fact and not opinion in order to be considered defamation by the court. So how do people read search autocompletions? Do they see them as collective opinions or rumors reflecting the zeitgeist, or do they perceive them as statements of fact because of their framing as a result from an algorithm? As far as I know this is an open question for research. If autocompletions are read as opinion, then it might be difficult to ever win a defamation case in the U.S. against such an algorithm.

For defamation suits against public figures, intent also becomes an important factor to consider. The plaintiff must prove “actual malice” with regard to the defamatory statement, which means that a false statement was published either with actual knowledge of its falsity, or with reckless disregard for its falsity. But can an algorithm ever be truly malicious? If you use the argument that autocompletions are just aggregations of what others have already typed in, then actual malice could certainly arise from a group of people systematically manipulating the algorithm. Otherwise, the algorithm would have to have some notion of truth, and be “aware” that it was autocompleting something inconsistent with its knowledge of that truth. This could be especially challenging for things whose truth changes over time, or for rumors that may have a social consensus but still be objectively false. So while there have been attempts at automating fact-checking, I think this is a long way off.

Of course this may all be moot under Section 230 of the Communications Decency Act, which states that, “no provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.” Given that search autocompletions are based on queries that real people at one time typed into a search box, it would seem Google has a broad protection under the law against any liability from republishing those queries as suggestions. It’s unclear though, at least to me, if recombining and aggregating data from millions of typed queries can really be considered “re-publishing” or if it should rather be considered publishing anew. I suppose it would depend on the degree of transformation of the input query data into suggestions.

Whether it’s Google’s algorithms creating new snippets of text as autocomplete suggestions, or Narrative Science writing entire articles from data, we’re entering a world where algorithms are synthesizing communications that may in some cases run into moral (or legal) considerations like defamation. In print we call defamation libel; when orally communicated we call it slander. We don’t yet have a word for the algorithmically reconstituted defamation that arises when millions of non-public queries are synthesized and publicly published by an aggregative intermediary. Still, we might try to hold such algorithms to account by using yet more algorithms to systematically assess and draw human attention to possible breaches of trust. It may be some time yet, if ever, before we can look to the U.S. court system for adjudication.

Sex, Violence, and Autocomplete Algorithms: Methods and Context

In my Slate article “Sex, Violence, and Autocomplete Algorithms,” I use a reverse-engineering methodology to better understand what kinds of queries get blocked by Google and Bing’s autocomplete algorithms. In this post I want to pull back the curtains a bit to talk about my process as well as add some context to the data that I gathered for the project.

To measure what kinds of sex terms get blocked I first found a set of sex-related words that are part of a larger dictionary called LIWC (Linguistic Inquiry and Word Count) which includes painstakingly created lists of words for many different concepts like perception, causality, and sex among others. It doesn’t include a lot of slang though, so for that I augmented my sex-word list with some more gems pulled from the Urban Dictionary, resulting in a list of 110 words. The queries I tested included the word by itself, as well as in the phrase “child X” in an attempt to identify suggestions related to child pornography.

For the violence-related words that I tested, I used a set of 348 words from the Random House “violent actions” list, which includes everything from the relatively innocuous “bop” to the more ruthless “strangle.” To construct queries I put the violent words into two phrases: “How to X” and “How can I X.”
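A minimal sketch of how such query permutations can be generated from word lists and phrase templates follows; the real word lists aren’t reproduced here, so the short placeholder lists below are stand-ins.

```python
# Sketch: generate query permutations from word lists and phrase templates.
# The real word lists (LIWC + Urban Dictionary terms, Random House "violent
# actions") are not reproduced here; the lists below are placeholders.
SEX_TEMPLATES = ["{w}", "child {w}"]
VIOLENCE_TEMPLATES = ["how to {w}", "how can i {w}"]

def build_queries(words, templates):
    return [t.format(w=w) for w in words for t in templates]

sex_words = ["..."]        # 110 terms in the study
violence_words = ["..."]   # 348 terms in the study
queries = (build_queries(sex_words, SEX_TEMPLATES)
           + build_queries(violence_words, VIOLENCE_TEMPLATES))
```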

Obviously there are many other words and permutations of query templates that I might have used. One of the challenges with this type of project is how to sample data and where to draw the line on what to collect.

With lists of words in hand the next step was to prod the APIs of Google and Bing to see what kind of autocompletions were returned (or not) when queried. The Google API for autocomplete is undocumented, though I found and used some open-source code that had already reverse engineered it. The Bing API is similarly undocumented, but a developer thread on the Bing blog mentions how to access it. I constructed each of my query words and templates and, using these APIs, recorded what suggestions were returned.
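Because both APIs are undocumented, any code is necessarily a guess at a moving target. As one hedged illustration, the commonly cited Google suggest endpoint around that time could be queried roughly like this; the URL, parameters, and JSON response shape are assumptions that may well have changed, and the Bing endpoint is omitted entirely.

```python
# Hedged sketch only: this endpoint is undocumented and may change or disappear.
# It is assumed here to return JSON of the form [query, [suggestion, ...]].
import json
import urllib.parse
import urllib.request

def google_suggestions(query):
    url = ("https://suggestqueries.google.com/complete/search"
           "?client=firefox&q=" + urllib.parse.quote(query))
    with urllib.request.urlopen(url) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload[1]  # the list of suggestion strings

# e.g., record (q, google_suggestions(q)) for each query, with a polite delay between calls
```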

An interesting nuance to the data I collected is that both APIs return more responses than actually show up in either user interface. The Google API returns 20 results, but the UI shows only 4 or 10 depending on how preferences are set. The Bing API returns 12 results, but the UI shows only 8. Data returned from the API that never appears in the UI is less interesting since users will never encounter it in their daily usage. But I should mention that it’s not entirely clear what happens with the API results that aren’t shown. It’s possible some of them could be shown during the personalization step of the algorithm (which I didn’t test).

The queries were run and data collected on July 2nd, 2013, which is important to mention since these services can change without notice. Indeed, Google claims to change its search algorithm hundreds of times per year. Autocomplete suggestions can also vary by geography or according to who’s logged in. Since the APIs were accessed programmatically, and no one was logged in, none of the results collected reflect any personalization that the algorithm performs. However, the results may still reflect geography since figuring out where your computer is doesn’t require a log in. The server I used to collect data is located in Delaware. It’s unclear how Google’s “safe search” settings might have affected the data I collected via their API. The Bing spokesperson I was in touch with wrote, “Autosuggest adheres to a ‘strict’ filter policy for all suggestions and therefore applies filtering to all search suggestions, regardless of the SafeSearch settings for the search results page.”

In the spirit of full transparency, here is a .csv of all the queries and responses that I collected.

The Rhetoric of Data

Note: A version of the following also appears on the Tow Center blog.

In the 1830s, abolitionists discovered the rhetorical potential of re-conceptualizing southern newspaper advertisements as data. They “took an undifferentiated pile of ads for runaway slaves, wherein dates and places were of primary importance … and transformed them into data about the routine and accepted torture of enslaved people,” writes Ellen Gruber Garvey in the book Raw Data is an Oxymoron. By creating topical dossiers of ads, the horrors of slavery were catalogued and made accessible for writing abolitionist speeches and novels. The South’s own media had been re-contextualized into a persuasive weapon against itself, a rhetorical tool to bolster the abolitionists’ arguments.

The Latin etymology of “data” means “something given,” and though we’ve largely forgotten that original definition, it’s helpful to think about data not as facts per se, but as “givens” that can be used to construct a variety of different arguments and conclusions; they act as a rhetorical basis, a premise. Data does not intrinsically imply truth. Yes we can find truth in data, through a process of honest inference. But we can also find and argue multiple truths or even outright falsehoods from data.

Take for instance the New York Times interactive, “One Report, Diverging Perspectives,” which wittingly highlights this issue. Shown below, the piece visualizes jobs and unemployment data from two perspectives, emphasizing the differences in how a Democrat or a Republican might see and interpret the statistics. A rising tide of “data PR,” often manifesting as slick and pointed infographics, won’t be so upfront about the perspectives being argued though. Advocacy organizations can now collect their own data, or just develop their own arguments from existing data, to support their cause. What should you be looking out for as a journalist when assessing a piece of data PR? And how can you improve your own data journalism by ensuring the argument you develop is a sound one?

[Image: “One Report, Diverging Perspectives” (New York Times)]

Contextual journalism—adding interpretation or explanation to a story—can and should be applied to data as much as to other forms of reporting. It’s important because the audience may need to know the context of a dataset in order to fully understand and evaluate the larger story in perspective. For instance, context might include explaining how the data was collected, defined, and aggregated, and what human decision processes contributed to its creation. Increasingly news outlets are providing sidebars or blog posts that fully describe the methodology and context of the data they use in a data-driven story. That way the context doesn’t get in the way of the main narrative but can still be accessed by the inquisitive reader.

In your process it can be useful to ask a series of contextualizing questions about a dataset, whether just critiquing the data, or producing your own story.

Who produced the data and what was their intent? Did it come from a reputable source, like a government or inter-governmental agency such as the UN, or was it produced by a third party corporation with an uncertain source of funding? Consider the possible political or advocacy motives of a data provider as you make inferences from that data, and do some reporting if those motives are unclear.

When was the data collected? Sometimes there can be temporal drift in what data means, how it’s measured, or how it should be interpreted. Is the age of your data relevant to your interpretation? For example, in 2010 the Bureau of Labor Statistics changed the definition of long-term unemployment, which can make it important to recognize that shift when comparing data from before and after the change.

Most importantly, ask what is measured in the data, how it was sampled, and what is ultimately depicted. Are data measurements defined accurately and in a way that they can be consistently measured? How was the data sampled from the world? Is the dataset comprehensive or is it missing pieces? If the data wasn’t randomly sampled, how might that bias your interpretation? Or have other errors been introduced into the data, for instance through typos or faulty OCR? Is there uncertainty in the data that should be communicated to the reader? Has the data been cropped or filtered in a way that loses a potentially important piece of context that would change its interpretation? And what about aggregation or transformation? If a dataset is offered to you with only averages or medians (i.e. aggregations), you’re necessarily missing information about how the data might be distributed, or about outliers that might make interesting stories. For data that’s been transformed through some algorithmic process, such as classification, it can be helpful to know the error rates of that transformation, as these can lead to additional uncertainty in the data.

Let’s consider an example that illustrates the importance of measurement definition and aggregation. The Economist graphic below shows the historic and forecast vehicle sales for different geographies. The story the graph tells is pretty clear: Sales in China are rocketing up while they’re declining or stagnant in North America and Europe. But look more closely. The data for Western Europe and North America is defined as an aggregation of light vehicle sales, according to the note in the lower-right corner. How would the story change if the North American data included truck, SUV, and minivan sales? The story you get from these kinds of data graphics can depend entirely on what’s aggregated (or not aggregated) together in the measure. Aggregations can serve as a tool of obfuscation, whether intentional or not.

[Image: The Economist chart of historic and forecast vehicle sales by geography]

It’s important to recognize and remember that data does not equal truth. It’s rhetorical by definition and can be used for truth finding or truth hiding. Being vigilant in how you develop arguments from data and showing the context that leads to the interpretation you make can only help raise the credibility of your data-driven story.


Data on the Growth of CitiBike

On May 27th New York City launched its city-wide bike sharing program, CitiBike. I tried it out last weekend; it was great, aside from a few glitches checking out and checking in the bikes. It made me curious about the launch of the program and how it’s growing, especially since the agita between bikers and drivers is becoming quite palpable. Luckily, the folks over at the CitiBike blog have been posting daily stats about the number of rides every day, the average duration of rides, and even the most popular stations for starting and stopping a ride. If you’re interested in hacking more on the data, there’s even a meetup happening next week.

Below is my simple line chart of the total number of daily riders (they measure that as of 5pm each day). Here’s the data. You might look at the graph and wonder, “What happened June 7th?” That was the monsoon we had. Yeah, turns out bikers don’t like rain.

[Image: line chart of total daily CitiBike riders]
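For anyone who wants to reproduce a chart like this, here’s a minimal sketch assuming the daily stats have been saved to a CSV; the file name and column names (“date”, “riders”) are my assumptions about how the data might be stored, not the CitiBike blog’s actual format.

```python
# Minimal sketch: plot total daily CitiBike riders from a CSV.
# File name and column names are assumptions, not the CitiBike blog's format.
import pandas as pd
import matplotlib.pyplot as plt

daily = pd.read_csv("citibike_daily.csv", parse_dates=["date"]).sort_values("date")

plt.plot(daily["date"], daily["riders"])
plt.title("CitiBike: total daily riders (as of 5pm each day)")
plt.xlabel("Date")
plt.ylabel("Riders")
plt.tight_layout()
plt.show()
```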

51% Foreign: Algorithms and the Surveillance State

In New York City there’s a “geek squad” of analysts that gathers all kinds of data, from restaurant inspection grades and utility usage to neighborhood complaints, and uses it to predict how to improve the city. The idea behind the team is that with more and more data available about how the city is running—even if it’s messy, unstructured, and massive—the government can optimize its resources by keeping an eye out for what needs its attention most. It’s really about city surveillance, and of course acting on the intelligence produced by that surveillance.

One story about the success of the geek squad comes to us from Viktor Mayer-Schonberger and Kenneth Cukier in their book “Big Data”. They describe the issue of illegal real-estate conversions, which involves sub-dividing an apartment into smaller and smaller units so that it can accommodate many more people than it should. With the density of people in such close quarters, illegally converted units are more prone to accidents, like fire. So it’s in the city’s—and the public’s—best interest to make sure apartment buildings aren’t sub-divided like that. Unfortunately there aren’t very many inspectors to do the job. But by collecting and analyzing data about each apartment building the geek squad can predict which units are more likely to pose a danger, and thus determine where the limited number of inspectors should focus their attention. Seventy percent of inspections now lead to eviction orders from unsafe dwellings, up from 13% without using all that data—a clear improvement in helping inspectors focus on the most troubling cases.

Consider a different, albeit hypothetical, use of big data surveillance in society: detecting drunk drivers. Since there are already a variety of road cameras and other traffic sensors available on our roads, it’s not implausible to think that all of this data could feed into an algorithm that says, with some confidence, that a car is exhibiting signs of erratic, possibly drunk driving. Let’s say, similar to the fire-risk inspections, that this method also increases the efficiency of the police department in getting drunk drivers off the road—a win for public safety.

But there’s a different framing at work here. In the fire-risk inspections the city is targeting buildings, whereas in the drunk driving example it’s really targeting the drivers themselves. This shift in framing—targeting the individual as opposed to the inanimate—crosses the line into invasive, even creepy, civil surveillance.

So given the degree to which the recently exposed government surveillance programs target individual communications, it’s not as surprising that, according to Gallup, more Americans disapprove (53%) than approve (37%) of the federal government’s program to “compile telephone call logs and Internet communications.” This is despite the fact that such surveillance could in a very real way contribute to public safety, just as with the fire-risk or drunk driving inspections.

At the heart of the public’s psychological response is the fear and risk of surveillance uncovering personal communication, of violating our privacy. But this risk is not a foregone conclusion. There’s some uncertainty and probability around it, which makes it that much harder to understand the real risk. In the Prism program, the government surveillance program that targets internet communications like email, chats, and file transfers, the Washington Post describes how analysts use the system to “produce at least 51 percent confidence in a target’s ‘foreignness’”. This test of foreignness is tied to the idea that it’s okay (legally) to spy on foreign communications, but that it would breach FISA (the Foreign Intelligence Surveillance Act), as well as 4th amendment rights for the government to do the same to American citizens.

Platforms used by Prism, such as Google and Facebook, have denied that they give the government direct access to their servers. The New York Times reported that the system in place is more like having a locked mailbox where the platform can deposit specific data requested pursuant to a court order from the Foreign Intelligence Surveillance Court. But even if such requests are legally targeted at foreigners and have been faithfully vetted by the court, there’s still a chance that ancillary data on American citizens will be swept up by the government. “To collect on a suspected spy or foreign terrorist means, at minimum, that everyone in the suspect’s inbox or outbox is swept in,” as the Washington Post writes. And typically data is collected not just of direct contacts, but also contacts of contacts. This all means that there’s a greater risk that the government is indeed collecting data on many Americans’ personal communications.

Algorithms, and a bit of transparency on those algorithms, could go a long way to mitigating the uneasiness over domestic surveillance of personal communications that American citizens may be feeling. The basic idea is this: when collecting information on a legally identified foreign target, for every possible contact that might be swept up with the target’s data, an automated classification algorithm can be used to determine whether that contact is more likely to be “foreign” or “American”. Although the algorithm would have access to all the data, it would only output one bit of metadata for each contact: is the contact foreign or not? Only if the contact was deemed highly likely to be foreign would the details of that data be passed on to the NSA. In other words, the algorithm would automatically read your personal communications and then signal whether or not it was legal to report your data to intelligence agencies, much in the same way that Google’s algorithms monitor your email contents to determine which ads to show you without making those emails available for people at Google to read.

The FISA court implements a “minimization procedure” in order to curtail incidental data collection from people not covered in the order, though the exact process remains classified. Marc Ambinder suggests that “the NSA automates the minimization procedures as much as it can” using a continuously updated score that assesses the likelihood that a contact is foreign. Indeed, it seems at least plausible that the algorithm I suggest above could already be a part of the actual minimization procedure used by the NSA.

The minimization process reduces the creepiness of unfettered government access to personal communications, but at the same time we still need to know how often such a procedure makes mistakes. In general there are two kinds of mistakes that such an algorithm could make, often referred to as false positives and false negatives. A false negative in this scenario would indicate that a foreign contact was categorized by the algorithm as an American. Obviously the NSA would like to avoid this type of mistake since it would lose the opportunity to snoop on a foreign terrorist. The other type of mistake, false positive, corresponds to the algorithm designating a contact as foreign even though in reality it’s American. The public would want to avoid this type of mistake because it’s an invasion of privacy and a violation of the 4th amendment. Both of these types of errors are shown in the conceptual diagram below, with the foreign target marked with an “x” at the center and ancillary targets shown as connected circles (orange is foreign, blue is American citizen).

[Image: conceptual diagram of false positives and false negatives, with the foreign target marked “x” at the center and ancillary contacts as connected circles]

It would be a shame to disregard such a potentially valuable tool simply because it might make mistakes from time to time. To make such a scheme work we first need to accept that the algorithm will indeed make mistakes. Luckily, such an algorithm can be tuned to make more or less of either kind of mistake. As false positives are tuned down, false negatives will often increase, and vice versa. The advantage for the public would be that it could have a real debate with the government about what magnitude of mistakes is reasonable. How many Americans being labeled as foreigners, and thus subject to unwarranted search and seizure, is acceptable to us? None? Some? And what’s the trade-off in terms of how many would-be terrorists might slip through if we tuned the false positives down?
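As a toy illustration of that tuning (this is not any real minimization procedure, and the scores and labels below are invented for the example), sweeping a classifier’s decision threshold shows how the two error rates trade off against each other:

```python
# Toy sketch (not any real minimization procedure): sweep a decision threshold
# over a classifier's "foreignness" scores and report the two error rates.
def error_rates(scored_contacts, threshold):
    """scored_contacts: (score, is_foreign) pairs; flag as foreign if score >= threshold."""
    fp = fn = foreign = american = 0
    for score, is_foreign in scored_contacts:
        flagged = score >= threshold
        if is_foreign:
            foreign += 1
            if not flagged:
                fn += 1   # a foreign contact treated as American (missed)
        else:
            american += 1
            if flagged:
                fp += 1   # an American swept in as "foreign"
    return fp / american, fn / foreign

# Hypothetical validation data: (foreignness score, true label)
validation = [(0.9, True), (0.7, True), (0.6, False), (0.4, False), (0.2, False)]
for t in (0.3, 0.5, 0.7):
    fpr, fnr = error_rates(validation, t)
    print(f"threshold={t}: false positive rate={fpr:.2f}, false negative rate={fnr:.2f}")
```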

To begin a debate like this the government just needs to tell us how many of each type of mistake its minimization procedure makes; just two numbers. In this case, minimal transparency of an algorithm could allow for a robust public debate without betraying any particular details or secrets about individuals. In other words, we don’t particularly need to know the gory details of how such an algorithm works. We simply need to know where the government has placed the fulcrum in the tradeoff between these different types of errors. And by implementing smartly transparent surveillance maybe we can even move more towards the world of the geek squad, where big data is still ballyhooed for furthering public safety.

To Save Everything, Deliberate it Endlessly?

Evgeny Morozov’s book To Save Everything, Click Here is a worthwhile tour de force of technology criticism that will have you double-taking on everything you hold near and dear about the Internet. The book’s basic premise is a polemic against the ideas of “solutionism” (i.e., the tendency to apply efficiency-oriented, engineering fixes to societal problems) and “internet centrism” (i.e., the treatment of the internet as an infallible, ever-positive force on humanity). He covers the gamut, raising flags of caution and moral suspicion on everything from openness and transparency, to algorithms in the media, predictive policing, the quantified self, nudging, and gamification, among many others. Sardonic and bombastic as it sometimes reads, it’s quite well-written, wittingly exposing some useful critiques of our modern techno-lust culture.

As Morozov deftly points out through his many examples, once we realize that designed technologies embed values and moral judgements, we can begin to make decisions about our designed environment and society that reflect the values and morals that we deem respectful to humanity, not just for corporations or other stakeholders. He’s on the side of the people! It’s really about human dignity in the way our designed world influences both individual and collective behavior. This main thread of thinking reminds me of Batya Friedman’s work on value-sensitive design, which attempts to account for human values in a comprehensive manner throughout the design process by identifying stakeholders, benefits, values, and value conflicts to help inform design decisions.

Unfortunately the internal consistency of the book comes under some tension during the last couple chapters, when Morozov tackles the issues of nudging, the information diet, and his own solution to encouraging more deliberation and reflection.

Morozov positions nudging as “solutionism by other means.” He argues that to nudge assumes a social consensus, which may or may not in fact exist, both in terms of what is nudged as well as in which direction. The nudge assumes something is askew, which can and should be brought back into harmony. One nudge you might consider is to encourage the public to consume a more nutritious “information diet” (a la Clay Johnson’s book of the same title). But Morozov positions Johnson’s ideas as “a fairly traditional critique of how the public allocates attention to news,” the end result of which espouses the ideal that citizens should stay informed about every possible issue—clearly an impossibility. The reality, if you agree with Walter Lippmann, reads differently: citizens don’t want to know everything about everything, nor do they have time to, which is why they delegate. In critiquing nudging and the idea of the omniscient citizen, Morozov sides with Lippmann: nudging people to be experts on everything is futile.

But this is where we find the tension with what is offered in the final chapter of the book. The “solution” proffered for “solutionism” and “internet centrism” is to replace the “fetish for psychology” with a penchant for moral and political philosophy and a desire to encourage healthy, reflective deliberation by everyday users on the designs of technologies affecting society. I do agree with the general desire for more reflection in the technologies we build. But to suggest, as he does, that to do so we should design technologies to encourage users to be more reflective and deliberative is still just nudging. Moreover, his rejection of omnicompetence contradicts his argument for nudging citizens to be more deliberative: how could we expect citizens to be expert in, and care enough to deliberate on, everything? Criticizing nudging and omnicompetence and then offering them as a way forward suggests that Morozov’s real gripe is that the values embedded in nudging, as well as the solutions offered by Silicon Valley, and indeed the internet itself, are simply not his own.

Just as not every citizen is part of every public that emerges around an issue, not every citizen needs to reflect and deliberate on every given technology in society. The interested parties will deliberate, then a design will be fashioned, and the rest of society will delegate to that design, or any number of other designs. Putting on my user-experience designer hat, I believe that incessantly confronting end-users with philosophical dilemmas will ultimately prove unproductive in many contexts; people need to actually use these things, to accomplish real tasks. Can you imagine the design of an airline cockpit that constantly confronts the pilot with philosophical choices? Crash. It’s true that, in Morozov’s words, “We need to develop a better way of evaluating, comparing, and discriminating across technological fixes,” but the locus for that activity will often fall on the design side of the equation. Detailed design rationale can then make this accountable and legible to any interested public that may emerge.

Under some circumstances it may indeed make sense to facilitate additional reflection in users, but what’s lacking in the book is a solid treatment of the limitations of Morozov’s approach. When should we design for deliberation, and when should we design for efficiency? Morozov has shown us some of the things we miss when we over-emphasize design for efficiency, but not, unfortunately, what we may miss by overemphasizing design for deliberation.

Storytelling with Data: What Are the Impacts on the Audience?

Storytelling with data visualization is still very much in its “Wild West” phase, with journalism outlets blazing new paths in exploring the burgeoning craft of integrating the testimony of data together with compelling narrative. Leaders such as The New York Times create impressive data-driven presentations like 512 Paths to the White House (seen above) that weave complex information into a palatable presentation. But as I look out at the kinds of meetings where data visualizers converge, like Eyeo, Tapestry, OpenVis, and the infographics summit Malofiej, I realize there’s a whole lot of inspiration out there, and some damn fine examples of great work, but I still find it hard to get a sense of direction — which way is West, which way to the promised land?

And it occurred to me: We need a science of data-visualization storytelling. We need some direction. We need to know what makes a data story “work”. And what does a data story that “works” even mean?

Examples abound, and while we have theories for color use, visual salience and perception, and graph design that suggest how to depict data efficiently, we still don’t know, with any particular scientific rigor, which are better stories. At the Tapestry conference, which I attended, journalists such as Jonathan Corum, Hannah Fairfield, and Cheryl Phillips whipped out a staggering variety of examples in their presentations. Jonathan, in his keynote, talked about “A History of the Detainee Population,” an interactive NYT graphic (partially excerpted below) depicting how Guantanamo prisoners have, over time, slowly been moved back to their country of origin. I would say that the presentation is effective. I “got” the message. But I also realize that, because the visualization is animated, it’s difficult to see the overall trend over time — to compare one year to the next. There are different ways to tell this story, some of which may be more effective than others for a range of storytelling goals.

[Image: excerpt from “A History of the Detainee Population” (New York Times)]

Critical blogs such as The Why Axis and Graphic Sociology have arisen to try to fill the gap of understanding what works and what doesn’t. And research on visualization rhetoric has tried to situate narrative data visualization in terms of the rhetorical techniques authors may use to convey their story. Useful as these efforts are in their thick description and critical analysis, and for increasing visual literacy, they don’t go far enough toward building predictive theories of how data-visualization stories are “read” by the audience at large.

Corum, a graphics editor at NYT, has a descriptive framework to explain his design process and decisions. It describes the tensions between interactivity and story, between oversimplification and overwhelming detail, and between exploration and decoration. Other axes of design include elements such as focus versus depth and the author versus the audience. Author and educator Alberto Cairo exhibits similar sets of design dimensions in his book, “The Functional Art“, which start to trace the features along which data-visualization stories can vary (recreated below).

[Image: visualization wheel of design dimensions, recreated from Alberto Cairo’s “The Functional Art”]

Such descriptions are a great starting point, but to make further progress on interactive data storytelling we need to know which of the many experiments happening out in the wild are having their desired effect on readers. Design decisions like how and where annotations are placed on a visualization, how the story is structured across the canvas and over time, the graphical style including things like visual embellishments and novelties, as well as data mapping and aggregation can all have consequences on how the audience perceives the story. How does the effect on the audience change when modulating these various design dimensions? A science of data-visualization storytelling should seek to answer that question.

But still the question looms: What does a data story that “works” even mean? While efficiency and parsimony of visual representation may still be important in some contexts, I believe online storytelling demands something else. What effects on the audience should we measure? As data visualization researcher Robert Kosara writes in his forthcoming IEEE Computer article on the subject, “there are no clearly defined metrics or evaluation methods … Developing these will require the definition of, and agreement on, goals: what do we expect stories to achieve, and how do we measure it?”

There are some hints in recent research in information visualization for how we might evaluate visualizations that communicate or present information. We might for instance ask questions about how effectively a message is acquired by the audience: Did they learn it faster or better? Was it memorable, or did they forget it 5 minutes, 5 hours, or 5 weeks later? We might ask whether the data story spurred any personal insights or questions, and to what degree users were “engaged” with the presentation. Engaged here could mean clicks and hovers of the mouse on the visualization, how often widgets and filters for the presentation were touched, or even whether users shared or conversed around the visualization. We might ask if users felt they understood the context of the data and if they felt confident in their interpretation of the story: Did they feel they could make an informed decision on some issue based on the presentation? Credibility being an important attribute for news outlets, we might wonder whether some data story presentations are more trustworthy than others. In some contexts a presentation that is persuasive is the most important factor. Finally, since some of the best stories are those that evoke emotional responses, we might ask how to do the same with data stories.

Measuring some of these factors is as straightforward as instrumenting the presentations themselves to know where users moved their mouse, clicked, or shared. There are a variety of remote usability testing services that can already help with that. Measuring other factors might require writing and attaching survey questions to ask users about their perceptions of the experience. While the best graphics departments do a fair bit of internal iteration and testing it would be interesting to see what they could learn by setting up experiments that varied their designs minutely to see how that affected the audience along any of the dimensions delineated above. More collaboration between industry and academia could accelerate this process of building knowledge of the impact of data stories on the audience.

I’m not arguing that the creativity and boundary-pushing in data-visualization storytelling should cease. It’s inspiring looking at the range of visual stories that artists and illustrators produce. And sometimes all you really want is an amuse yeux — a little bit of visual amusement. Let’s not get rid of that. But I do think we’re at an inflection point where we know enough of the design dimensions to start building models of how to reliably know what story designs achieve certain goals for different kinds of story, audience, data, and context. We stand only to be able to further amplify the impact of such stories by studying them more systematically.