Diversity in the Robot Reporter Newsroom


The Associated Press recently announced a big new hire: A robot reporter from Automated Insights (AI) would be employed to write up to 4,400 earnings report stories per quarter. Last year, that same automated writing software produced over 300 million stories — that’s some serious scale from a single algorithmic entity.

So what happens to media diversity in the face of massive automated content production platforms like the one Automated Insights has created? The news media has done a pretty abysmal job of incorporating a balance of minority and gender perspectives, but I think we'd all like to believe that by including diverse perspectives in the reporting and editing of news we fly closer to the truth. A silver lining of the newspaper industry crash has been a profusion of smaller, more nimble media outlets, allowing for far more variability and diversity in the ideas we're exposed to.

Of course software has biases, and although the basic anatomy of robot journalists is comparable, there are variations within and among different systems, such as the style and tone that's produced as well as the editorial criteria coded into them. Algorithms are the product of a range of human choices, including criteria, parameters, and training data, which can pass along inherited, systematic biases. So while a robot reporter offers the promise of scale (and of reducing costs), we need to be wary of over-reliance on any single automated system. For the sake of media diversity, the one bot needs to fork itself and become 100,000.

We saw this unfold in microcosm over the last week. The @wikiparliament bot was launched in the UK to monitor edits to Wikipedia from IP addresses within parliament (a form of transparency and accountability for who was editing what). Within days it had been mimicked by the @congressedits bot, which was set up to monitor the U.S. Congress. What was particularly interesting about @congressedits, though, is that it was open sourced by creator Ed Summers. That allowed the bot to quickly spread and be adapted for different jurisdictions like Australia, Canada, France, Sweden, Chile, Germany, and even Russia.

Tailoring a bot for different countries is just one (relatively simple) form of adaptation, but I think diversifying bots for different editorial perspectives could similarly benefit from a platform. I would propose that we need to build an open-source news bot architecture that different news and journalistic organizations could use as a scaffolding to encode their own editorial intents, newsworthiness criteria, parameters, data sets, ranking algorithms, cultures, and souls into. By creating a flexible platform as an underlying starting point, the automated media ecology could adapt and diversify faster and into new domains or applications.
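To make the idea a bit more tangible, here is a minimal sketch, in Python, of what such a scaffolding might look like. Everything in it is hypothetical (the NewsBot class, the criterion functions, the publish callback); the point is simply that editorial judgment becomes a pluggable component rather than something baked into a single vendor's system.

```python
# Hypothetical scaffolding for an open-source news bot platform; all names are
# illustrative, not from any existing system.

class NewsBot:
    """A generic bot that a newsroom configures with its own editorial criteria."""

    def __init__(self, criteria, publish):
        self.criteria = criteria   # list of functions: event -> score (0 = ignore)
        self.publish = publish     # how this newsroom wants to publish (tweet, post, ...)

    def consider(self, event):
        # Each newsroom encodes its own newsworthiness judgments as scoring functions.
        score = sum(criterion(event) for criterion in self.criteria)
        if score > 0:
            self.publish(event, score)

# One outlet might care about edits from government IP ranges ...
def from_parliament(event):
    return 1 if event.get("ip_range") == "parliament" else 0

# ... another about large earnings surprises.
def big_earnings_surprise(event):
    return 1 if abs(event.get("earnings_delta", 0)) > 0.10 else 0

bot = NewsBot(criteria=[from_parliament], publish=lambda e, s: print("ALERT:", e, s))
bot.consider({"ip_range": "parliament", "page": "Some Article"})
```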

Such a platform would also enable the expansion of bots oriented towards different journalistic tasks. A lot of the news and information bots you find on social media these days are parrots of various ilks: they aggregate content on a particular topical niche, like @BadBluePrep, @FintechBot, and @CelebNewsBot, or for a geographical area like @North_GA, or they simply retweet other accounts based on some trigger words. Some of the more sophisticated bots do look at data feeds to generate novel insights, like @treasuryio or @mediagalleries, but there's so much more that could be done if we had a flexible bot platform.

For instance, we might consider building bots that act as information collectors and solicitors, moving away from pure content production to content acquisition. This isn't so far off really. Researchers at IBM have been working on this for a couple of years and have already built a prototype system that "automatically identifies and ask[s] targeted strangers on Twitter for desired information." The technology is oriented towards collecting accurate and up-to-date information in specific situations where crowd information may be valuable. It's relatively easy to imagine an automated news bot being launched after a major news event to identify and solicit information, facts, or photos from people most likely nearby or involved in the event. In another related project, the same group at IBM has been developing technology to identify people on Twitter who are more likely to propagate (read: retweet) information relating to public safety news alerts. Essentially they grease the gears of social dissemination by identifying just the right people, for a given topic and at a particular time, who are most likely to share the information further.

There are tons of applications for news bots just waiting for journalists to build them: factchecking, information gathering, network bridging, audience development, and more. Robot journalists don't just have to be reporters. They can be editors, or even (hush) work on the business side.

What I think we don’t want to end up with is the Facebook or Google of robot reporting: “one algorithm to rule them all”. It’s great that the Associated Press is exploring the use of these technologies to scale up their content creation, but down the line when the use of writing algorithms extends far beyond earnings reports, utilizing only one platform may ultimately lead to homogenization and frustrate attempts to build a diverse media sphere. Instead the world that we need to actively create is one where there are thousands of artisanal news bots serving communities and variegated audiences, each crafted to fit a particular context and perhaps with a unique editorial intent. Having an open source platform would help enable that, and offer possibilities to plug in and explore a host of new applications for bots as well.

The Anatomy of a Robot Journalist

Note: A version of the following also appears on the Tow Center blog.

Given that an entire afternoon was dedicated to a “Robot Journalism Bootcamp” at the Global Editors Network Summit this week, it’s probably safe to say that automated journalism has finally gone mainstream — hey it’s only taken close to 40 years since the first story writing algorithm was created at Yale. But there are still lots of ethical questions and debates that we need to sort out, from source transparency to corrections policies for bots. Part of that hinges on exactly how these auto-writing algorithms work: What are their limitations and how might we design them to be more value-sensitive to journalism?

Despite the proprietary nature of most robot journalists, the great thing about patents is that they're public. And patents have been granted to several major players in the robo-journalism space already, including Narrative Science, Automated Insights, and Yseop, making their algorithms just a little bit less opaque in terms of how they operate. More patents are in the pipeline from both heavyweights like CBS Interactive and start-ups like Fantasy Journalist. So how does a robo-writer from Narrative Science really work?

Every robot journalist first needs to ingest a bunch of data. Data rich domains like weather were some of the first to have practical natural language generation systems. Now we’re seeing a lot of robot journalism applied to sports and finance — domains where the data can be standardized and made fairly clean. The development of sensor journalism may provide entirely new troves of data for producing automated stories. Key here is having clean and comprehensive data, so if you’re working in a domain that’s still stuck with PDFs or sparse access, the robots haven’t gotten there yet.

After data is read in by the algorithm, the next step is to compute interesting or newsworthy features from the data. Basically the algorithm is trying to figure out the most critical aspects of an event, like a sports game. It has newsworthiness criteria built into its statistics. So, for example, it looks for surprising statistical deviations like minimums, maximums, or outliers, big swings and changes in a value, violations of an expectation, a threshold being crossed, or a substantial change in a predictive model. "Any feature the value of which deviates significantly from prior expectation, whether the source of that expectation is due to a local computation or from an external source, is interesting by virtue of that deviation from expectation," the Narrative Science patent reads. So for a baseball game the algorithm computes "win probability" after every play. If win probability swings substantially between two plays, it probably means something important just happened, and the algorithm adds that play to a list of events that might be worthy of inclusion in the final story.
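As a toy illustration of this kind of deviation-based detection, here is a sketch that flags plays where win probability swings by more than some threshold. The threshold and data fields are my own assumptions for illustration, not values taken from the Narrative Science patent.

```python
# Toy deviation-based newsworthiness detection; threshold and fields are assumptions.

def interesting_plays(plays, threshold=0.15):
    """Flag plays where win probability swings by more than `threshold`."""
    candidates = []
    for prev, curr in zip(plays, plays[1:]):
        delta = curr["win_probability"] - prev["win_probability"]
        if abs(delta) > threshold:
            candidates.append({"play": curr["description"], "delta": delta})
    return candidates

plays = [
    {"description": "groundout", "win_probability": 0.52},
    {"description": "two-run homer", "win_probability": 0.81},
    {"description": "strikeout", "win_probability": 0.79},
]
print(interesting_plays(plays))   # the homer gets flagged as a candidate event
```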

Once some interesting features have been identified, angles are then selected from a pre-authored library. Angles are explanatory or narrative structures that provide coherence to the overall story. Basically they are patterns of events, circumstances, entities, and their features. An angle for a sports story might be “back-and-forth horserace”, “heroic individual performance”, “strong team effort”, or “came out of a slump”. Certain angles are triggered according to the presence of certain derived features (from the previous step). Each angle is given an importance value from 1 to 10 which is then used to rank that angle against all of the other proposed angles.

Once the angles have been determined and ordered they are linked to specific story points, which connect back to individual pieces of data like names of players or specific numeric values like score. Story points can also be chosen and prioritized to account for personal interests such as home team players. These points can then be augmented with additional factual content drawn from internet databases such as where a player is from, or a quote or picture of them.

The last step the robot journalist takes is natural language generation, which for the Narrative Science system is done by recursively traversing all of the angle and story point representations and using phrasal generation routines to generate and splice together the actual English text. This is probably by far the most straightforward aspect of the entire pipeline — it’s pretty much just fancy templates.
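To give a bare-bones sense of what "fancy templates" means in practice, here is a sketch of phrasal generation by slot-filling. The angles, phrasings, and field names are invented for illustration and are not drawn from any vendor's actual templates.

```python
# Slot-filling templates keyed by angle; phrasing and fields are invented.

def render(angle, points):
    templates = {
        "heroic_individual": "{player} carried the {team}, driving in {rbi} runs.",
        "back_and_forth": "The lead changed hands {lead_changes} times before the {team} prevailed.",
    }
    return templates[angle].format(**points)

print(render("heroic_individual", {"player": "Rivera", "team": "Tigers", "rbi": 4}))
```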

So, there you have it, the pipeline for a robot journalist: (1) ingest data, (2) compute newsworthy aspects of the data, (3) identify relevant angles and prioritize them, (4) link angles to story points, and (5) generate the output text.

Obviously there can be variations to this basic pipeline as well. Automated Insights, for example, uses randomization to provide variability in output stories and also incorporates a more sophisticated use of narrative tone in generating text. Based on a desired tone, different text might be generated to read as apathetic, confident, pessimistic, or enthusiastic. Yseop, on the other hand, uses techniques for augmenting templates with metadata so that they're more flexible. This allows templates, for instance, to conjugate verbs depending on the data being used. A post-generation analyzer (you might call it a robot editor) from Yseop further improves the style of a written text by looking for repeated words and substituting synonyms or alternate words.

From my reading, I'd have to say that the Narrative Science patent seems to be the most informed by journalism. It stresses the notion of newsworthiness and editorial judgment in crafting a narrative. But that's not to say that the stylistic innovations from Automated Insights and the template flexibility of Yseop aren't important. What still seems to be lacking, though, is a broader sense of newsworthiness beyond "deviance" in these algorithms. Harcup and O'Neill identified 10 modern newsworthiness values, each of which we might make an attempt at mimicking in code: reference to the power elite, reference to celebrities, entertainment, surprise, bad news, good news, magnitude (i.e. significance to a large number of people), cultural relevance to the audience, follow-up, and newspaper agenda. How might robot journalists evolve when they have a fuller palette of editorial intents available to them?

OpenVis is for Journalists!

Note: A version of the following also appears on the Tow Center blog.

Last week I attended the OpenVis Conference in Boston, a smorgasbord of learning dedicated to exploring the use and application of data visualization on the open web, so basically not using proprietary standards. It was hard not to get excited, with a headline keynote like Mike Bostock, the original creator of the popular D3 library for data visualization and now a graphics editor at the New York Times.

Given that news organizations are leading the way with online data storytelling, it was perhaps unsurprising to find a number of journalists presenting at the conference. Kennedy Elliot of the Washington Post talked about coding for the news, imploring attendees to think more like journalists. And we also heard from Lisa Strausfeld and Christopher Cannon who run the new Bloomberg Visual Data lab, and from Lena Groeger at ProPublica who spoke about “thinking small” in visualization.

But even the less overtly journalistic talks somehow seemed to have strong ties and implications for journalism, on everything from storytelling and authoring tools to analytic methodologies. Let me pick on just a few talks that exposed some particularly relevant implications for data journalism.

First up, David Mimno, a professor at Cornell, gave a tour of his work in visualizing machine learning algorithms online to help students learn how those algorithms work. He demonstrated old classics like k-means and linear regression, but the algorithms became palpable when you saw them come to life through animated visualizations. Another example of this comes from the machine learning demos page, which animates and presents an even greater number of algorithms. Where I think this gets really important for journalists is with the whole idea of algorithmic accountability, and the ability to use visualization as a way for journalists to be transparent about the algorithms they use in their reporting.
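The underlying trick for this kind of animation is simple: rather than running an algorithm to completion, you record each intermediate state and let the front end replay those states as frames. Here is a rough sketch of that idea for k-means; the clustering rule is the standard one, while the data and parameters are made up.

```python
import numpy as np

def kmeans_states(points, k=2, iterations=5, seed=0):
    """Run k-means while recording every intermediate set of centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    states = [centers.copy()]
    for _ in range(iterations):
        # assign each point to its nearest center
        labels = np.argmin(np.linalg.norm(points[:, None] - centers, axis=2), axis=1)
        # move each center to the mean of its assigned points (keep it if empty)
        centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        states.append(centers.copy())
    return states  # each snapshot becomes one frame of an animation

points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
for step, c in enumerate(kmeans_states(points)):
    print(step, np.round(c, 2))
```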

A good example of where this is already happening is the explanation of the NYT4thDownBot, where authors Brian Burke and Kevin Quealy use a visualization of a football field (shown below) to explain how their predictive model differs from what actual football coaches tend to do. To the extent that algorithms are deserving of our scrutiny, visualization methods that communicate what they are doing and somehow make them legible to the public seem incredibly powerful and important for us to work more on.

Alexander Howard recently wrote about "the difficult, complicated process of reporting on data as a source" while being as open and transparent as possible. If there's one thing the recent launch of 538 has taught us, it's that there's a need (and demand) to make the data, and even the code or models, available for data journalism projects.

People are already developing workflows and tools to make this possible online. Another great talk at OpenVis was by Dr. Jake Vanderplas, an astrophysicist working at the University of Washington, who has developed some really amazing open source technology that lets you create interactive D3 visualizations in the browser directly from IPython notebooks. Jake's work on visualization takes us one step closer to enabling a complete end-to-end workflow for data journalists: data, analysis, and code can sit in the browser and directly render interactive visualizations for the end user. The whole stack is transparent and could potentially even enable the user to tweak, tune, or test variations. To the extent that reproducibility of data journalism projects becomes important to maintaining the trust of the audience, these sorts of platforms are certainly worth learning more about.
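For a concrete sense of that workflow, here is a minimal sketch using mpld3, one of the open-source libraries to come out of this line of work, which converts a matplotlib figure into D3-backed HTML. The data is invented and the API may have evolved, so treat it as illustrative rather than definitive.

```python
import matplotlib.pyplot as plt
import mpld3  # pip install mpld3

fig, ax = plt.subplots()
ax.plot([2009, 2010, 2011, 2012, 2013], [3.1, 2.4, 2.8, 3.5, 3.2], marker="o")
ax.set_title("Some reported metric over time (illustrative data)")

# Convert the matplotlib figure into an interactive, D3-backed web page.
with open("chart.html", "w") as f:
    f.write(mpld3.fig_to_html(fig))

# Inside an IPython/Jupyter notebook you could instead call mpld3.display(fig)
# to render the interactive chart inline.
```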

Because of its emphasis on openness, and the relationship of that openness to transparency and to creating news content online, expect OpenVis to continue to develop next year as a key destination for journalists looking to learn more about visualization.

Computational Journalism and The Reporting of Algorithms

Note: A version of the following also appears on the Tow Center blog.

Software and algorithms have come to adjudicate an ever broader swath of our lives, including everything from search engine personalization and advertising systems, to teacher evaluation, banking and finance, political campaigns, and police surveillance.  But these algorithms can make mistakes. They have biases. Yet they sit in opaque black boxes, their inner workings, their inner “thoughts” hidden behind layers of complexity.

We need to get inside that black box, to understand how they may be exerting power on us, and to understand where they might be making unjust mistakes. Traditionally, investigative journalists have helped hold powerful actors in business or government accountable. But today, algorithms, driven by vast troves of data, have become the new power brokers in society. And the automated decisions of algorithms deserve every bit as much scrutiny as other powerful and influential actors.

Today the Tow Center publishes a new Tow/Knight Brief, “Algorithmic Accountability Reporting: On the Investigation of Black Boxes” to start tackling this issue. The Tow/Knight Brief presents motivating questions for why algorithms are worthy of our investigations, and develops a theory and method based on the idea of reverse engineering that can help parse how algorithms work. While reverse engineering shows promise as a method, it will also require the dedicated investigative talents of journalists interviewing algorithms’ creators as well. Algorithms are, after all, manifestations of human design.

If you're in NYC next week, folks from the New York Times R&D lab are pushing the idea forward in their Impulse Response Workshop. And if you're at IRE and NICAR's 2014 CAR Conference in Baltimore on Feb 28th, I'll be joined by Chase Davis, Frank Pasquale, and Jeremy Singer-Vine for an in-depth discussion on holding algorithms accountable. In the meantime, have a read of the paper, and let me know your thoughts, comments, and critiques.

Making Data More Familiar with Concrete Scales

Note: A version of the following also appears on the Tow Center blog.

 

As part of their coverage of the Snowden leaks, last month the Guardian published an interactive to help explain what the NSA data collection activities mean for the public. Above is a screenshot of part of the piece. It allows the user to input the number of friends they have on Facebook and see a typical number of 1st, 2nd (friends-of-friends), and 3rd (friends-of-friends-of-friends) degree connections as compared to places where you typically find different numbers of people. So 250 friends is more than the capacity of a subway car, 40,850 friends-of-friends is more than would fit in Fenway Park, and 6.7 million 3rd degree connections is bigger than the population of Massachusetts.

When we tell stories with data it can be hard for readers to grasp units or measures that are outside of normal human experience or outside of their own personal experience. How much *is* 1 trillion dollars, or 200 calories, really? Unless you're an economist or a nutritionist, respectively, it might be hard to say. Abstract measures and units can benefit from being made more concrete. The idea behind the Guardian interactive was to take something abstract, like a big number of people, and compare it to something more spatially familiar and tangible to help drive it home and make it real.

Researchers Fanny Chevalier, Romain Vuillemot, and Guia Gali have been studying the use of such concrete scales in visualization and recently published a paper detailing some of the challenges and practical steps we can use to more effectively employ these kinds of scales in data journalism and data visualization.

In the paper they describe a few different strategies for making concrete scales, including unitization, anchoring, and analogies. Shown in the figure below, (a) unitization is the idea of re-expressing one object in terms of a collection of objects that may be more familiar (e.g. the mass of Saturn is 97 times that of Earth); (b) anchoring uses a familiar object, like the size of a match head, to make the size of another unfamiliar object (e.g. a tick in this case) more concrete; and (c) analogies make parallel comparisons to familiar objects (e.g. an atom is to a marble as a human head is to the Earth).
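Unitization in particular is easy to sketch in code. The reference values below are rough, illustrative figures standing in for a carefully curated table of familiar units.

```python
# Rough, illustrative reference values; a real table would be curated carefully.
FAMILIAR_UNITS = {
    "subway car (people)": 246,
    "Fenway Park (people)": 37_000,
    "population of Massachusetts": 6_700_000,
}

def unitize(value, unit_name):
    """Re-express an unfamiliar magnitude in terms of a more familiar unit."""
    reference = FAMILIAR_UNITS[unit_name]
    return f"{value:,} is about {value / reference:.1f} x {unit_name}"

print(unitize(40_850, "Fenway Park (people)"))
print(unitize(6_700_000, "population of Massachusetts"))
```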

All of these techniques are really about visual comparison to the more familiar. But the familiar isn’t necessarily exact. For instance, if I were to compare the height of the Empire State Building to a number of people stacked up, I would need to use the average height of a person, which is really an idealized approximation. So it’s important to think about the precision of the visual comparisons you might be setting up with concrete scales.

Another strategy often used with concrete scales is containment, which can be useful to communicate impalpable volumes or collections of material. For example you might want to make visible the amount of sugar in different sizes of soda bottles by filling plastic bags with different amounts of granular sugar. Again, this is an approximate comparison but also makes it more familiar and material.

So, how can you design data visualizations to effectively use concrete scales? First you should ask if it’s an unfamiliar unit or whether it has an extreme magnitude that would make it difficult to comprehend. Then you need to find a good comparison unit that is more familiar to people. Does it make sense to unitize, anchor, or use an analogy? And if you use an anchor or container, which one should you choose? The answers to these questions will depend on your particular design situation as well as the semantics of the data you’re working with. A number of examples that the researchers have tagged are available online.

The individual nature of "what is familiar" also raises the question of personalizing concrete scales. Michael Keller's work for Al Jazeera lets you compare the number of refugees from the Syrian conflict to a geographic extent in the US, essentially letting the user's own familiarity with geography guide what area they want to compare as an anchor. What if this type of personalization could also be automated? Consider logging into Facebook or Twitter and the visualization adapting its concrete scales to the places, objects, or organizations you're most familiar with based on your profile information. This type of automated visualization adaptation could help make such visual depictions of data much more personally relevant and interesting.

Even though concrete scales are often used in data visualizations in the media, it's worth realizing that there are some open questions too. How do we define whether an anchor or unit is "familiar" or not, and what makes one concrete unit better than another? Perhaps some scales make people feel like they understand the visualization better, or help the reader remember the visualization better. These remain open questions for empirical research.

Storytelling with Data Visualization: Context is King

Note: A version of the following also appears on the Tow Center blog.

Data is like a freeze-dried version of reality, abstracted sometimes to the point where it can be hard to recognize and understand. It needs some rehydration before it becomes tasty (or even just palatable) storytelling material again — something that visualization can often help with. But to fully breathe life back into your data, you need to crack your knuckles and add a dose of written explanation to your visualizations as well. Text provides that vital bit of context layered over the data that helps the audience come to a valid interpretation of what it really means.

So how can you use text and visualization together to provide that context and layer a story over your data? Some recently published research by myself and collaborators at the University of Michigan offers some insights.

In most journalistic visualization, context is added to data visualization through the use of labels, captions, and other annotations — texts — of various kinds. Indeed, on the Economist Graphic Detail blog, visualizations not only have integrated textual annotations, but an entire 1-2 paragraph introductory article associated with them. In addition to adding an angle and story to the piece, such contextual journalism helps flesh out what the data means and guides the reader’s interpretation towards valid inferences from the data. Textual annotations integrated directly with a visualization can further guide the users’ interactions, emphasizing certain points, prioritizing particular interpretations of data, or pre-empting the user’s curiosity on seeing a salient outlier, aberration, or trend.

To answer the question of how textual annotations function as story contextualizers in online news visualization, we analyzed 136 professionally made news visualizations produced by the New York Times and the Guardian between 2000 and July 2012. Of course we found text used for everything from axis labels, author information, sources, and data provenance, to instructions, definitions, and legends, but we were less interested in studying these kinds of uses than in annotations that were more related to data storytelling.

Based on our analysis we recognized two underlying functions for annotations: (1) observational, and (2) additive. Observational annotations provide context by supporting reflection on a data value or group of values that are depicted in the visualization. These annotations facilitate comparisons and often highlight or emphasize extreme values or other outliers. For interactive graphics they are sometimes revealed when hovering over a visual element.

A basic form of observational messaging is apparent in the following example from the New York Times, showing the population pyramid in the U.S. On the right of the graphic, text clearly indicates observations of the total number and fraction of the population expected to be over age 65 by 2015. This is information that can be observed in the graph but is being reinforced through the use of text.

Another example from the Times shows how observational annotations can be used to highlight and label extremes on a graph. In the chart below, the U.S. budget forecast is depicted, and the low point of 2010 is highlighted with a yellow circle together with an annotation. The value and year of that point are already visible in the graph, which is what makes this kind of annotation observational. Consider using observational annotations when you want to underscore something that’s visible in the visualization, but which you really want to make sure the user sees, or when there is an interesting comparison that you would like to draw the user’s attention towards.

On the other hand, additive annotation provides context that is external to the visual representation and not clearly depicted via the data. These are things that are relevant to the topic or to understanding the data, like background or contemporaneous events or actions. It’s up to you to decide which dimensions of who, what, where, when, why, and how are relevant. If you think the viewer needs to be aware of something in order to interpret the data correctly, then an additive annotation might be appropriate.

The following example from The Minneapolis Star Tribune shows changes in home prices across counties in Minnesota with reference to the peak of the housing bubble, a key bit of additive annotation attached to the year 2007. At the same time, the graphic also uses observational annotation (on the right side) by labeling the median home price and percent change since 2007 for the selected county.

Use of these types of annotation is very prevalent; in our study of 136 examples we found 120 (88.2%) used at least one of these forms of annotation. We also looked at the relative use of each, shown in the next figure. Observational annotations were used in just shy of half of the cases, whereas additive annotations were used in 73%.

Another dimension to annotation is what scope of the visualization is being referenced: an individual datum, a group of data, or the entire view (e.g. a caption-like element). We tabulated the prevalence of these annotation anchors and found that single datum annotations are the most frequently used (74%). The relative usage frequencies are shown in the next figure. Your choice of what scope of the visualization to annotate will often depend on the story you want to tell, or on what kinds of visual features are most visually salient, such as outliers, trends, or peaks. For instance, trends that happen over longer time-frames in a line-graph might benefit from a group annotation to indicate how a collection of data points is trending, whereas a peak in a time-series would most obviously benefit from an annotation calling out that specific data point.

The two types of annotation, and three types of annotation anchoring are summarized in the following chart depicting stock price data for Apple. Annotations A1 and A2 show additive annotations attached to the whole view, and to a specific date in the view, whereas O1 and O2 show observational annotations attached to a single datum and a group of data respectively.

As we come to better understand how to tell stories with text and visualization together, new possibilities also open up for how to integrate text computationally or automatically with visualization.

In our research we used the above insights about how annotations are used by professionals to build a system that analyzes a stock time series (together with its trade volume data) to look for salient points and automatically annotate the series with key bits of additive context drawn from a corpus of news articles. By ranking relevant news headlines and then deriving graph annotations we were able to automatically generate contextualized stock charts and create a user-experience where users felt they had a better grasp of the trends and oscillations of the stock.
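The full Contextifier pipeline is described in the paper, but the general shape of the idea can be sketched simply: detect salient points in the series, then attach the nearest relevant headline. The salience rule and data below are simplified assumptions of mine, not the method from the paper.

```python
# Simplified sketch of salience detection plus headline attachment.

def salient_days(prices, threshold=0.05):
    """Days where the price moved more than `threshold` relative to the prior day."""
    return [
        day for day, (prev, curr) in enumerate(zip(prices, prices[1:]), start=1)
        if abs(curr - prev) / prev > threshold
    ]

def annotate(prices, headlines):
    """headlines: dict mapping day index -> headline text."""
    annotations = {}
    for day in salient_days(prices):
        # pick the headline published closest in time to the salient day
        nearest = min(headlines, key=lambda d: abs(d - day))
        annotations[day] = headlines[nearest]
    return annotations

prices = [100, 101, 100, 108, 107, 99]
headlines = {3: "Company beats earnings estimates", 5: "CEO steps down"}
print(annotate(prices, headlines))
```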

On one hand we have the fully automated scenario, but in the future, more intelligent graph authoring tools for journalists might also incorporate such automation to suggest possible annotations for a graph, which an editor could then tweak or re-write before publication. So not only can the study of news visualizations help us understand the medium better and communicate more effectively, but it can also enable new forms of computational journalism to emerge. For all the details please see our research paper, “Contextifier: Automatic Generation of Annotated Stock Visualizations.”

Algorithmic Defamation: The Case of the Shameless Autocomplete

Note: A version of the following also appears on the Tow Center blog.

In Germany, a man recently won a legal battle with Google over the fact that when you searched for his name, the autocomplete suggestions connected him to "scientology" and "fraud" — two things that he felt had defamatory insinuations. As a result of losing the case, Google is now compelled to remove defamatory suggestions from autocomplete results when notified, in Germany at least.

Court cases arising from autocomplete defamation aren't just happening in Germany though. In other European countries like Italy, France, and Ireland, and as far afield as Japan and Australia, people (and corporations) have brought suit alleging these algorithms defamed them by linking their names to everything from crime and fraud to bankruptcy or sexual conduct. In some cases such insinuations can have real consequences for finding jobs or doing business. New services, such as brand.com's "Google Suggest Plan," have even arisen to help people manipulate and thus avoid negative connotations in search autocompletions.

The Berkman Center’s Digital Media Law Project (DMLP) defines a defamatory statement generally as, “a false statement of fact that exposes a person to hatred, ridicule or contempt, lowers him in the esteem of his peers, causes him to be shunned, or injures him in his business or trade.” By associating a person’s name with some unsavory behavior it would seem indisputable that autocomplete algorithms can indeed defame people.

So if algorithms like autocomplete can defame people or businesses, our next logical question might be to ask how to hold those algorithms accountable for their actions. Considering the scale and difficulty of monitoring such algorithms, one approach would be to use more algorithms to keep tabs on them and try to find instances of defamation hidden within their millions (or billions) of suggestions.

To try out this approach I automatically collected data on both Google and Bing autocompletions for a number of different queries relating to public companies and politicians. I then filtered these results against keyword lists relating to crime and sex in order to narrow in on potential cases of defamation. I used a list of the corporations on the S&P 500 to query the autocomplete APIs with the following templates, where “X” is the company name: “X,” “X company,” “X is,” “X has,” “X company is,” and “X company has.” And I used a list of U.S. congresspeople from the Sunlight Foundation to query for each person’s first and last name, as well as adding either “representative” or “senator” before their name. The data was then filtered using a list of sex-related keywords, and words related to crime collected from the Cambridge US dictionary in order to focus on a smaller subset of the almost 80,000 autosuggestions retrieved.
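The filtering step itself is straightforward to sketch. The keyword set and templates below are abbreviated stand-ins for the full lists described above.

```python
# Abbreviated keyword list and query templates for illustration only.
CRIME_WORDS = {"fraud", "scam", "theft", "robbery", "bribery", "corruption"}
QUERY_TEMPLATES = ["{name}", "{name} company", "{name} is", "{name} has"]

def queries_for(name):
    """Expand a company name into the set of query strings to send to the APIs."""
    return [t.format(name=name) for t in QUERY_TEMPLATES]

def flag_suggestions(suggestions):
    """Return suggestions that contain any crime-related keyword."""
    return [s for s in suggestions if any(w in s.lower().split() for w in CRIME_WORDS)]

print(queries_for("Torchmark"))
# e.g., suggestions previously collected from an autocomplete API for one query:
collected = ["torchmark corporation careers", "torchmark corporation job scam"]
print(flag_suggestions(collected))   # -> ['torchmark corporation job scam']
```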

Among the corporate autocompletions that I filtered and reviewed, there were twenty-four instances that could be read as statements or assertions implicating the company in everything from corruption and scams to fraud and theft. For instance, querying Bing for “Torchmark” returns as the second suggestion, “torchmark corporation job scam.” Without really digging deeply it’s hard to tell if Torchmark corporation is really involved in some form of scam, or if there’s just some rumors about scam-like emails floating around. If those rumors are false, this could indeed be a case of defamation against the company. But this is a dicey situation for Bing, since if they filtered out a rumor that turned out to be true it might appear they were trying to sweep a company’s unsavory activities under the rug. People would ask: Is Bing trying to protect this company? At the same time they would be doing a disservice to their users by not steering them clear of a scam.

While looking through the autocompletions returned from querying for congresspeople it became clear that a significant issue here relates to name collisions. For relatively generic congressperson names like “Gerald Connolly” or “Joe Barton” there are many other people on the internet with the same names. And some of those people did bad things. So when you Google for “Gerald Connolly” one suggestion that comes up is “gerald connolly armed robbery,” not because Congressman Gerald Connolly robbed anyone but because someone else in Canada by the same name did. If you instead query for “representative Gerald Connolly” the association goes away; adding “representative” successfully disambiguates the two Connollys. The search engine has it tough though: Without a disambiguating term how should it know you’re looking for the congressman or a robber? There are other cases that may be more clear-cut instances of defamation, such as on Bing “Joe Barton” suggesting “joe barton scam” which was not corrected when adding the title “representative” to the front of the query. That seems to be more of a legitimate instance of defamation since even with the disambiguation it’s still suggesting the representative is associated with a scam. And with a bit more searching around it’s also clear there is a scam related to a Joe Barton, just not the congressman.

Some of the unsavory things that might hurt someone’s reputation in autocomplete suggestions could be true though. For instance, an autocompletion for representative “Darrell Issa” to “Darrell Issa car theft” is a correct association arising from his involvement with three separate car theft cases (for which his brother ultimately took the rap). To be considered defamation, the statement must actually be false, which makes it that much harder to write an algorithm that can find instances of real defamation. Unless algorithms can be developed that can detect rumor and falsehood, you’ll always need a person assessing whether an instance of potential defamation is really valid. Still, such tips on what might be defamatory can help filter and focus attention.

Understanding defamation from a legal standpoint brings in even more complexity. Even something that seems, from a moral point of view, defamatory might not be considered so by a court of law. Each state in the U.S. is a bit different in how it governs defamation. A few key nuances relevant to the court’s understanding of defamation relate to perception and intent.

First of all, a statement must be perceived as fact and not opinion in order to be considered defamation by the court. So how do people read search autocompletions? Do they see them as collective opinions or rumors reflecting the zeitgeist, or do they perceive them as statements of fact because of their framing as results from an algorithm? As far as I know this is an open question for research. If autocompletions are read as opinion, then it might be difficult to ever win a defamation case in the U.S. against such an algorithm.

For defamation suits against public figures, intent also becomes an important factor to consider. The plaintiff must prove "actual malice" with regards to the defamatory statement, which means that a false statement was published either with actual knowledge of its falsity, or reckless disregard for its falsity. But can an algorithm ever be truly malicious? If you use the argument that autocompletions are just aggregations of what others have already typed in, then actual malice could certainly arise from a group of people systematically manipulating the algorithm. Otherwise, the algorithm would have to have some notion of truth, and be "aware" that it was autocompleting something inconsistent with its knowledge of that truth. This could be especially challenging for things whose truth changes over time, or for rumors which may have a social consensus but still be objectively false. So while there have been attempts at automating factchecking, I think this is still a long way off.

Of course this may all be moot under Section 230 of the Communications Decency Act, which states that, “no provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.” Given that search autocompletions are based on queries that real people at one time typed into a search box, it would seem Google has a broad protection under the law against any liability from republishing those queries as suggestions. It’s unclear though, at least to me, if recombining and aggregating data from millions of typed queries can really be considered “re-publishing” or if it should rather be considered publishing anew. I suppose it would depend on the degree of transformation of the input query data into suggestions.

Whether it’s Google’s algorithms creating new snippets of text as autocomplete suggestions, or Narrative Science writing entire articles from data, we’re entering a world where algorithms are synthesizing communications that may in some cases run into moral (or legal) considerations like defamation. In print we call defamation libel; when orally communicated we call it slander. We don’t yet have a word for the algorithmically reconstituted defamation that arises when millions of non-public queries are synthesized and publicly published by an aggregative intermediary. Still, we might try to hold such algorithms to account, by using yet more algorithms to systematically assess and draw human attention to possible breaches of trust. It may be some time yet, if ever, when we can look to the U.S. court system for adjudication.

Sex, Violence, and Autocomplete Algorithms: Methods and Context

In my Slate article “Sex, Violence, and Autocomplete Algorithms,” I use a reverse-engineering methodology to better understand what kinds of queries get blocked by Google and Bing’s autocomplete algorithms. In this post I want to pull back the curtains a bit to talk about my process as well as add some context to the data that I gathered for the project.

To measure what kinds of sex terms get blocked I first found a set of sex-related words that are part of a larger dictionary called LIWC (Linguistic Inquiry and Word Count) which includes painstakingly created lists of words for many different concepts like perception, causality, and sex among others. It doesn’t include a lot of slang though, so for that I augmented my sex-word list with some more gems pulled from the Urban Dictionary, resulting in a list of 110 words. The queries I tested included the word by itself, as well as in the phrase “child X” in an attempt to identify suggestions related to child pornography.

For the violence-related words that I tested, I used a set of 348 words from the Random House “violent actions” list, which includes everything from the relatively innocuous “bop” to the more ruthless “strangle.” To construct queries I put the violent words into two phrases: “How to X” and “How can I X.”

Obviously there are many other words and permutations of query templates that I might have used. One of the challenges with this type of project is how to sample data and where to draw the line on what to collect.

With lists of words in hand the next step was to prod the APIs of Google and Bing to see what kind of autocompletions were returned (or not) when queried. The Google API for autocomplete is undocumented, though I found and used some open-source code that had already reverse engineered it. The Bing API is similarly undocumented, but a developer thread on the Bing blog mentions how to access it. I constructed each of my query words and templates and, using these APIs, recorded what suggestions were returned.
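For reference, the collection step might look something like the sketch below. The URL shown is the commonly cited undocumented Google suggest endpoint; because it is undocumented it may change or behave differently from what I queried, so treat this purely as illustrative.

```python
import json
import urllib.parse
import urllib.request

def google_suggestions(query):
    """Fetch autocomplete suggestions for a query from the (undocumented) endpoint."""
    url = ("https://suggestqueries.google.com/complete/search?client=firefox&q="
           + urllib.parse.quote(query))
    with urllib.request.urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8", errors="replace"))
    return payload[1]   # response shape: [query, [suggestion, suggestion, ...]]

templates = ["how to {w}", "how can i {w}"]
for word in ["bop", "strangle"]:           # two entries from the violence word list
    for template in templates:
        query = template.format(w=word)
        print(query, "->", google_suggestions(query))
```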

An interesting nuance to the data I collected is that both APIs return more responses than actually show up in either user interface. The Google API returns 20 results, but only shows 4 or 10 in the UI depending on how preferences are set. The Bing API returns 12 results but only shows 8 in the UI. Data returned from the API that never appears in the UI is less interesting since users will never encounter it in their daily usage. But, I should mention that it’s not entirely clear what happens with the API results that aren’t shown. It’s possible some of them could be shown during the personalization step of the algorithm (which I didn’t test).

The queries were run and data collected on July 2nd, 2013, which is important to mention since these services can change without notice. Indeed, Google claims to change its search algorithm hundreds of times per year. Autocomplete suggestions can also vary by geography or according to who’s logged in. Since the APIs were accessed programmatically, and no one was logged in, none of the results collected reflect any personalization that the algorithm performs. However, the results may still reflect geography since figuring out where your computer is doesn’t require a log in. The server I used to collect data is located in Delaware. It’s unclear how Google’s “safe search” settings might have affected the data I collected via their API. The Bing spokesperson I was in touch with wrote, “Autosuggest adheres to a ‘strict’ filter policy for all suggestions and therefore applies filtering to all search suggestions, regardless of the SafeSearch settings for the search results page.”

In the spirit of full transparency, here is a .csv of all of the queries and responses that I collected.

The Rhetoric of Data

Note: A version of the following also appears on the Tow Center blog.

In the 1830s, abolitionists discovered the rhetorical potential of re-conceptualizing southern newspaper advertisements as data. They "took an undifferentiated pile of ads for runaway slaves, wherein dates and places were of primary importance … and transformed them into data about the routine and accepted torture of enslaved people," writes Ellen Gruber Garvey in the book Raw Data is an Oxymoron. By creating topical dossiers of ads, the horrors of slavery were catalogued and made accessible for writing abolitionist speeches and novels. The South's own media had been re-contextualized into a persuasive weapon against itself, a rhetorical tool to bolster the abolitionists' arguments.

The Latin etymology of “data” means “something given,” and though we’ve largely forgotten that original definition, it’s helpful to think about data not as facts per se, but as “givens” that can be used to construct a variety of different arguments and conclusions; they act as a rhetorical basis, a premise. Data does not intrinsically imply truth. Yes we can find truth in data, through a process of honest inference. But we can also find and argue multiple truths or even outright falsehoods from data.

Take for instance the New York Times interactive, "One Report, Diverging Perspectives," which wittingly highlights this issue. Shown below, the piece visualizes jobs and unemployment data from two perspectives, emphasizing the differences in how a Democrat or a Republican might see and interpret the statistics. A rising tide of "data PR," often manifesting as slick and pointed infographics, won't be so upfront about the perspectives being argued though. Advocacy organizations can now collect their own data, or just develop their own arguments from existing data to support their cause. What should you be looking out for as a journalist when assessing a piece of data PR? And how can you improve your own data journalism by ensuring the argument you develop is a sound one?


Contextual journalism—adding interpretation or explanation to a story—can and should be applied to data as much as to other forms of reporting. It’s important because the audience may need to know the context of a dataset in order to fully understand and evaluate the larger story in perspective. For instance, context might include explaining how the data was collected, defined, and aggregated, and what human decision processes contributed to its creation. Increasingly news outlets are providing sidebars or blog posts that fully describe the methodology and context of the data they use in a data-driven story. That way the context doesn’t get in the way of the main narrative but can still be accessed by the inquisitive reader.

In your process it can be useful to ask a series of contextualizing questions about a dataset, whether just critiquing the data, or producing your own story.

Who produced the data and what was their intent? Did it come from a reputable source, like a government or inter-governmental agency such as the UN, or was it produced by a third party corporation with an uncertain source of funding? Consider the possible political or advocacy motives of a data provider as you make inferences from that data, and do some reporting if those motives are unclear.

When was the data collected? Sometimes there can be temporal drift in what data means, how it’s measured, or how it should be interpreted. Is the age of your data relevant to your interpretation? For example, in 2010 the Bureau of Labor Statistics changed the definition of long-term unemployment, which can make it important to recognize that shift when comparing data from before and after the change.

Most importantly, it's necessary to ask what is measured in the data, how it was sampled, and what is ultimately depicted. Are data measurements defined accurately and in a way that they can be consistently measured? How was the data sampled from the world? Is the dataset comprehensive or is it missing pieces? If the data wasn't randomly sampled, how might that reflect a bias in your interpretation? Or have other errors been introduced into the data, for instance through typos or faulty OCR? Is there uncertainty in the data that should be communicated to the reader? Has the data been cropped or filtered in a way that loses a potentially important piece of context that would change its interpretation? And what about aggregation or transformation? If a dataset is offered to you with only averages or medians (i.e. aggregations), you're necessarily missing information about how the data might be distributed, or about outliers that might make interesting stories. For data that's been transformed through some algorithmic process, such as classification, it can be helpful to know the error rates of that transformation, as these can introduce additional uncertainty into the data.

Let’s consider an example that illustrates the importance of measurement definition and aggregation. The Economist graphic below shows the historic and forecast vehicle sales for different geographies. The story the graph tells is pretty clear: Sales in China are rocketing up while they’re declining or stagnant in North America and Europe. But look more closely. The data for Western Europe and North America is defined as an aggregation of light vehicle sales, according to the note in the lower-right corner. How would the story change if the North American data included truck, SUV, and minivan sales? The story you get from these kinds of data graphics can depend entirely on what’s aggregated (or not aggregated) together in the measure. Aggregations can serve as a tool of obfuscation, whether intentional or not.


It’s important to recognize and remember that data does not equal truth. It’s rhetorical by definition and can be used for truth finding or truth hiding. Being vigilant in how you develop arguments from data and showing the context that leads to the interpretation you make can only help raise the credibility of your data-driven story.

 

Data on the Growth of CitiBike

On May 27th New York City launched its city-wide bike sharing program, CitiBike. I tried it out last weekend; it was great, aside from a few glitches checking out and checking in the bikes. It made me curious about the launch of the program and how it's growing, especially since the agita between bikers and drivers is becoming quite palpable. Luckily, the folks over at the CitiBike blog have been posting daily stats about the number of rides every day, average duration of rides, and even the most popular station for starting and stopping a ride. If you're interested in hacking more on the data there's even a meetup happening next week.

Below is my simple line chart of the total number of daily riders (they measure that as of 5pm that day). Here's the data. You might look at the graph and wonder, "What happened June 7th?" That was the monsoon we had. Yeah, turns out bikers don't like rain.
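For anyone who wants to rebuild the chart from the blog's daily stats, a minimal version looks something like this; the ridership numbers are placeholders rather than the actual figures.

```python
import matplotlib.pyplot as plt
from datetime import date, timedelta

start = date(2013, 5, 27)                     # CitiBike launch day
days = [start + timedelta(days=i) for i in range(14)]
riders = [6000, 9000, 12000, 14000, 15000, 16000, 17000,
          18000, 20000, 21000, 22000, 2500, 19000, 23000]  # placeholders; the dip
                                                           # stands in for June 7th

plt.plot(days, riders)
plt.title("Daily CitiBike riders (as of 5pm)")
plt.ylabel("Riders")
plt.gcf().autofmt_xdate()                     # tilt the date labels for readability
plt.savefig("citibike_riders.png")
```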
