51% Foreign: Algorithms and the Surveillance State

In New York City there’s a “geek squad” of analysts that gathers all kinds of data, from restaurant inspection grades and utility usage to neighborhood complaints, and uses it to predict how to improve the city. The idea behind the team is that with more and more data available about how the city is running—even if it’s messy, unstructured, and massive—the government can optimize its resources by keeping an eye out for what needs its attention most. It’s really about city surveillance, and of course acting on the intelligence produced by that surveillance.

One story about the success of the geek squad comes to us from Viktor Mayer-Schonberger and Kenneth Cukier in their book “Big Data”. They describe the issue of illegal real-estate conversions, which involves sub-dividing an apartment into smaller and smaller units so that it can accommodate many more people than it should. With the density of people in such close quarters, illegally converted units are more prone to accidents, like fire. So it’s in the city’s—and the public’s—best interest to make sure apartment buildings aren’t sub-divided like that. Unfortunately there aren’t very many inspectors to do the job. But by collecting and analyzing data about each apartment building the geek squad can predict which units are more likely to pose a danger, and thus determine where the limited number of inspectors should focus their attention. Seventy percent of inspections now lead to eviction orders from unsafe dwellings, up from 13% without using all that data—a clear improvement in helping inspectors focus on the most troubling cases.

Consider a different, albeit hypothetical, use of big data surveillance in society: detecting drunk drivers. Since there are already a variety of road cameras and other traffic sensors available on our roads, it’s not implausible to think that all of this data could feed into an algorithm that says, with some confidence, that a car is exhibiting signs of erratic, possibly drunk driving. Let’s say, similar to the fire-risk inspections, that this method also increases the efficiency of the police department in getting drunk drivers off the road—a win for public safety.

But there’s a different framing at work here. In the fire-risk inspections the city is targeting buildings, whereas in the drunk driving example it’s really targeting the drivers themselves. This shift in framing, targeting the individual as opposed to the inanimate, crosses the line into invasive, even creepy, civil surveillance.

So given the degree to which the recently exposed government surveillance programs target individual communications, it’s not as surprising that, according to Gallup, more Americans disapprove (53%) than approve (37%) of the federal government’s program to “compile telephone call logs and Internet communications.” This is despite the fact that such surveillance could in a very real way contribute to public safety, just as with the fire-risk or drunk driving inspections.

At the heart of the public’s psychological response is the fear and risk of surveillance uncovering personal communication, of violating our privacy. But this risk is not a foregone conclusion. There’s some uncertainty and probability around it, which makes it that much harder to understand the real risk. In the Prism program, the government surveillance program that targets internet communications like email, chats, and file transfers, the Washington Post describes how analysts use the system to “produce at least 51 percent confidence in a target’s ‘foreignness’”. This test of foreignness is tied to the idea that it’s okay (legally) to spy on foreign communications, but that it would breach FISA (the Foreign Intelligence Surveillance Act), as well as 4th amendment rights, for the government to do the same to American citizens.

Platforms used by Prism, such as Google and Facebook, have denied that they give the government direct access to their servers. The New York Times reported that the system in place is more like having a locked mailbox where the platform can deposit specific data requested pursuant to a court order from the Foreign Intelligence Surveillance Court. But even if such requests are legally targeted at foreigners and have been faithfully vetted by the court, there’s still a chance that ancillary data on American citizens will be swept up by the government. “To collect on a suspected spy or foreign terrorist means, at minimum, that everyone in the suspect’s inbox or outbox is swept in,” as the Washington Post writes. And typically data is collected not just of direct contacts, but also contacts of contacts. This all means that there’s a greater risk that the government is indeed collecting data on many Americans’ personal communications.

Algorithms, and a bit of transparency on those algorithms, could go a long way to mitigating the uneasiness over domestic surveillance of personal communications that American citizens may be feeling. The basic idea is this: when collecting information on a legally identified foreign target, for every possible contact that might be swept up with the target’s data, an automated classification algorithm can be used to determine whether that contact is more likely to be “foreign” or “American”. Although the algorithm would have access to all the data, it would only output one bit of metadata for each contact: is the contact foreign or not? Only if the contact was deemed highly likely to be foreign would the details of that data be passed on to the NSA. In other words, the algorithm would automatically read your personal communications and then signal whether or not it was legal to report your data to intelligence agencies, much in the same way that Google’s algorithms monitor your email contents to determine which ads to show you without making those emails available for people at Google to read.
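To make the one-bit idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Contact` record, the `foreign_score` field (standing in for whatever a real classifier would output), and the 0.9 cutoff are illustrative assumptions, not a description of any actual system.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    contact_id: str
    foreign_score: float  # hypothetical classifier output in [0, 1]


def is_foreign(contact: Contact, threshold: float = 0.9) -> bool:
    """The single bit the filter emits for each contact."""
    return contact.foreign_score >= threshold


def release_to_analysts(contacts, threshold: float = 0.9):
    """Pass along only the contacts deemed highly likely to be foreign."""
    return [c.contact_id for c in contacts if is_foreign(c, threshold)]


# Made-up scores for three contacts swept up with a target.
contacts = [Contact("a", 0.93), Contact("b", 0.40), Contact("c", 0.51)]
released = release_to_analysts(contacts)
```

With these made-up scores only contact "a" clears the cutoff; the details of "b" and "c" would never leave the filter.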

The FISA court implements a “minimization procedure” in order to curtail incidental data collection from people not covered in the order, though the exact process remains classified. Marc Ambinder suggests that, “the NSA automates the minimization procedures as much as it can” using a continuously updated score that assesses the likelihood that a contact is foreign.  Indeed, it seems at least plausible that the algorithm I suggest above could already be a part of the actual minimization procedure used by NSA.

The minimization process reduces the creepiness of unfettered government access to personal communications, but at the same time we still need to know how often such a procedure makes mistakes. In general there are two kinds of mistakes that such an algorithm could make, often referred to as false positives and false negatives. A false negative in this scenario would indicate that a foreign contact was categorized by the algorithm as an American. Obviously the NSA would like to avoid this type of mistake since it would lose the opportunity to snoop on a foreign terrorist. The other type of mistake, false positive, corresponds to the algorithm designating a contact as foreign even though in reality it’s American. The public would want to avoid this type of mistake because it’s an invasion of privacy and a violation of the 4th amendment. Both of these types of errors are shown in the conceptual diagram below, with the foreign target marked with an “x” at the center and ancillary targets shown as connected circles (orange is foreign, blue is American citizen).


It would be a shame to disregard such a potentially valuable tool simply because it might make mistakes from time to time. To make such a scheme work we first need to accept that the algorithm will indeed make mistakes. Luckily, such an algorithm can be tuned to make more or fewer of either of those mistakes. As false positives are tuned down false negatives will often increase, and vice versa. The advantage for the public would be that it could have a real debate with the government about what magnitude of mistakes is reasonable. How many Americans being labeled as foreigners and thus subject to unwarranted search and seizure is acceptable to us? None? Some? And what’s the trade-off in terms of how many would-be terrorists might slip through if we tuned the false positives down?
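The tuning trade-off can be illustrated with a small sketch. The scores and labels below are invented for illustration, and `count_errors` is a hypothetical helper; the point is only that sliding a single threshold moves mistakes from one type to the other.

```python
def count_errors(scored_contacts, threshold):
    """Count both kinds of mistake at a given threshold.

    scored_contacts: list of (foreign_score, truly_foreign) pairs.
    A contact is labeled foreign when its score is >= threshold.
    """
    false_positives = sum(
        1 for score, foreign in scored_contacts
        if score >= threshold and not foreign  # American labeled foreign
    )
    false_negatives = sum(
        1 for score, foreign in scored_contacts
        if score < threshold and foreign  # foreigner labeled American
    )
    return false_positives, false_negatives


# Invented data: (classifier score, whether the contact really is foreign).
sample = [
    (0.95, True), (0.80, True), (0.65, False), (0.55, True),
    (0.45, False), (0.30, True), (0.10, False),
]

for threshold in (0.25, 0.50, 0.75):
    fp, fn = count_errors(sample, threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy data, raising the threshold from 0.25 to 0.75 takes false positives from 2 down to 0 while false negatives climb from 0 to 2, which is the fulcrum the public debate would be about.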

To begin a debate like this the government just needs to tell us how many of each type of mistake its minimization procedure makes; just two numbers. In this case, minimal transparency of an algorithm could allow for a robust public debate without betraying any particular details or secrets about individuals. In other words, we don’t particularly need to know the gory details of how such an algorithm works. We simply need to know where the government has placed the fulcrum in the tradeoff between these different types of errors. And by implementing smartly transparent surveillance maybe we can even move more towards the world of the geek squad, where big data is still ballyhooed for furthering public safety.

To Save Everything, Deliberate it Endlessly?

Evgeny Morozov’s book To Save Everything, Click Here is a worthwhile tour de force of technology criticism that will have you double-taking on everything you hold near and dear about the Internet. The book’s basic premise is that of a polemic against the ideas of “solutionism” (i.e., the tendency to apply efficiency-oriented, engineering fixes to societal problems) and “internet centrism” (i.e., the treatment of the internet as an infallible, ever-positive force on humanity). He covers a gamut, raising flags of caution and moral suspicion on everything from openness and transparency, to algorithms in the media, predictive policing, the quantified self, nudging, and gamification, among many others. Sardonic and bombastic as it sometimes reads, it’s quite well-written, wittingly exposing some useful critiques of our modern techno-lust culture.

As Morozov deftly points out through his many examples, once we realize that designed technologies embed values and moral judgements, we can begin to make decisions about our designed environment and society that reflect the values and morals that we deem respectful to humanity, not just for corporations or other stakeholders. He’s on the side of the people! It’s really about human dignity in the way our designed world influences both individual and collective behavior. This main thread of thinking reminds me of Batya Friedman’s work on value-sensitive design, which attempts to account for human values in a comprehensive manner throughout the design process by identifying stakeholders, benefits, values, and value conflicts to help inform design decisions.

Unfortunately the internal consistency of the book comes under some tension during the last couple chapters, when Morozov tackles the issues of nudging, the information diet, and his own solution to encouraging more deliberation and reflection.

Morozov positions nudging as “solutionism by other means.” He argues that to nudge assumes a social consensus, which may or may not in fact exist, both in terms of what is nudged as well as in which direction. The nudge assumes something is askew, which can and should be brought back into harmony. One nudge you might consider is to encourage the public to consume a more nutritious “information diet” (a la Clay Johnson’s book of the same title). But Morozov positions Johnson’s ideas as “a fairly traditional critique of how the public allocates attention to news,” the end result of which espouses the ideal that citizens should stay informed about every possible issue—clearly an impossibility. The reality, if you agree with Walter Lippmann, reads differently: citizens don’t want to know everything about everything, nor do they have time to, which is why they delegate. In critiquing nudging and the idea of the omniscient citizen, Morozov sides with Lippmann: nudging people to be experts on everything is futile.

But this is where we find the tension with what is offered in the final chapter of the book. The “solution” proffered for “solutionism” and “internet centrism” is to replace the “fetish for psychology” with a penchant for moral and political philosophy and a desire to encourage healthy reflective deliberation by everyday users on the designs of technologies affecting society. I do agree with the general desire for more reflection in the technologies we build. But to suggest, as he does, that to do so we should design technologies to encourage users to be more reflective and deliberative is still just nudging. Moreover, his rejection of omnicompetence contradicts his argument for nudging citizens to be more deliberative, since how could we expect citizens to be expert and care enough to deliberate on everything? Criticizing nudging and omnicompetence and then offering them as a way forward suggests that Morozov’s real gripe is that the values embedded in nudging, as well as the solutions offered by Silicon Valley and indeed the internet itself, are simply not his own.

Just as not every citizen is part of every public that emerges around an issue, not every citizen needs to reflect and deliberate on every given technology in society. The interested parties will deliberate, then a design will be fashioned, and the rest of society will delegate to that design, or any number of other designs. Putting on my user-experience designer hat, I believe that incessantly confronting end-users with philosophical dilemmas will ultimately prove unproductive in many contexts; people need to actually use these things, to accomplish real tasks. Can you imagine the design of an airline cockpit that constantly confronts the pilot with philosophical choices? Crash. It’s true that, in Morozov’s words, “We need to develop a better way of evaluating, comparing, and discriminating across technological fixes,” but the locus for that activity will often fall on the design side of the equation. Detailed design rationale can then make this accountable and legible to any interested public that may emerge.

Under some circumstances it may indeed make sense to facilitate additional reflection in users, but what’s lacking in the book is a solid treatment of the limitations of Morozov’s approach. When should we design for deliberation, and when should we design for efficiency? Morozov has shown us some of the things we miss when we over-emphasize design for efficiency, but not, unfortunately, what we may miss by overemphasizing design for deliberation.

Storytelling with Data: What Are the Impacts on the Audience?

Storytelling with data visualization is still very much in its “Wild West” phase, with journalism outlets blazing new paths in exploring the burgeoning craft of integrating the testimony of data together with compelling narrative. Leaders such as The New York Times create impressive data-driven presentations like 512 Paths to the White House (seen above) that weave complex information into a palatable presentation. But as I look out at the kinds of meetings where data visualizers converge, like Eyeo, Tapestry, OpenVis, and the infographics summit Malofiej, I realize there’s a whole lot of inspiration out there, and some damn fine examples of great work, but I still find it hard to get a sense of direction: which way is West, which way to the promised land?

And it occurred to me: We need a science of data-visualization storytelling. We need some direction. We need to know what makes a data story “work”. And what does a data story that “works” even mean?

Examples abound, and while we have theories for color use, visual salience and perception, and graph design that suggest how to depict data efficiently, we still don’t know, with any particular scientific rigor, which are better stories. At the Tapestry conference, which I attended, journalists such as Jonathan Corum, Hannah Fairfield, and Cheryl Phillips whipped out a staggering variety of examples in their presentations. Jonathan, in his keynote, talked about “A History of the Detainee Population,” an interactive NYT graphic (partially excerpted below) depicting how Guantanamo prisoners have, over time, slowly been moved back to their country of origin. I would say that the presentation is effective. I “got” the message. But I also realize that, because the visualization is animated, it’s difficult to see the overall trend over time and to compare one year to the next. There are different ways to tell this story, some of which may be more effective than others for a range of storytelling goals.


Critical blogs such as The Why Axis and Graphic Sociology have arisen to try to fill the gap of understanding what works and what doesn’t. And research on visualization rhetoric has tried to situate narrative data visualization in terms of the rhetorical techniques authors may use to convey their story. Useful as these efforts are in their thick description and critical analysis, and for increasing visual literacy, they don’t go far enough toward building predictive theories of how data-visualization stories are “read” by the audience at large.

Corum, a graphics editor at NYT, has a descriptive framework to explain his design process and decisions. It describes the tensions between interactivity and story, between oversimplification and overwhelming detail, and between exploration and decoration. Other axes of design include elements such as focus versus depth and the author versus the audience. Author and educator Alberto Cairo exhibits similar sets of design dimensions in his book, “The Functional Art,” which start to trace the features along which data-visualization stories can vary (recreated below).

[Figure: visualization wheel of design dimensions, recreated from “The Functional Art”]

Such descriptions are a great starting point, but to make further progress on interactive data storytelling we need to know which of the many experiments happening out in the wild are having their desired effect on readers. Design decisions like how and where annotations are placed on a visualization, how the story is structured across the canvas and over time, the graphical style including things like visual embellishments and novelties, as well as data mapping and aggregation can all have consequences on how the audience perceives the story. How does the effect on the audience change when modulating these various design dimensions? A science of data-visualization storytelling should seek to answer that question.

But still the question looms: What does a data story that “works” even mean? While efficiency and parsimony of visual representation may still be important in some contexts, I believe online storytelling demands something else. What effects on the audience should we measure? As data visualization researcher Robert Kosara writes in his forthcoming IEEE Computer article on the subject, “there are no clearly defined metrics or evaluation methods … Developing these will require the definition of, and agreement on, goals: what do we expect stories to achieve, and how do we measure it?”

There are some hints in recent research in information visualization for how we might evaluate visualizations that communicate or present information. We might for instance ask questions about how effectively a message is acquired by the audience: Did they learn it faster or better? Was it memorable, or did they forget it 5 minutes, 5 hours, or 5 weeks later? We might ask whether the data story spurred any personal insights or questions, and to what degree users were “engaged” with the presentation. Engaged here could mean clicks and hovers of the mouse on the visualization, how often widgets and filters for the presentation were touched, or even whether users shared or conversed around the visualization. We might ask if users felt they understood the context of the data and if they felt confident in their interpretation of the story: Did they feel they could make an informed decision on some issue based on the presentation? Credibility being an important attribute for news outlets, we might wonder whether some data story presentations are more trustworthy than others. In some contexts a presentation that is persuasive is the most important factor. Finally, since some of the best stories are those that evoke emotional responses, we might ask how to do the same with data stories.

Measuring some of these factors is as straightforward as instrumenting the presentations themselves to know where users moved their mouse, clicked, or shared. There are a variety of remote usability testing services that can already help with that. Measuring other factors might require writing and attaching survey questions to ask users about their perceptions of the experience. While the best graphics departments do a fair bit of internal iteration and testing it would be interesting to see what they could learn by setting up experiments that varied their designs minutely to see how that affected the audience along any of the dimensions delineated above. More collaboration between industry and academia could accelerate this process of building knowledge of the impact of data stories on the audience.

I’m not arguing that the creativity and boundary-pushing in data-visualization storytelling should cease. It’s inspiring looking at the range of visual stories that artists and illustrators produce. And sometimes all you really want is an amuse yeux — a little bit of visual amusement. Let’s not get rid of that. But I do think we’re at an inflection point where we know enough of the design dimensions to start building models of how to reliably know what story designs achieve certain goals for different kinds of story, audience, data, and context. We stand only to be able to further amplify the impact of such stories by studying them more systematically.

How does newspaper circulation relate to Twitter following?

I was recently looking at circulation numbers from the Audit Bureau of Circulation for the top twenty-five newspapers in the U.S. and wondered: How does circulation relate to Twitter following? So for each newspaper I found the Twitter account and recorded the number of followers (link to data). The graph below shows the ratio of Twitter followers to total circulation; you could say it’s some kind of measure of how well the newspaper has converted its circulation into a social media following.

You can clearly see national papers like the NYT and Washington Post rise above the rest, but for others like USA Today it’s surprising that with a circulation of about 1.7M, they have comparatively few — only 514k — Twitter followers. This may say something about the audience of that paper and whether that audience is online and using social media. For instance, Pew has reported stats that suggest that people over the age of 50 use Twitter at a much lower than average rate. Another possible explanation is that a lot of the USA Today circulation is vapor; I can’t remember how many times I’ve stayed at a hotel where USA Today was left for me by default, only to be left behind unread. Finally, maybe USA Today is just not leading an effective social strategy and they need to get better about reaching, and appealing to, the social media audience.
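As a quick worked example of the ratio, here is the USA Today figure computed from the approximate numbers quoted above. The function name `follower_ratio` is just an illustrative helper, and because the circulation is stated as "about 1.7M" the result is approximate too.

```python
def follower_ratio(twitter_followers: int, circulation: int) -> float:
    """Twitter followers per unit of print circulation."""
    return twitter_followers / circulation


# Approximate figures quoted in the text for USA Today.
usa_today = follower_ratio(514_000, 1_700_000)
print(f"USA Today: {usa_today:.2f} followers per circulated copy")
```

That works out to roughly 0.3 followers per circulated copy.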

There are some metro papers like NY Post and LA Times that also have decent ratios, indicating they’re addressing a fairly broad national or regional audience with respect to their circulation. But the real winners in the social world are NYT and WashPost, and maybe WSJ to some extent. And in this game of web scale audiences, the big will only get bigger as they figure out how to transcend their own limited geographies and expand into the social landscape.

[Figure: ratio of Twitter followers to total circulation for the top twenty-five U.S. newspapers]

Neolithic Journalists? Influence Engines? Narrative Analytics? Some Thoughts on C+J

A few weeks ago now was the 2nd Computation + Journalism Symposium at Georgia Tech, which I helped organize and program. I wrote up a few reflections on things that jumped out at me from the meeting. Check them out on Nieman Lab.

Aha! Brainstorming App

In April 2012 I published a whitepaper on Cultivating Innovation in Computational Journalism with the CUNY Tow-Knight Center for Entrepreneurial Journalism. Jeff Jarvis wrote about it on the Tow-Knight blog, and the Nieman Lab even covered it.

Part of the paper developed a structured brainstorming activity called “Aha!” to help students and news industry professionals in thinking more about ways to combine ideas from technology, information science, user needs, and journalistic goals into useful new news products and services. We produced a printed deck of cards with different concepts that people could re-combine, and you can still get these cards from CUNY.

But really the Aha! Brainstorming activity was begging to be made into an app, which is now available on the Apple App Store. The app has the advantages that you can augment the re-combinable concepts, you can audio record your brainstorming sessions, take and store photos of any notes you scribble down about your ideas, and share the whole thing via email with your colleagues. If you have an iDevice be sure to check it out!

Understanding bias in computational news media

Just a quick pointer to an article I wrote for Nieman Lab exploring some of the ways in which algorithms serve to introduce bias into news media. Different kind of writing than my typical academic-ese, but fun.

Mobile Gaming Summit 2012

I have recently been getting more into mobile design and development and so was excited to attend the Mobile Gaming Summit in New York today. It was a well attended event, with what seemed like dozens of presenters from top mobile studios sharing tips on everything from user acquisition to design, mobile analytics, cross-platform development, finance, and social. What I wanted to share here quickly were some of the resources that were mentioned at the summit because I think they would be useful to any mobile studio / developer who’s just starting out (noobs like me!). So, by topic, here are some services to check out:

  • Ad Platforms for user acquisition
  • Analytics
    • Flurry (free analytics platform to help you understand how users are using your app)
    • Bees and Pollen (analytics to help optimize the user experience based on the user)
    • Apsalar
  • Cross-Platform Technologies
    • Corona (uses a language called Lua that I’ve never heard of)
    • Marmalade (program in C++, deploy to iOS, Android, Xbox, etc.)
    • PhoneGap (program in JavaScript, HTML, CSS)
    • Unity (geared toward 3D games)

In general I was impressed with the amount of data driven design going on in the mobile apps / games space and how the big studios are really optimizing for attention, retention, and monetization by constantly tweaking things.

Other tips that were shared included things like: use Canada as a test market to work out kinks in your apps before you launch in the larger U.S. market; concentrate marketing efforts / budget in a short period of time to attain the highest rank in the app store as this drives more organic growth; the industry is heavily moving towards a free-to-play model with monetization done with in-app purchases or advertising.

In the next few weeks I’ll be excited to try out some of these services with my new app, Many Faces, which launched a couple weeks ago. I think it’s all about the user-acquisition / marketing at this point …

Comment Readers Want Relevance!

A couple years ago now I wrote a paper about the quality of comments on online news stories. For the paper I surveyed a number of commenters on sacbee.com about their commenting experience on that site. One of the aspects of the experience that users complained about was that comments were often off-topic: that comments weren’t germane, or relevant, to the conversation or to the article to which they were attached. This isn’t surprising, right? If you’ve ever read into an online comment thread you know there’s a lot of irrelevant things that people are posting.

It stands to reason then that if we can make news comments more relevant then people might come away more satisfied from the online commenting experience; that they might be more apt to read and find and learn new things if the signal to noise ratio was a bit higher. The point of my post here is to show you that there’s a straightforward and easy-to-implement way to provide this relevance that coincides with both users’ and editors’ notions of “quality comments”.

I collected data in July via the New York Times API, including 370 articles and 76,086 comments oriented around the topic of climate change. More specifically I searched for articles containing the phrase “climate change” and then collected all articles which had comments (since not all NYT articles have comments). For each comment I also had a number of pieces of metadata, including: (1) the number of times the comment was “recommended” by someone upvoting it, and (2) whether the comment was an “editor’s selection”. Both of these ratings indicate “quality”; one from the users’ point of view and the other from the editors’. And both of these ratings in fact correlate with a simple measure of relevance as I’ll describe next.

In the dataset I collected I also had the full text of both the comments and the articles. Using some basic IR ninjitsu I then normalized the text, stop-worded it (using NLTK), and stemmed the words using the Porter stemming algorithm. This leaves us with cleaner, less noisy text to work with. I then computed relevance between each comment and its parent article by taking the cosine similarity (the normalized dot product) of unigram feature vectors of tf-idf scores. For the sake of the tf-idf scores, each comment was considered a document, and only unigrams that occurred at least 10 times in the dataset were considered in the feature vectors (again to reduce noise). The outcome of this process is that for each comment-article pair I now had a score (between 0 and 1) representing similarity in the words used in the comment and those used in the article. So a score of 1 would indicate that the comment and article were using identical vocabulary whereas a score of 0 would indicate that the comment and article used no words in common.
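For readers who want to try this, here is a minimal pure-Python sketch of the relevance score. It is not the code used for the analysis: it assumes the text has already been normalized, stop-worded, and stemmed into token lists, it skips the minimum-count filter on unigrams, and it uses a smoothed idf (log((1 + N) / (1 + df)) + 1), which is my assumption here rather than the exact weighting behind the numbers in this post. The toy documents are made up.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many docs does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Dot product of two sparse vectors divided by the product of their norms."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, already-preprocessed documents (illustrative only).
article = "climate change warming emissions".split()
on_topic = "emissions drive climate warming".split()
off_topic = "my cat likes tuna".split()

article_vec, on_vec, off_vec = tf_idf_vectors([article, on_topic, off_topic])
on_score = cosine_similarity(article_vec, on_vec)
off_score = cosine_similarity(article_vec, off_vec)
```

Because tf-idf weights are non-negative, the score always falls between 0 and 1; here the off-topic comment shares no words with the article and scores 0, while the on-topic one lands somewhere in between.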

So, what’s interesting is that this simple-to-compute metric for relevance is highly correlated to the recommendation score and editor’s selection ratings mentioned above. The following graph shows the average comment to article similarity score over each recommendation score up to 50 (red dots), and a moving average trend line (blue).

As you get into the higher recommendation scores there’s more variance because it’s averaging fewer values. But you can see a clear trend that as the number of recommendation ratings increases so too does the average comment to article similarity. In statistical terms, Pearson’s correlation is r=0.58 (p < .001). There’s actually a fair amount of variance around each of those means though, and the next graph shows the distribution of similarity values for each recommendation score. If you turn your head sideways each column is a histogram of the similarity values.
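For reference, Pearson’s r can be computed in a few lines. This is a generic sketch of the statistic, not the script that produced the r=0.58 above, and the two small lists are made-up numbers rather than values from the NYT dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up example: recommendation counts vs. average similarity.
recommendations = [1, 2, 3, 4, 5]
avg_similarity = [0.08, 0.09, 0.09, 0.11, 0.12]
r = pearson_r(recommendations, avg_similarity)
```

The function assumes neither sequence is constant; a constant input would make one of the spreads zero and the division would fail.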

We can also look at the relationship between comment to article similarity in terms of editors’ selections, certain comments that have been elevated in the user interface by editors. The average similarity for comments that are not editors’ selections is 0.091 (N=73,723) whereas for comments that are editors’ selections the average is 0.118 (N=2,363). A t-test between these distributions indicates that the difference in means is statistically significant (p < .0001). So what we learn from this is that editors’ criteria for selecting comments also correlate to the similarity in language used between the comment and article.

The implications of these findings are relatively straightforward. A simple metric of similarity (or relevance) correlates well to notions of “recommendation” and editorial selection. This metric could be surfaced in a commenting system user interface to allow users to rank comments based on how similar they are to an article, without having to wait for recommendation scores or editorial selections. In the future I’d like to look into ways to assess how predictive such metrics are in terms of recommendation scores, as well as try out different metrics of similarity, like KL divergence.

Many Faces Photo Collages

I’ve been interested in photo collages for years. Those who know me well have likely seen my Many Faces from a few years ago (pictured above), which was inspired by some improv classes I was taking at the time. It was fun to put together, but also very time-consuming. A couple months ago I realized it would be fun to turn the concept into an app that could help quickly and easily make ManyFace-esque collages. I’m happy to say that the app has launched in the app store today. For a bit more info on the app you can also visit the website. Please check it out, and if you like it, share your ManyFaces on twitter or facebook.