
Fact-Checking at Scale

Note: this is cross-posted on the CUNY Tow-Knight Center for Entrepreneurial Journalism site. 

Over the last decade there’s been a substantial growth in the use of fact-checking to correct misinformation in the public sphere. Outlets like Politifact tirelessly research and assess the accuracy of all kinds of information and statements from politicians and think-tanks. But a casual perusal of these sites shows that there are usually only one or two fact-checks per day from any given outlet. Fact-checking is an intensive research process that demands considerable skilled labor and careful consideration of potentially conflicting evidence. For a task that’s so labor-intensive, how can we scale it so that the truth is spread far and wide?

Of late, Politifact has expanded by franchising its operations to states – essentially increasing the pool of trained professionals participating in fact-checking. It’s a good strategy, but I can think of at least a few others that would also grow the fact-checking pie: (1) sharpen the scope of what’s fact-checked so that attention is where it’s most impactful, (2) make use of volunteer, non-professional labor via crowdsourcing, and (3) automate certain aspects of the task so that professionals can work more quickly. In the rest of this post, I’ll flesh out each of these approaches in a bit more detail.

Reduce Fact-Checking Scope
“I don’t get to decide which facts are stupid … although it would certainly save me a lot of time with this essay if I were allowed to make that distinction.” argues Jim Fingal in his epic fact-check struggle with artist-writer John D’Agata in The Lifespan of a Fact. Indeed, some of the things Jim checks are really absurd: did the subject take the stairs or the elevator, did he eat “potatoes” or “french fries”; these things don’t matter to the point of that essay, nor, frankly, to me as the reader.

Fact-checkers, particularly the über-thorough kind employed by magazines, are tasked with assessing the accuracy of every claim or factoid written in an article (see the Fact Checker’s Bible for more). This includes hard facts like names, stats, geography, and physical properties, as well as what sources claim via a quotation, or what the author writes from notes. Depending on the nature of the claim, some of it may be subjective, opinion-based, or anecdotal. All of this checking is meant to protect the reputation of the publication and of the writers, and to maintain trust with the public. But it’s a lot to check, and the imbalance between content volume and critical attention will only grow.

To economize their attention fact-checkers might better focus on overall quality; who cares if they’re “potatoes” or “french fries”? In information science studies, the notion of quality can be defined as the “value or ‘fitness’ of the information to a specific purpose or use.” If quality is really what we’re after then fact-checking would be well-served and more efficacious if it focused the precious attention of fact-checkers on claims that have some utility. These are the claims that if they were false could impact the outcome of some event or an important decision. I’m not saying accuracy doesn’t matter, it does, but fact-checkers might focus more energy on information that impacts decisions. For health information this might involve spending more time researching claims that impact health-care options and choices; for finance it would involve checking information informing decisions about portfolios and investments. And for politics this involves checking information that is important for people’s voting decisions – something that the likes of Politifact already focus on.
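One way to picture this economizing of attention is as a triage step: score each claim by how much a false version of it could affect a decision, and send only the highest-utility claims to the fact-checker. This is a minimal sketch; the claims, topics, and utility weights are all hypothetical illustrations, not a real scoring model.

```python
# Sketch: triage claims by decision utility before fact-checking.
# Claims, topics, and weights below are invented for illustration.

CLAIMS = [
    {"text": "The bill cuts education funding by 12%", "topic": "policy"},
    {"text": "The senator ate french fries at the rally", "topic": "color"},
    {"text": "The vaccine trial enrolled 40,000 participants", "topic": "health"},
]

# Assumed utility weights: how much a false claim in this topic could
# affect an important decision (voting, health, finance).
UTILITY = {"policy": 0.9, "health": 0.95, "finance": 0.85, "color": 0.1}

def triage(claims, budget=2):
    """Return the `budget` claims most worth a fact-checker's attention."""
    ranked = sorted(claims, key=lambda c: UTILITY.get(c["topic"], 0.5),
                    reverse=True)
    return ranked[:budget]

for claim in triage(CLAIMS):
    print(claim["text"])
```

Under this scheme the “potatoes or french fries” style of claim simply never makes it into the fact-checker’s queue.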

Increased Use of Volunteer Labor
Another approach to scaling fact-checking is to incorporate more non-professionals, the crowd, in the truth-seeking endeavor. This is something often championed by social media journalists like Andy Carvin, who see truth-seeking as an open process that can involve asking for (and then vetting) information from social media participants. Mathew Ingram has written about how platforms like Twitter and Reddit can act as crowdsourced fact-checking platforms. And there have been several efforts toward systematizing this, notably the TruthSquad, which invited readers to post links to factual evidence that supports or opposes a single statement. A professional journalist would then write an in-depth report based on their own research plus whatever research the crowd contributed. I will say I’m impressed with the kind of engagement they got, though sadly it’s not being actively run anymore.

But it’s important to step back and think about what the limitations of the crowd in this (or any) context really are. Graves and Glaisyer remind us that we still don’t really know how much an audience can contribute via crowdsourced fact-checking. Recent information quality research by Arazy and Kopak gives us some clues about which dimensions of quality may be more amenable to crowd contributions. In their study they looked at how consistent ratings of various Wikipedia articles were along dimensions of accuracy, completeness, clarity, and objectivity. They found that, while none of these dimensions had particularly consistent ratings, completeness and clarity were more reliable than objectivity or accuracy. This is probably because it’s easier to use a heuristic or shortcut to assess completeness, whereas rating accuracy requires specialized knowledge or research skill. So, if we’re thinking about scaling fact-checking with a pro-am model, we might have the crowd focus on aspects of completeness and clarity, but leave the difficult accuracy work to the professionals.
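The intuition behind that finding can be made concrete by measuring how much raters disagree on each dimension. The sketch below uses the spread (standard deviation) of ratings as a rough consistency measure; the 1-to-5 ratings are invented for illustration, standing in for real study data like Arazy and Kopak’s.

```python
# Sketch: measure how consistent crowd ratings are per quality dimension.
# The ratings below are invented; lower spread = more rater agreement.
from statistics import mean, stdev

ratings = {  # dimension -> ratings (1-5) from five raters for one article
    "accuracy":     [2, 5, 3, 1, 4],
    "objectivity":  [1, 4, 5, 2, 3],
    "completeness": [4, 4, 3, 4, 4],
    "clarity":      [5, 4, 4, 5, 4],
}

# Dimensions where raters agree are better candidates for crowd work.
spread = {dim: stdev(scores) for dim, scores in ratings.items()}
for dim in sorted(spread, key=spread.get):
    print(f"{dim}: mean={mean(ratings[dim]):.1f}, spread={spread[dim]:.2f}")
```

In this toy data, completeness and clarity show the tight agreement the study found, while accuracy and objectivity scatter widely, which is exactly the pattern that suggests routing them to professionals.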

#Winning with Automation
I’m not going to fool anyone by claiming that automation or aggregation will fully solve the fact-checking scalability problem. But there may be bits of it that can be automated, at least to a degree where it would make the life of a professional fact-checker easier or make their work go faster. An automated system could allow any page online to be quickly checked for misinformation. Violations could be flagged and highlighted, either for lack of corroboration or for controversy, or the algorithm could be run before publication so that a professional fact-checker could take a further crack at it.

Hypothetical statements, opinions and matters of taste, or statements resting on complex assumptions may be too hairy for computers to deal with. But we should be able to automatically both identify and check hard-facts and other things that are easily found in reference materials. The basic mechanic would be one of corroboration, a method often used by journalists and social scientists in truth-seeking. If we can find two (or more) independent sources that reinforce each other, and that are credible, we gain confidence in the truth-value of a claim. Independence is key, since political, monetary, legal, or other connections can taint or at least place contingencies on the value of corroborated information.

There have already been a handful of efforts in the computing research literature that have looked at how to do algorithmic corroboration. But there is still work to do to define adequate operationalizations so that computers can do this effectively. First of all, we need to define, identify, and extract the units that are to be corroborated. Computers need to be able to differentiate a factually stated claim from a speculative or hypothetical one, since only factual claims can really be meaningfully corroborated. In order to aggregate statements we then need to be able to match two claims together while taking into account different ways of saying similar things. This includes the challenge of context, the tiniest change in which can alter the meaning of a statement and make it difficult for a computer to assess the equivalence of statements. Then, the simplest aggregation strategy might consider the frequency of a statement as a proxy for its truth-value (the more sources that agree with statement X, the more we should believe it), but this doesn’t take into account the credibility of the source or their other relationships, which also need to be enumerated and factored in. We might want algorithms to consider other dimensions such as the relevance and expertise of the source to the claim, the source’s originality (or lack thereof), the prominence of the claim in the source, and the source’s spatial or temporal proximity to the information. There are many challenges here!
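A stripped-down version of the aggregation step can be sketched in a few lines. Here claim matching is reduced to naive text normalization (a stand-in for real paraphrase detection), and each source carries an invented credibility score, so corroboration support is a credibility-weighted sum rather than a raw frequency count. All the sources and claims are hypothetical.

```python
# Sketch of credibility-weighted corroboration. Claim matching is
# reduced to naive normalization; sources and scores are invented.
import re
from collections import defaultdict

def normalize(claim: str) -> str:
    """Crude stand-in for real claim matching / paraphrase detection."""
    return re.sub(r"[^a-z0-9 ]", "", claim.lower()).strip()

# (source, credibility in [0, 1], claim) -- all hypothetical
REPORTS = [
    ("outlet_a", 0.9, "The unemployment rate fell to 5%."),
    ("outlet_b", 0.8, "the unemployment rate fell to 5%"),
    ("blog_c",   0.3, "The unemployment rate rose to 8%."),
]

def corroborate(reports):
    """Sum source credibility per distinct claim, not raw frequency."""
    support = defaultdict(float)
    for source, cred, claim in reports:
        support[normalize(claim)] += cred
    return dict(support)

scores = corroborate(REPORTS)
```

Even this toy version shows why operationalization is the hard part: everything interesting (claim extraction, equivalence, source independence, expertise, prominence) is hidden inside the two functions that are trivial here.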

Any automated corroboration method would rely on a corpus of information that acts as the basis for corroboration. Previous work like DisputeFinder has looked at scraping or accessing known repositories such as Politifact or Snopes to jump-start a claims database, and other work like Videolyzer has tried to leverage engaged people to provide structured annotations of claims. Others have proceeded by using the internet as a massive corpus. But there could also be an opportunity here for news organizations, who already produce and have archives of lots of credible and trustworthy text (e.g. rigorously fact-checked magazines), to provide a corroboration service based on all of the claims embedded in those texts. Could news organizations even make money by syndicating their archives like this?

There are of course other challenges to fact-checking that also need to be surmounted, such as the user-interface for presentation or how to effectively syndicate fact-checks across different media. In this essay I’ve argued that scale is one of the key challenges to fact-checking. How can we balance scope with professional, non-professional, and computerized labor to get closer to the truth that really matters?


Journalism as Information Science

The core activity of journalism basically boils down to this: knowledge production. It’s presented in various guises: stories, maps, graphics, interviews, and more recently even things like newsgames, but it all essentially entails the same basic components of information gathering, organizing, synthesizing, and publishing of new (sometimes just new to you) knowledge. To be sure, the particular flavor of knowledge is colored by the cultural milieu, ethics, and temporal constraints through which journalism extrudes information into knowledge. Journalists add value to information and news by making sense of it, making it more accessible and memorable, and putting it in context.

Many of the practices followed by journalists in the process of knowledge production can be mapped quite neatly to corresponding ideas in information science. Thankfully, information science studies knowledge production in a much more structured fashion, and in the rest of this post I’d like to surface some of that structure as a way for reflecting on what journalists do, and for thinking about how technology could enhance such processes.

Much of what journalists are engaged with on a day-to-day basis is in adding value to information. Raw data and information is harvested from the world, and as the journalist gathers it and makes sense of it, puts it in context, increases its quality, and frames it for decision making, it gets more and more valuable to the end-user. And by “value” I don’t necessarily mean monetary, but rather usefulness in meeting a user need. This point is important because it implies that the value of information is perceived and driven by user-needs in context. And the process is cyclical or recursive. The output of someone else, be it an article, tweet, or comment can be fed into the process for the next output.

Robert S. Taylor, one of the fathers of information studies at Syracuse University, wrote an entire book on value-added processes in information systems. Below I examine the processes that he described. There may be some information processes that journalists could learn to do more effectively, with or without new tools. Taylor organized the processes into four broad categories:

  • Ease of Use: This includes information usability such as information architecture (i.e. how to order information), design (i.e. how to format and present information), and browseability. When journalists take a table of numbers and present them as a map or graph they are making that data far more accessible and usable; when they write a compelling story which incorporates those numbers it is also increasing value through usability. Physical accessibility is also important to ease of use, and there’s no doubt that the physical accessibility of information on a mobile or tablet is different than on a desktop.
  • Noise Reduction: This involves the processes of inclusion and exclusion with an understanding of relevance that may be informed by context or end-user needs. Journalists are constantly engaging as noise reducers as they assemble a story and decide what is relevant to include and what is not, and even by their very judgement of what is considered newsworthy. Summarization is another dimension of this, as is linking which provides access to other relevant information.
  • Quality: A lot of value is added to information by enhancing its quality. Quality decisions depend on quality information: garbage in, garbage out. Quality includes aspects of accuracy, comprehensiveness (i.e. completeness of coverage), currency, reliability (i.e. consistent and dependable), and validity. Journalists engage (sometimes) in fact-checking to enhance accuracy, as well as corroboration of sources as a method to increase validity. Different end-user contexts and needs make different demands on quality: non-breaking news doesn’t have the same demands on currency, for instance. Seeing as quality (i.e. a commitment to truth) is a central value of journalism, it stands to reason that tools built for journalism might consider new ways of enhancing quality.
  • Adaptability: The idea of adaptability is that information is most valuable when it meets specific needs of a person with a particular problem. This involves knowing what users’ information needs are. Another dimension is that of flexibility, providing a variety of ways to work with information. Oftentimes I think adaptability is addressed in journalism through nichification – that is one outlet specializes in a particular information need, like for example, Consumer Reports.

You can’t really argue that any of these processes aren’t important to the knowledge produced by journalists, and many (all?) of them are also important to others who produce knowledge. There are people out there specialized in some of these activities. For instance, my alma mater, Georgia Tech, pumps out master’s degrees in Human-Computer Interaction, which teaches you a whole lot about that first category above – ease of use. Journalism could benefit from more cross-functional teams with such specialists.

The question moving forward is: How can technology inform the design of new tools that enable journalists to add the above values to information? Quality seems like a likely target since it is so important in journalism. But aspects of noise reduction (summarization) and adaptability may also be well-suited to developing augmenting technologies. Moreover, newer forms of information (e.g. social media) are in need of new processes that can add value.

Usable Transparency

The NYT has recently been doing a lot of interactive pieces for the 2008 presidential election. One of these is an interactive chart presentation of different political polls done by different organizations. This isn’t quite game-y, though it could be if there were some additional features like being able to compare one poll to another, or to try to predict a future poll based on current polls for points. Anyway, the important point here is that these visualizations are based on some simple polling data, things like # of respondents, and % in favor of each candidate. The Times is transparent about this data in 2 ways, (1) by providing a link explaining eligibility for polls to be included in the chart and (2) by providing a link to the raw database dump of the data. The eligibility link speaks to data quality issues that can arise in the collection of data, which can lead to invalid results or bias. The database dump link speaks to the ability to peer behind the graphic to the actual data used to produce it.

It’s useful to draw a distinction between data and information here, data being raw sensor readings or direct observations and information being additional context and interpretation based on data. There’s a difference in what needs to be done in terms of transparency of data (which the Times did magnificently for the interactive polling piece) and transparency of information. This is because there is a layer of contextualization and interpretation that also needs to be explicated in order to be transparent about information. This touches on issues of individual and organizational biases, since interpretation itself is influenced by these outside sources. Moreover, interpretation can be something encoded into mathematical equations that produce information (derived values) from the actual raw data. Consider the mean of all polls for each candidate. This is a derived value, albeit one that most people understand readily, but nonetheless one which takes an interpretive stance: that a mean of polling data collected under different circumstances is meaningful. As we move from simple means to more complexity, a data-driven model is really nothing more than a series of complex mathematical manipulations which interpret the data into a manageable form of information.
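The mean-of-polls example already hides an interpretive choice: should every poll count equally, or should larger polls count more? The sketch below contrasts the two; the poll numbers are invented for illustration.

```python
# Sketch: a "derived value" from raw polling data -- a simple mean
# versus a mean weighted by sample size. Poll numbers are invented.

polls = [  # (respondents, percent favoring a candidate)
    (1000, 48.0),
    (600, 52.0),
    (1500, 47.0),
]

total_n = sum(n for n, _ in polls)
weighted_mean = sum(n * pct for n, pct in polls) / total_n
simple_mean = sum(pct for _, pct in polls) / len(polls)

# The two interpretations disagree -- and that choice is exactly the
# interpretive layer that should be made transparent to readers.
print(f"simple={simple_mean:.1f}, weighted={weighted_mean:.1f}")
```

Two defensible equations, two different headline numbers: this is the gap between transparency of data and transparency of information.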

Here’s the crux: to be transparent about information (interpretation from data), journalists need a way to express interpretations or manipulations, mathematical though they may be, in a way that is easily understood. This has direct bearing on games for journalism, since the models through which games interpret the world will be important to explicate to consumers in the spirit of transparency. The problem, alas, is that math is impenetrable to many. Imagine the Times providing a third link for transparency, one which shows a nasty equation on top of which a simulation is built. This is important, because even though many people won’t take the time to understand it, those who do take the time will be able to verify or understand the model. But what about the other people? They need Usable Transparency. I like to think that a simulation game like SimCity follows the principle of usable transparency – you don’t need to understand the simulation model to be able to make decisions in the game. The manual describes in prose what to do to alleviate trash problems, create more jobs, or reduce rush hour traffic jams. I think this is a useful paradigm that would serve journalists well in thinking about transparency as it relates to games. The collection of the data is important, check. The data itself is important, check. But the mathematical model which drives a simulation is important too. I would argue for a prose description of that model which itself is footnoted with grounding equations.

NYT Interactive Presidential Debates

The New York Times recently published an interactive application for exploring the video and transcripts from the presidential and vice-presidential debates. Actual debate content aside, the application is quite a usable foray into the realm of multimedia (video + transcript) interfaces. Seen here is a screen shot of the application from the 2nd presidential debate.

Overall the interface has a good “flow.” At the top is the ability to search for keywords and see where they showed up in the transcript. You can see the comparison of a word’s usage between Obama, McCain, and the moderator. Below this are two timelines; the problem is that while each is intuitive on its own, they are in the wrong hierarchical order. The topmost timeline is the most “zoomed out,” but the next one down is the most “zoomed in.” Really they need to be re-ordered so that the middle timeline is the bottom-most. This would be a more intuitive layout, from least detailed to most detailed. What IS really nice about all of the timelines, and what really helps navigation, is all of the textual information that pops up when hovering. There’s also some segmentation showing parts of the video where each of the debaters is speaking. I found it really helpful to be able to click any of these segments and navigate the video to that point. There is some navigational integration with the transcript which is interesting too. For one, you can click on a block of the transcript and that will navigate you to that section of the video. But still, we’re dealing with blocks of text rather than individual words being linked into the video.

The other fantastic aspect of this tool is that it provides some level of integrated fact-checks. The fact-checking is produced professionally by the Times and is presented as aligned with the different question segments. It’s difficult to follow, though, because it’s in a tab which competes with the transcript itself, so you can’t see the context or anchor to which the fact-checking refers. It seems it would be a lot more helpful for comparison’s sake to be able to see both the transcript and the fact-checking at the same time. The other problem with the presentation of the fact-checking is just that it’s really dense and hard to read through. Again, better contextualization with the video and the transcript would really help here.

The Journalism of Awareness

In The Elements of Journalism, Kovach and Rosenstiel call it the “Awareness Instinct”: that basic human drive to know something about what’s going on beyond our direct experience. Sure, the gold standard for journalists is to give people the information they need to make the decisions that are important to themselves, their families, and their society, but in our attention-starved culture can we settle for something less grandiose? Where deep understanding and time-consuming sensemaking of an issue can’t be achieved, there is still awareness: a recognition of the issue. And this awareness facilitates the human need to build common ground and community by allowing us to talk about news events with others. That is, common ground around a shared awareness of news allows us to build social connections with others in the community, to relate to others through a shared understanding. So, while some may think that merely being aware of a news event is paltry in comparison to really deeply understanding it, it does indeed carry with it great value. How do we enable awareness for news information?

Storytelling is one way to take information and make it interesting, relevant, and engaging to an audience. A way to make the significant matter to people. A way to raise awareness for a deeper issue by telling a good story. Another approach is to take raw data or information and to make it engaging through interaction. Games, information visualization, and other interactive data driven applications fit into this latter area. In this sense, the journalism of awareness can fully embrace new media as a vector for raising awareness for issues in the news, even if this new media falls short of that gold standard of journalism.

Here are some examples of what I mean by the Journalism of Awareness:

Online news quizzes of the sort found on Facebook, for one, serve to raise awareness for news information. I think the quiz mechanic gets lambasted undeservingly for being “too simple” or “not interactive.” It’s raising awareness for news information without getting deep. That’s OK. If you get something wrong, you were still exposed to the quiz question and have a chance to go back afterwards and read the original news item if you care to. The downside is, if you’re not interested in news to begin with, chances are you won’t go out of your way to complete a news quiz. The other downside is that someone has to sit there and write the questions and answers for these news quizzes: there’s a non-zero authoring cost.

Information visualization of the sort featured on Digg Labs is also a form of the journalism of awareness. These visualizations are dynamic and packed with information, but certainly don’t help you connect any dots. They’re there to provide an entry point to the information space, something that looks fun and visual to draw you in with enough of a snippet to get you interested in digging in. The upside here is that no authoring is necessary; Digg grabs the headline and first few sentences of the story as a summary automatically. There are LOTS of examples of calm, “ambient” visualizations which leave information scent in an environment to raise awareness.

Perhaps most promising for the journalism of awareness are those interactive games or applications that remediate already-authored news content, because this opens up new avenues for engaging consumers and raising awareness for news using existing content. So for example we have the games featured on MSNBC’s NewsWare site. While simple instantiations of classic arcade games, NewsBlaster and NewsBreaker use RSS news feeds to expose the player of the game to pertinent headlines in the course of play. Another example of this is my own game, Audio Puzzler, a puzzle game which is played with short (~1 min) video snippets found online. The game is actually content-agnostic, but when fed with news content such as video podcasts, it exposes people to the entire news video snippet in the course of solving the puzzle. These types of applications have the added benefit of engaging people who might not have otherwise been exposed to the information. This is in comparison to the quiz or info viz examples, which presuppose an initial interest by the user. Perhaps in the course of playing, awareness is raised and questions spawned. That can help feed the awareness instinct and is perhaps a first step in getting people to actively engage the news.

Information Quality and Intentionality

My friend Kelly had some questions for me after my proposal last month and I’m finally getting around to thinking about some of them. One question that she had was about how a lot of low quality information (e.g. press releases, advertisements etc.) is not accidentally of low quality, but is rather intentionally biased to get a particular side across. Should a measure of information quality address the intentionality of the communication? Is it worse if something is misleading than mistaken?

Whether something is misleading or mistaken has to do with the intentionality of the communicator; however, what one perceives in the end is still the same: lower quality information. I think it would be difficult to show that someone had the intention to mislead, because that information is known only to the creator of the information, or at best the institution. Based on just the end product there’s no way to know the intentionality. If we could tell that a communicator was intentionally misleading, we would be able to factor this into their reputation score. There are, though, some cues that can raise suspicions about intentionality, such as the relationship of the communicator to advertisers, the political leanings of the communicator, and the funding source for the production of the information. But these aren’t smoking guns; just because a cell phone maker pays for a study on the dangers of cell phone use, it doesn’t necessarily mean that the results are biased. But it does give us pause to think about the intentions of the producer of the information.

So back to the original question: should information quality address intentional bias? Yes, I think it should, but since the true intentions of a communicator are hidden, we have to rely on the cues that I listed above. The more intentionally biased a source is, the more this should in turn affect their credibility rating; in fact this could be thought of as another facet of source annotation for the system that I’m building.
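One way such a facet could work in a system like this is to discount a source’s base credibility rating for each bias cue observed. This is a minimal sketch, not the actual system from my proposal; the cue names and penalty weights are hypothetical.

```python
# Sketch: discounting a source's credibility score by observable bias
# cues. The cue list and penalty weights below are hypothetical.

BIAS_PENALTIES = {
    "advertiser_relationship": 0.15,
    "partisan_affiliation": 0.20,
    "interested_funding": 0.25,  # e.g. study funded by the product's maker
}

def adjusted_credibility(base: float, cues: list[str]) -> float:
    """Reduce a base credibility rating for each bias cue observed.
    Cues are suspicions, not smoking guns, so they discount the score
    rather than zero it out."""
    penalty = sum(BIAS_PENALTIES.get(c, 0.0) for c in cues)
    return max(0.0, base - penalty)

# A well-regarded source with one suspicious funding relationship:
score = adjusted_credibility(0.8, ["interested_funding"])
```

The design choice matters: because intent is hidden, the cues only lower confidence in proportion to how suspicious they are, rather than flagging the source as intentionally misleading.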