Moving Towards Algorithmic Corroboration

Note: this is cross-posted on the Berkman/MIT “Truthiness in Digital Media” blog

One of the methods that truth seekers like journalists or social scientists often employ is corroboration. If we find two (or more) independent sources that reinforce each other, and that are credible, we gain confidence in the truth-value of a claim. Independence is key, since political, monetary, legal, or other connections can taint or at least place contingencies on the value of corroborated information.

How can we scale this idea to the web by teaching computers to effectively corroborate information claims online? An automated system could allow any page online to be quickly checked for misinformation. Questionable claims could be flagged and highlighted, either because they lack corroboration or because sources corroborate conflicting versions of them (i.e. a controversy).

A handful of efforts in the computing research literature have already looked at how to do algorithmic corroboration. But there is still work to do to define adequate operationalizations so that computers can be effective corroborators. First of all, we need to define and extract the units that are to be corroborated. Computers need to be able to differentiate a factually stated claim from a speculative or hypothetical one, since only factual claims can really be meaningfully corroborated. In order to aggregate statements we then need to be able to match two claims together while taking into account different ways of saying similar things. This includes the challenge of context: the tiniest change in context can alter the meaning of a statement and make it difficult for a computer to assess whether two statements are equivalent. Then, the simplest aggregation strategy might consider the frequency of a statement as a proxy for its truth-value (the more sources that agree with statement X, the more we should believe it), but this doesn’t take into account the credibility of each source or its other relationships, which also need to be enumerated and factored in. We might want algorithms to consider other dimensions such as the relevance and expertise of the source to the claim, the source’s originality (or lack thereof), the prominence of the claim in the source, and the source’s spatial or temporal proximity to the information. There are many research challenges here!
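To make the aggregation step concrete, here is a minimal Python sketch of one way to go beyond counting sources. It is only an illustration under stated assumptions, not a method taken from the research mentioned above: claims are assumed to be already extracted and matched, the credibility weights and the `owner` grouping are hypothetical inputs someone would have to supply, and the noisy-OR combination is just one simple choice among many.

```python
from collections import defaultdict


def corroboration_scores(reports, credibility, owner):
    """Score each claim between 0 and 1 from the sources that assert it.

    reports:     iterable of (source, claim) pairs; claims are assumed to be
                 already normalized so that equivalent claims are equal strings.
    credibility: dict mapping source -> weight in [0, 1] (hypothetical input).
    owner:       dict mapping source -> group id; sources in the same group
                 are treated as NOT independent (hypothetical input).
    """
    # For each claim, keep only the most credible source per group, so that
    # ten outlets with one owner count as a single voice.
    best = defaultdict(dict)
    for source, claim in reports:
        group = owner.get(source, source)  # ungrouped sources stand alone
        weight = credibility.get(source, 0.0)
        best[claim][group] = max(best[claim].get(group, 0.0), weight)

    scores = {}
    for claim, groups in best.items():
        # Noisy-OR: the claim stays in doubt only if every independent
        # group fails to support it.
        doubt = 1.0
        for weight in groups.values():
            doubt *= 1.0 - weight
        scores[claim] = 1.0 - doubt
    return scores


if __name__ == "__main__":
    reports = [("a", "x"), ("b", "x"), ("c", "x"), ("d", "y")]
    credibility = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
    owner = {"a": "g1", "b": "g1", "c": "g2"}
    print(corroboration_scores(reports, credibility, owner))
```

In the small example, sources a and b share an owner and so count once, which means claim x is backed by two independent groups rather than three sources. The other dimensions listed above (expertise, originality, prominence, proximity) could enter as further adjustments to each weight, but how to estimate them is exactly the open research question.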

Any automated corroboration method would rely on a corpus of information that acts as the basis for corroboration. Previous work like DisputeFinder has looked at scraping known repositories such as Politifact or Snopes to jump-start a claims database, and other work like Videolyzer has tried to leverage engaged people to provide structured annotations of claims, though it’s difficult to get enough coverage and scale through manual efforts. Others have proceeded by using the internet as a massive corpus. But there could also be an opportunity here for news organizations, which already produce and archive a great deal of credible and trustworthy text, to provide a corroboration service based on all of the claims embedded in those texts. A browser plugin could detect and highlight claims that are not corroborated by e.g. the NYT or Washington Post corpora. Could news organizations even make money off their archives like this?
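As a rough illustration of what such a plugin might do behind the scenes, the Python sketch below flags page claims that have no close match in a trusted corpus. Everything in it is an assumption made for the example: the function names, the 0.6 threshold, and above all the use of simple word overlap (Jaccard similarity) as a stand-in for real claim matching. Word overlap ignores negation and context, so a real system would need something far more careful.

```python
import re


def tokens(text):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two sentences, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def uncorroborated(page_claims, corpus_claims, threshold=0.6):
    """Return the page claims with no sufficiently similar corpus claim.

    threshold is a hypothetical cutoff chosen for illustration only.
    """
    flagged = []
    for claim in page_claims:
        best = max((jaccard(claim, c) for c in corpus_claims), default=0.0)
        if best < threshold:
            flagged.append(claim)
    return flagged


if __name__ == "__main__":
    corpus = ["The city council approved the budget on Tuesday"]
    page = [
        "City council approved the budget Tuesday",
        "The mayor resigned last week",
    ]
    for claim in uncorroborated(page, corpus):
        print("Not corroborated:", claim)
```

Here the first page claim closely matches the archived sentence and passes, while the second has almost no overlap and is flagged. Note that "not corroborated" only means the corpus is silent on the claim, not that the claim is false.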

It’s important not to forget that there are limits to corroboration too, both practical and philosophical. Hypothetical statements, opinions and matters of taste, or statements resting on complex assumptions may not benefit at all from a corroborative search for truth. Moreover, systemic bias can still go unnoticed, and a collective social mirage can guide us toward fonts of hollow belief when we drop our critical gaze. We’ll still need smart people around, but, I would argue, finding effective ways to automate corroboration would be a huge advance and a boon in the fight against a misinformed public.