Finding News Sources in Social Media

Whether it’s terrorist attacks in Mumbai, a plane crash landing on the Hudson River, or videos and reactions from a recently capsized cruise ship in Italy, social media has proven itself again and again to be a huge boon to journalists covering breaking news events. But at the same time, the prodigious amount of social media content posted around news events creates a challenge for journalists trying to find interesting and trustworthy sources in the din. A few recent efforts have looked at automatically identifying misinformation on Twitter, or automatically assessing credibility, though pure automation carries the risk of cutting human decision makers completely out of the loop. There aren’t many general purpose (or accessible) solutions out there for this problem either; services like Klout help identify topical authorities, and Storify and Storyful help in assembling social media content, but don’t offer additional cues for assessing credibility or trustworthiness.

Some research I’ve been doing (with collaborators at Microsoft and Rutgers) has been looking into this problem of developing cues and filters to enable journalists to better tap into social media. In the rest of this post I’ll preview this forthcoming research, but for all the details you’ll want to see the CHI paper appearing in May and the CSCW paper appearing next month.

With my collaborators I built an application called SRSR (standing for “Seriously Rapid Source Review”) which incorporates a number of advanced aggregations, computations, and cues that we thought would be helpful for journalists to find and assess sources in Twitter around breaking news events. And we didn’t just build the system; we also evaluated it on two breaking news scenarios with seven super-star social media editors at leading local, national, and international news outlets.

The features we built into SRSR were informed by talking with many journalists and include facilities to filter and find eyewitnesses and archetypical user-types, as well as to characterize sources according to their implicit location, network, and past content. The SRSR interface allows the user to quickly scan through potential sources and get a feeling for whether they’re more or less credible and if they might make good sources for a story. Here’s a snapshot showing some content we collected and processed around the Tottenham riots.

Automatically Identifying Eyewitnesses
A core feature we built into SRSR was the ability to filter sources based on whether or not they were likely to be eyewitnesses. To determine if someone was an eyewitness we built an automatic classifier that looks at the text content shared by a user and compares it to a dictionary of over 700 key terms relating to perception, seeing, hearing, and feeling – the kind of language you would expect from eyewitnesses. If a source uses one of the key terms then we label them as a likely eyewitness. Even using this relatively simple classifier we got fairly accurate results: precision was 0.89 and recall was 0.32. This means that if a source uses one of these words it’s highly likely they really are an eyewitness to the event, but that there were also a number of eyewitnesses who didn’t use any of these key words (thus the lower recall score). Being able to rapidly find eyewitnesses with first-hand information was one of the most liked features in our evaluation. In the future there’s lots we want to do to make the eyewitness classifier even more accurate.
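The dictionary-lookup approach can be sketched in a few lines. Note that the term list below is a tiny illustrative stand-in: the actual SRSR dictionary contains over 700 perception-related terms and is not reproduced here.

```python
from typing import Iterable

# A tiny illustrative subset of perception-related terms; the real
# SRSR dictionary has 700+ such terms covering seeing, hearing, and feeling.
EYEWITNESS_TERMS = {
    "saw", "see", "heard", "hear", "felt", "feel",
    "smoke", "explosion", "shaking", "witnessed",
}

def is_likely_eyewitness(tweets: Iterable[str]) -> bool:
    """Label a source as a likely eyewitness if any of their tweets
    contains at least one perception-related key term."""
    for tweet in tweets:
        words = set(tweet.lower().split())
        if words & EYEWITNESS_TERMS:  # any overlap with the dictionary
            return True
    return False
```

A classifier this simple trades recall for precision, which matches the numbers above: a match is strong evidence, but many true eyewitnesses never use any dictionary term.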

Automatically Identifying User Archetypes
Since different types of users on Twitter may produce different kinds of information we also sought to segment users according to some sensible archetypes: journalists/bloggers, organizations, and “ordinary” people. For instance, around a natural hazard news event, organizations might share information about marshaling public resources or have links to humanitarian efforts, whereas “ordinary” people are more likely to have more eyewitness information. We thought it could be helpful to journalists to be able to rapidly classify sources according to these information archetypes and so we built an automatic classifier for these categories. All of the details are in the CSCW paper, but we basically got quite good accuracy with the classifier across these three categories: 90-95%. Feedback in our evaluation indicated that rapidly identifying organizations and journalists was quite helpful.
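To make the three-way segmentation concrete, here is a rough keyword heuristic over a source’s Twitter bio. This is purely an illustrative assumption on my part: the actual classifier described in the CSCW paper is trained on richer features, and these keyword lists are invented for the sketch.

```python
def classify_archetype(profile_description: str) -> str:
    """Assign a source to one of three archetypes based on bio keywords.
    The cue lists below are hypothetical; the real classifier uses
    trained features rather than hand-picked keywords."""
    bio = profile_description.lower()
    journalist_cues = ("journalist", "reporter", "blogger", "editor", "correspondent")
    organization_cues = ("official", "organization", "agency", "nonprofit")
    if any(cue in bio for cue in journalist_cues):
        return "journalist/blogger"
    if any(cue in bio for cue in organization_cues):
        return "organization"
    return "ordinary"
```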

Visually Cueing Location, Network, Entities
We also developed visual cues that were designed to help journalists assess the potential verity and credibility of a source based on their profile. In addition to showing the location of the source, we normalized and aggregated locations within a source’s network. In particular we looked at the “friends” of a source (i.e. people that I follow and that follow me back) and show the top three most frequent locations in that network. This gives a sense of where this source knows people and has their social network. So even if I don’t live in London, if I know 50 people there it suggests I have a stake in that location or may have friends or other connections to that area that make me knowledgeable about it. Participants in our evaluation really liked this cue as it gives a sense of implicit or social location.
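The aggregation itself is a straightforward top-k frequency count once friend locations have been normalized. A minimal sketch, assuming location strings arrive already normalized to a canonical form:

```python
from collections import Counter

def top_friend_locations(friend_locations, k=3):
    """Return the k most frequent locations among a source's friends
    (mutual-follow connections). Assumes each string has already been
    normalized (e.g. 'london, uk' -> 'London'); empty profile fields
    are skipped."""
    counts = Counter(loc for loc in friend_locations if loc)
    return [loc for loc, _ in counts.most_common(k)]
```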

We also show a small sketch of the network of a source indicating who has shared relevant event content and is also following the source. This gives a sense of whether many people talking about the news event are related to the source. Journalists in our evaluation indicated that this was a nice credibility cue. For instance, if the Red Cross is following a source that’s a nice positive indicator.

Finally, we aggregated the top five most frequent entities (i.e. references to corporations, people, or places) that a source mentioned in their Twitter history (we were able to capture about 1000 historical messages for each person). The idea was that this could be useful to show what a source talks about, but in reality our participants didn’t find this feature that useful for the breaking news scenarios they were presented with. Perhaps in other scenarios it could still be useful?

What’s Next
While SRSR is a nice step forward there’s still plenty to do. For one, our prototype was not built for real-time events and was tested with pre-collected and processed data due to limitations of the Twitter API (hey Twitter, give me a call!!). And there’s plenty more to think about in terms of enhancing the eyewitness classifier, exploring different ways to use network information to spider out in search of sources, and experimenting with how such a tool can be used to cover different kinds of events.

Again, for all the gory details on how these features were built and tested you can read our research papers. Here are the full references:

  • N. Diakopoulos, M. De Choudhury, M. Naaman. Finding and Assessing Social Media Information Sources in the Context of Journalism. Conference on Human Factors in Computing Systems (CHI). May, 2012. [PDF]
  • M. De Choudhury, N. Diakopoulos, M. Naaman. Unfolding the Event Landscape on Twitter: Classification and Exploration of User Categories. Proc. Conference on Computer Supported Cooperative Work (CSCW). February, 2012. [PDF]


HCI’s Teachings on Transparency II

In this post I’ll continue trying to glean knowledge from the study of transparency of interactive systems in HCI, which I began in an earlier post.

Back in the mid-1990s there was a flurry of activity in HCI in trying to understand the explainability and transparency of interactive systems. Paul Dourish published extensively in the area and is known for his book, Where the Action Is: The Foundations of Embodied Interaction, which (among other things) connects ideas from ethnomethodology with those of technology and system transparency.

A key concept studied in relation to ethnomethodology is that of accountability, meaning “observable and reportable” or able to be made sense of in the context in which an action arises. It addresses not just the result or outcome of an action but also includes how the result was achieved. Dourish sums it up thus, “Put simply it says that because we know that people don’t just take things at face value but attempt to interrogate them for their meaning, we should provide some facilities so that they can do the same thing with interactive systems. Even more straightforwardly, it’s a good idea to build systems that tell you what they’re doing.”

An account then is something that provides accountability in a software interface. The goal of an account is to provide some explanation for how the sequence of actions up to a moment results in a system’s current configuration. Why did each action in the interface affect the state in the way that it did? This is extremely similar to the notion of the transparency of mechanics that I developed in a previous post. Too bad Dourish beat me by a decade or so.

In his paper, Accounting for System Behavior: Representation, Reflection and Resourceful Action, Dourish posits a compelling definition for an account: “Accounts are causally-connected representations of system action which systems offer as explications of their own activity. They are inherently partial and variable, selectively highlighting and hiding aspects of the inherent structure of the systems they represent.” The notion of partiality of accounts is troubling with respect to journalistic transparency since information exclusion entails a danger of bias. But journalistic transparency can be maintained even in partiality if decisions about inclusion / exclusion are explicated. Decisions about inclusion / exclusion can however also be made algorithmically, which confounds the problem for interactive systems. The classic example is in the (lack of) transparency of ranking algorithms used in online search engines.

Another connection that I see to journalistic notions of transparency is that accounts are context sensitive: more general statements of transparency are less context specific whereas less general statements embedded in the actual context of the running system are highly context specific. “The account that matters is one that is good enough for the needs and purposes at hand, in the circumstances in which it arises and for those who are involved in the activity,” writes Dourish in Where the Action Is. What are the needs of the user in some particular situation? A journalist writing interactive software would need to answer the question: “What states need to be observable?”.

Furthermore, in journalism, transparency happens at varying degrees and levels of granularity and is thought of in a practical light where, for instance, it would not make sense to be transparent about all of a reporter’s notes in a newspaper since there are space constraints. Practicality, efficiency of communication, and usability of an interface can be subverted if everything must be transparent. What is the appropriate level of transparency, both mechanical and journalistic, for interactive games and infographics?

Johnson and Johnson have also written about another important facet of transparency that is relevant here. The nature of the knowledge being made transparent, whether declarative or procedural, can have an impact on how that transparency is presented. Is it easily citable or does a complex process need to be explicated? I think this gets manifested in journalistic transparency as a difference between transparency of reference and transparency of construction.