
Finding News Sources in Social Media

Whether it’s terrorist attacks in Mumbai, a plane crash landing on the Hudson River, or videos and reactions from a recently capsized cruise ship in Italy, social media has proven itself again and again to be a huge boon to journalists covering breaking news events. But at the same time, the prodigious amount of social media content posted around news events creates a challenge for journalists trying to find interesting and trustworthy sources in the din. A few recent efforts have looked at automatically identifying misinformation on Twitter, or automatically assessing credibility, though pure automation carries the risk of cutting human decision makers completely out of the loop. There aren’t many general purpose (or accessible) solutions out there for this problem either; services like Klout help identify topical authorities, and Storify and Storyful help in assembling social media content, but don’t offer additional cues for assessing credibility or trustworthiness.

Some research I’ve been doing (with collaborators at Microsoft and Rutgers) has been looking into this problem of developing cues and filters to enable journalists to better tap into social media. In the rest of this post I’ll preview this forthcoming research, but for all the details you’ll want to see the CHI paper appearing in May and the CSCW paper appearing next month.

With my collaborators I built an application called SRSR (standing for “Seriously Rapid Source Review”) which incorporates a number of advanced aggregations, computations, and cues that we thought would be helpful for journalists to find and assess sources in Twitter around breaking news events. And we didn’t just build the system, we also evaluated it on two breaking news scenarios with seven super-star social media editors at leading local, national, and international news outlets.

The features we built into SRSR were informed by talking with many journalists and include facilities to filter and find eyewitnesses and archetypical user-types, as well as to characterize sources according to their implicit location, network, and past content. The SRSR interface allows the user to quickly scan through potential sources and get a feeling for whether they’re more or less credible and if they might make good sources for a story. Here’s a snapshot showing some content we collected and processed around the Tottenham riots.

Automatically Identifying Eyewitnesses
A core feature we built into SRSR was the ability to filter sources based on whether or not they were likely to be eyewitnesses. To determine if someone was an eyewitness we built an automatic classifier that looks at the text content shared by a user and compares it to a dictionary of over 700 key terms relating to perception, seeing, hearing, and feeling – the kind of language you would expect from eyewitnesses. If a source uses one of the key terms then we label them as a likely eyewitness. Even using this relatively simple classifier we got fairly accurate results: precision was 0.89 and recall was 0.32. This means that if a source uses one of these words it’s highly likely they are really an eyewitness to the event, but that there were also a number of eyewitnesses who didn’t use any of these key words (thus the lower recall score). Being able to rapidly find eyewitnesses with first-hand information was one of the most liked features in our evaluation. In the future there’s lots we want to do to make the eyewitness classifier even more accurate.
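To make the idea concrete, here is a minimal sketch of a dictionary-based eyewitness filter together with the precision and recall calculation. It is not the SRSR implementation: the handful of terms in `PERCEPTION_TERMS` is a made-up stand-in for the real dictionary of over 700 terms, and the function names and simple tokenizer are assumptions for illustration.

```python
import re

# Hypothetical stand-in for the real dictionary of 700+ perception terms.
PERCEPTION_TERMS = {"see", "saw", "seeing", "hear", "heard", "hearing", "feel", "felt", "watching"}


def is_likely_eyewitness(messages, terms=PERCEPTION_TERMS):
    """Label a source as a likely eyewitness if any message uses any dictionary term."""
    for text in messages:
        tokens = re.findall(r"[a-z']+", text.lower())
        if any(token in terms for token in tokens):
            return True
    return False


def precision_recall(predicted, actual):
    """Compute precision and recall from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall
```

In this framing, high precision with low recall is what you would expect from a "uses at least one term" rule: a match is a strong signal, but plenty of real eyewitnesses never use a dictionary word.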

Automatically Identifying User Archetypes
Since different types of users on Twitter may produce different kinds of information we also sought to segment users according to some sensible archetypes: journalists/bloggers, organizations, and “ordinary” people. For instance, around a natural hazard news event, organizations might share information about marshaling public resources or have links to humanitarian efforts, whereas “ordinary” people are more likely to have more eyewitness information. We thought it could be helpful to journalists to be able to rapidly classify sources according to these information archetypes and so we built an automatic classifier for these categories. All of the details are in the CSCW paper, but we basically got quite good accuracy with the classifier across these three categories: 90-95%. Feedback in our evaluation indicated that rapidly identifying organizations and journalists was quite helpful.

Visually Cueing Location, Network, Entities
We also developed visual cues that were designed to help journalists assess the potential verity and credibility of a source based on their profile. In addition to showing the location of the source, we normalized and aggregated locations within a source’s network. In particular we looked at the “friends” of a source (i.e. people that I follow and that follow me back) and show the top three most frequent locations in that network. This gives a sense of where this source knows people and has their social network. So even if I don’t live in London, if I know 50 people there it suggests I have a stake in that location or may have friends or other connections to that area that make me knowledgeable about it. Participants in our evaluation really liked this cue as it gives a sense of implicit or social location.
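The network-location cue can be sketched roughly as follows. This is an illustration rather than the code behind SRSR: the function name and arguments are hypothetical, and lowercasing plus trimming whitespace is a crude stand-in for whatever location normalization the real system performs.

```python
from collections import Counter


def top_network_locations(following, followers, locations, k=3):
    """Return the k most frequent normalized locations among mutual follows.

    following/followers are iterables of user ids; locations maps a user id
    to the free-text location from that user's profile (may be missing).
    """
    friends = set(following) & set(followers)  # people I follow who follow me back
    counts = Counter(
        locations[user].strip().lower()  # crude normalization (assumption)
        for user in friends
        if locations.get(user)
    )
    return [place for place, _ in counts.most_common(k)]
```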

We also show a small sketch of the network of a source indicating who has shared relevant event content and is also following the source. This gives a sense of whether many people talking about the news event are related to the source. Journalists in our evaluation indicated that this was a nice credibility cue. For instance, if the Red Cross is following a source that’s a nice positive indicator.

Finally, we aggregated the top five most frequent entities (i.e. references to corporations, people, or places) that a source mentioned in their Twitter history (we were able to capture about 1000 historical messages for each person). The idea was that this could be useful to show what a source talks about, but in reality our participants didn’t find this feature that useful for the breaking news scenarios they were presented with. Perhaps in other scenarios it could still be useful?

What’s Next
While SRSR is a nice step forward there’s still plenty to do. For one, our prototype was not built for real-time events and was tested with pre-collected and processed data due to limitations of the Twitter API (hey Twitter, give me a call!!). And there’s plenty more to think about in terms of enhancing the eyewitness classifier, thinking about different ways to use network information to spider out in search of sources, and to experiment with how such a tool can be used to cover different kinds of events.

Again, for all the gory details on how these features were built and tested you can read our research papers. Here are the full references:

  • N. Diakopoulos, M. De Choudhury, M. Naaman. Finding and Assessing Social Media Information Sources in the Context of Journalism. Conference on Human Factors in Computing Systems (CHI). May, 2012. [PDF]
  • M. De Choudhury, N. Diakopoulos, M. Naaman. Unfolding the Event Landscape on Twitter: Classification and Exploration of User Categories. Proc. Conference on Computer Supported Cooperative Work (CSCW). February, 2012. [PDF]


News Headlines and Retweets

How do you maximize the reach and engagement of your tweets? This is a hugely important question for companies who want to maximize the value of their content. There are even start-ups, like Social Flow, that specialize in optimizing the “engagement” of tweets by helping to time them appropriately. A growing body of research is also looking at what factors, both of the social network and of the content of tweets, impact how often tweets get retweeted. For instance, some of this research has indicated that tweets are more retweeted when they contain URLs and hashtags, when they contain negative or exciting and intense sentiments, and when the user has more followers. Clearly time is important too and different times of day or days of week can also impact the amount of attention people are paying to social media (and hence the likelihood that something will get retweeted).

But aside from the obvious thing of growing their follower base, what can content creators like news organizations do to increase the retweetability of their tweets? Most news organizations basically tweet out headlines and links to their stories. And that delicate choice of words in writing a headline has always been a bit of a skill and an art. But with lots of data now we can start being a bit more scientific by looking at what textual and linguistic features of headlines tend to be associated with higher levels of retweets. In the rest of this post I’ll present some data that starts to scratch at the surface of this.

I collected all tweets from the @nytimes twitter account between July 1st, 2011 and Sept. 30th, 2011 using the Topsy API. I wanted to analyze somewhat older tweets to make sure that retweeting had run its natural course and that I wasn’t truncating the retweeting behavior. Using data from only one news account has the advantage that it controls for the network and audience and allows me to focus purely on textual features. In all I collected 5101 tweets, including how many times each tweet was retweeted (1) using the built-in retweet button and (2) using the old syntax of “RT @username”. Of these tweets, 93.7% contained links to NYT content, 1.0% contained links to other content (e.g. yfrog, instagram, or government information), and 0.7% were retweets themselves. The remaining 4.6% of tweets in my sample had no link.

The first thing I looked at was what the average number of retweets was for the tweets in each group (links to NYT content, links to other content, and no links).

  • Average # of RTs for tweets with links to NYT content: 48.0
  • Average # of RTs for tweets with links to other content: 48.1
  • Average # of RTs for tweets with no links: 83.8

This is interesting because some of the best research out there suggests that tweets WITH links get more RTs. But I found just the opposite: tweets with NO LINKS got more RTs (1.74 times as many on average). I read through the tweets with no links (there are only 234) and they were mostly breaking news alerts like “Qaddafi Son Arrested…“, “Dow drops more than 400 points…“, or “Obama and Boehner Close to Major Budget Deal…“. So from the prior research we know that for any old tweet source, URLs are a signal that is correlated with RTs, but for news organizations, the most “newsy” or retweetable information comes in a brief snippet, without a link. The implication is not that news organizations should stop linking their content to get more RTs, but rather that the kind of information shared without links from news organizations (the NYT in particular) is highly retweetable.

To really get into the textual analysis I wanted to look just at tweets with links back to NYT content though. So the rest of the analysis was done on the 4780 tweets with links back to NYT content. If you look at these tweets they basically take the form: <story headline> + <link>. I broke the dataset up into the top and bottom 10% of tweets (deciles) as ranked by their total number of RTs, which includes RTs using the built-in RT button as well as the old style RTs. The overall average # of RTs was 48.3, but in the top 10% of tweets it was 173 and in the bottom 10% it was 7.4. Here’s part of the distribution:

Is length of a tweet related to how often it gets retweeted? I looked at the average length of the tweets (in characters) in the top and bottom 10%.

  • Top 10%: 75.8 characters
  • Bottom 10%: 82.8 characters

This difference is statistically significant using a t-test (t=5.23, p < .0001). So tweets that are in the top decile of RTs are shorter, on average, by about 7 characters. This isn’t prescriptive, but it does suggest an interesting correlation that headline / tweet writers for news organizations might consider exploring.
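For readers who want to run the same kind of comparison on their own data, here is a small sketch of a two-sample t statistic on tweet lengths. It assumes Welch's unequal-variance form, which may not be the exact variant used above, and the function name is hypothetical. It returns only the t statistic; getting a p-value would need the t distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means of two samples."""
    var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances
    n_a, n_b = len(sample_a), len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(var_a / n_a + var_b / n_b)
```

Here `sample_a` and `sample_b` would be the character lengths of the tweets in the bottom and top deciles, e.g. `[len(t) for t in bottom_tweets]`.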

I also wanted to get a feel for what words were used more frequently in either the top or bottom deciles. To do this I computed the frequency distribution of words for each dataset (i.e. how many times each unique word was used across all the tweets in that decile). Then for each word I computed a ratio indicating how frequent it was in one decile versus the other. If this ratio is above 1 then it indicates that that word is more likely to occur in one decile than the other. I’ve embedded the data at the end of this post in case you want to see the top 50 words ranked by their ratio for both the top and bottom deciles.
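A rough sketch of this word-ratio computation is below. The details are assumptions rather than the exact procedure used for the lists in this post: the tokenizer is a simple regular expression, counts are converted to relative frequencies so deciles of different sizes are comparable, and add-one smoothing is applied so that a word missing from one decile does not cause a division by zero.

```python
import re
from collections import Counter


def word_counts(tweets):
    """Count how many times each lowercase word occurs across a list of tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def frequency_ratios(tweets_a, tweets_b, smoothing=1.0):
    """For each word, ratio of its smoothed relative frequency in A versus B.

    A ratio above 1 means the word is relatively more frequent in A.
    """
    counts_a, counts_b = word_counts(tweets_a), word_counts(tweets_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    return {
        word: ((counts_a[word] + smoothing) / total_a)
        / ((counts_b[word] + smoothing) / total_b)
        for word in vocab
    }
```

Sorting the result with `sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:50]` would give a top-50 list for decile A; swapping the arguments gives the list for decile B.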

From scanning the word lists you can see that pronouns (e.g. “I, you, my, her, his, he” etc.) are used more frequently in tweets from the bottom decile of RTs. Tweets that were in the top decile of RTs were more likely to use words relating to crime (e.g. “police”, “dead”, “arrest”), natural hazards (“irene”, “hurricane”, “earthquake”), sports (“soccer”, “sox”), or politically contentious issues (e.g. “marriage” likely referring to the legalization of gay marriage in NY). I thought it was particularly interesting that “China” was much more frequent in highly RTed tweets. To be clear, this is just scratching the surface and I think there’s a lot more interesting research to do around this, especially relating to theories of attention and newsworthiness.

The last bit of data analysis I did was to look at whether certain parts of speech (e.g. nouns, verbs, adjectives) were used differently in the top and bottom RT deciles. More specifically I wanted to know: Are different parts of speech used more frequently in one group than the other? To do this, I used a natural language processing toolkit (NLTK) and computed the parts of speech (POS) of all of the words in the tweets. Of course this isn’t a perfect procedure and sometimes the POS tagger makes mistakes, but I consider this analysis preliminary. I calculated the Chi-Square test to see if there was a statistical difference in the frequency of adjectives, nouns, adverbs, conjunctions (e.g. “and”, “but”, etc.), determiners (e.g. “a”, “some”, “the”, etc.), pronouns, and verbs used in either the top or bottom 10% of RTs. What I found is that there is a strong statistically significant difference for adverbs (p < .02), determiners (p < .001), and verbs (p < .003), and somewhat of a difference for conjunctions (p = .06). There was no difference in usage for adjectives, nouns, or pronouns. Basically what this boils down to is that, in tweets that get lots of RTs, adverbs, determiners (and conjunctions somewhat) are used substantially less, while verbs are used substantially more. Perhaps it’s the less frequent use of determiners and adverbs that (as described above) makes these tweets shorter on average. Again, this isn’t prescriptive, but there may be something here in terms of how headlines are written. More use of verbs, and less use of “empty” determiners and conjunctions in tweets is correlated with higher levels of retweeting. Could it be the case that action words (i.e. verbs) somehow spur people to retweet the headline? Pinning down the causality of this is something I’ll be working on next!
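As an illustration of the kind of test involved, here is a sketch of a Pearson chi-square test on a 2x2 table for a single part of speech: words tagged with that POS versus all other words, in the top versus the bottom decile. The table layout is an assumption about how such a test could be set up, not necessarily how the numbers above were computed, and the function name is hypothetical. The tag counts themselves would come from a POS tagger such as NLTK's.

```python
from math import erfc, sqrt


def chi_square_2x2(pos_top, total_top, pos_bottom, total_bottom):
    """Pearson chi-square (no continuity correction) for one POS across two groups.

    pos_* is the number of words with the given tag; total_* is all tagged words.
    Returns the chi-square statistic and its p-value for 1 degree of freedom.
    """
    observed = [
        [pos_top, total_top - pos_top],
        [pos_bottom, total_bottom - pos_bottom],
    ]
    grand_total = total_top + total_bottom
    row_totals = [total_top, total_bottom]
    col_totals = [pos_top + pos_bottom, grand_total - pos_top - pos_bottom]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value
```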

Here are the lists of words I promised. If you find anything else notable, please leave a comment!

Wikileaks and Collaborative Sensemaking?

Thanks to the head start of the likes of the New York Times, the Guardian, and Der Spiegel we now have some excellent written reporting on a few of the more important issues exposed in the WikiLeaks cablegate data. There have also been a number of visualizations of the dataset published in the last few days (e.g. Infosthetics has a nice round-up), which help, to some extent, in browsing and making sense of all of the data there.

But what I want to suggest here is that, with all of the attention that this story is getting, there may be some useful information to mine from social media about what is interesting, important, and noteworthy in the dataset. One of the most useful aspects of social media such as Twitter is that it provides a platform where interested individuals can make observations about what’s going on around them, including observations of large collections of documents.

At Rutgers, where I work, we’ve been developing a social media visual analytics tool called Vox Civitas and have collected a dataset of almost 60,000 (and growing) English language tweets marked with the hashtag “#cablegate” from Twapperkeeper. Vox provides the ability to visualize the collection over time, with sentiment, and includes capabilities for filtering according to many criteria. Without further ado, click here to see the cablegate dataset in Vox. Let us know what you find or if it inspires any follow-up work!