Tag Archives: computational journalism

What’s in a Ranking?

The web is a tangled mess of pages and links. But through the magic of the Google algorithm it becomes a nice and neatly ordered rank of “relevance” to whatever our heart desires. The network may be the architecture of the web, but the human ideology projected on that network is the rank.

Often enough we take rankings at face value; we don’t stop to think about what’s really in a rank. There is tremendous power conferred upon the top N, of anything really, not just search results but colleges, restaurants, or a host of other goods. These are the things that get the most attention and become de facto defaults because they are easier for us to access. In fact we rank all manner of services around us in our communities: schools, hospitals and doctors, even entire neighborhoods. Bloomberg has an entire site dedicated to them. These rankings have implications for a host of decisions we routinely make. Can we trust them to guide us?

Thirty years ago, rankings in the airline reservation systems used by travel agents were regulated by the U.S. government. Such regulation served to limit the ability of operators to “bias travel-agency displays” in a way that would privilege some flights over others. But this regulatory model for reining in algorithmic power hasn’t been applied in other domains, like search engines. It’s worth asking why not and what that regulation might look like, but it’s also worth thinking about alternatives to regulation that we might employ for mitigating such biases. For instance we might design advanced interfaces that transparently signal the various ways in which a rank and the scores and indices on which it is built are constituted.

Consider an example from the local media, the “Best Neighborhoods” app, published by the Dallas Morning News (shown below). It ranks various neighborhoods according to criteria like the schools, parks, commute, and walkability. The default ranking of “overall” though is unclear: How are these various criteria weighted? And how are the various criteria even defined? What does “walkability” mean in the context of this app? If I am looking to invest in property I might be misled by a simplified algorithm; does it really measure the dimensions that are of most importance? While we can interactively re-rank by any of the individual criteria, many people will only see the default ranking anyway. Other neighborhood rankings, like the one from the New Yorker in 2010, do show the weights, but they’re non-interactive.


The notion of algorithmic accountability is something I’ve written about here previously. It’s the idea that algorithms are becoming more and more powerful arbiters of our decision making, both in the corporate world and in government. There’s an increasing need for journalists to think critically about how to apply algorithmic accountability to the various rankings that the public encounters in society, including rankings (like neighborhood rankings) that their own news organizations may publish as community resources.

What should the interface be for an online ranking so that it provides a level of transparency to the public? In a recent project with the IEEE, we sought to implement an interface for end-users to interactively re-weight and visualize how their re-weightings affected a ranking. But this is just the start: there is exciting work to do in human-computer interaction and visualization design to determine the most effective ways to expose rankings interactively in ways that are useful to the public, but which also build credibility. How else might we visualize the entire space of weightings and how they affect a ranking in a way that helps the public understand the robustness of those rankings?
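As a toy illustration of why exposing weights matters, here is a minimal sketch (with entirely invented neighborhoods, scores, and weights, not the Dallas Morning News app’s actual data or formula) of how an “overall” ranking flips under different weightings:

```python
# Hypothetical sketch: how an "overall" neighborhood rank shifts with the
# weights assigned to each criterion. All scores and weights are invented.
neighborhoods = {
    "Lakewood":       {"schools": 0.9, "parks": 0.6, "commute": 0.4, "walkability": 0.7},
    "Oak Cliff":      {"schools": 0.5, "parks": 0.8, "commute": 0.7, "walkability": 0.9},
    "Preston Hollow": {"schools": 0.8, "parks": 0.5, "commute": 0.6, "walkability": 0.3},
}

def rank(weights):
    """Rank neighborhoods by a weighted sum of their criterion scores."""
    overall = {
        name: sum(weights[c] * score for c, score in criteria.items())
        for name, criteria in neighborhoods.items()
    }
    return sorted(overall, key=overall.get, reverse=True)

# The same data produces different "best" neighborhoods under different weightings.
print(rank({"schools": 1.0, "parks": 0.1, "commute": 0.1, "walkability": 0.1}))
print(rank({"schools": 0.1, "parks": 0.1, "commute": 0.1, "walkability": 1.0}))
```

An interactive interface essentially puts sliders on the `weights` dictionary and re-renders the sorted list, making the contingency of the default ranking visible.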

When we start thinking about the hegemony of algorithms and their ability to generalize nationally or internationally there are also interesting questions about how to adapt rankings for local communities. Take something like a local school ranking. Rankings by national or state aggregators like GreatSchools may be useful, but they may not reflect how an individual community would choose to weight or even select criteria for inclusion in a ranking. How might we adapt interfaces or rankings so that they can be more responsive to local communities? Are there geographically local feedback processes that might allow rankings to reflect community values? How might we enable democracy or even voting on local ranking algorithms?

In short, this is a call for more reflection on how to be transparent about the data-driven rankings we create for our readers online. There are research challenges here, in human-centered design, in visualization, and in decision sciences that if solved will allow us to build better and more trustworthy experiences for the public served by our journalism. It’s time to break the tyranny of the unequivocal ranking and develop new modes of transparency for these algorithms.

OpenVis is for Journalists!

Note: A version of the following also appears on the Tow Center blog.

Last week I attended the OpenVis Conference in Boston, a smorgasbord of learning dedicated to exploring the use and application of data visualization on the open web, so basically not using proprietary standards. It was hard not to get excited, with a headline keynote like Mike Bostock, the original creator of the popular D3 library for data visualization and now a graphics editor at the New York Times.

Given that news organizations are leading the way with online data storytelling, it was perhaps unsurprising to find a number of journalists presenting at the conference. Kennedy Elliot of the Washington Post talked about coding for the news, imploring attendees to think more like journalists. And we also heard from Lisa Strausfeld and Christopher Cannon who run the new Bloomberg Visual Data lab, and from Lena Groeger at ProPublica who spoke about “thinking small” in visualization.

But even the less overtly journalistic talks somehow seemed to have strong ties and implications for journalism, on everything from storytelling and authoring tools to analytic methodologies. Let me pick on just a few talks that exposed some particularly relevant implications for data journalism.

First up, David Mimno, a professor at Cornell, gave a tour of his work in visualizing machine learning algorithms online to help students learn how those algorithms work. He demonstrated old classics like k-means and linear regression, but the algorithms became palpable as they came to life through animated visualizations. Another example of this comes from the machine learning demos page, which animates and presents an even greater number of algorithms. Where I think this gets really important for journalists is with the whole idea of algorithmic accountability, and the ability to use visualization as a way for journalists to be transparent about the algorithms they use in their reporting.

A good example of where this is already happening is the explanation of the NYT4thDownBot where authors Brian Burke and Kevin Queally use a visualization of a football field (shown below) to explain how their predictive model differs from what actual football coaches tend to do. To the extent that algorithms are deserving of our scrutiny, visualization methods to communicate what they are doing and to somehow make them legible to the public seems incredibly powerful and important for us to work more on.

Alexander Howard recently wrote about “the difficult, complicated process of reporting on data as a source” while being as open and transparent as possible. If there’s one thing the recent launch of 538 has taught us, it’s that there’s a need (and demand) to make the data, and even the code or models, available for data journalism projects.

People are already developing workflows and tools to make this possible online. Another great talk at OpenVis was by Dr. Jake Vanderplas, an astrophysicist working at the University of Washington, who has developed some really amazing open source technology that lets you create interactive D3 visualizations in the browser directly from IPython notebooks. Jake’s work on visualization takes us one step closer to enabling a complete end-to-end workflow for data journalists: data, analysis, and code can sit in the browser and directly render interactive visualizations for the end user. The whole stack is transparent and could potentially even enable the user to tweak, tune, or test variations. To the extent that reproducibility of data journalism projects becomes important to maintain the trust of the audience these sorts of platforms are certainly worth learning more about.

Because of its emphasis on openness, its relationship to transparency, and the desire to create news content online, expect OpenVis to continue to develop next year as a key destination for journalists looking to learn more about visualization.

The Future of Automated Story Production

Note: this is cross-posted on the CUNY Tow-Knight Center for Entrepreneurial Journalism site. 

Recently there’s been a surge of interest in automatically generating news stories. The poster child is a start-up called Narrative Science which has earned coverage by the likes of the New York Times, Wired, and numerous blogs for its ability to automatically produce actual, readable stories of things like sports games or companies’ financial reports based on nothing more than numeric data. It’s impressive stuff, but it doesn’t stop me from thinking: What’s next? In the rest of this post I’ll talk about some challenges, such as story schema and modality, data context, and text transparency, that could improve future story generation engines.

Without inside information we can’t say for sure exactly how Narrative Science (NS) works, though there are some academic systems out there that provide a suitable analogue for description. There are two main phases that have to be automated in order to produce a story this way: the analysis phase and the generative phase. In the analysis phase, numeric data is statistically analyzed for things like trends, clusters, patterns, and outliers or exceptions. The analysis phase also includes the challenging aspect of condensing or selecting the most interesting things to include in the story (see Ramesh Jain’s “Extreme Stories” for more on this).
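To make the analysis phase concrete, here is a minimal sketch, emphatically not Narrative Science’s actual method, of flagging statistical outliers in invented game data as candidate story angles:

```python
# A minimal sketch of the "analysis phase": scan numeric data for outliers
# worth mentioning in a story. The player data and threshold are invented.
from statistics import mean, stdev

points_per_game = {"Smith": 12, "Jones": 14, "Lee": 11, "Garcia": 31, "Chen": 13}

def find_outliers(stats, z_threshold=1.5):
    """Flag values more than z_threshold standard deviations from the mean."""
    mu, sigma = mean(stats.values()), stdev(stats.values())
    return [name for name, v in stats.items() if abs(v - mu) / sigma > z_threshold]

# Outliers become candidate story angles ("Garcia's 31 points led the team...").
print(find_outliers(points_per_game))  # ['Garcia']
```

A real system would run many such detectors (trends, streaks, comparisons to season averages) and then rank the findings by interestingness before handing them to the generative phase.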

After analysis and selection comes the task of figuring out an interesting structure to order the information in the story: a schema. Narrative Science differentiates itself primarily, I think, by paying close attention to the structure of the stories it generates. Many of the precursors to NS were stuck in the mode of presenting generated text in a chronological schema, which, as we know, is quite boring for most stories. Storytelling is really all about structure: providing the connections between aspects of the story, its actors and setting, using some rhetorical ordering that makes sense for and engages the reader. There are whole books written on how to effectively structure stories to explore different dramatic arcs or genres. Many of these different story structures have yet to be encoded in algorithms that generate text from data, so there’s lots of room for future story generation engines to explore diverse text styles, genres, and dramatic arcs.

It’s also important to remember that text has limitations on the structures and the schema it supports well. A textual narrative schema might draw readers in, but, depending on the data, a network schema or a temporal schema might expose different aspects of a story that aren’t apparent, easy, or engaging to represent in text. This leads us to another opportunity for advancement in media synthesis: better integration of textual schema with visualization schemas (e.g. temporal, hierarchical, network). For instance, there may be complementary stories (e.g. change over time, comparison of entities) that are more effectively conveyed through dynamic visualizations than through text. Combining these two modalities has been explored in some research but there is much work to do in thinking about how best to combine textual schema with different visual schema to effectively convey a story.

There has also been recent work looking into how data can be used to generate stories in the medium of video. This brings with it a whole slew of challenges different from text generation, such as the role of audio, and how to crop and edit existing video into a coherent presentation. So, in addition to better incorporating visualization into data-driven stories I think there are opportunities to think about automatically composing stories from such varied modalities as video, photos, 3D, games, or even data-based simulations. If you have the necessary data for it, why not include an automatically produced simulation to help communicate the story?

It may be surprising to know that text generation from data has actually been around for some time now. The earliest reference that I found goes back 26 years to a paper that describes how to automatically create written weather reports based on data. And then ten years ago, in 2002, we saw the launch of Newsblaster, a complex news summarization engine developed at Columbia University that took articles as a data source and produced new text-based summaries using articles clustered around news events. It worked all right, though starting from text as the data has its own challenges (e.g. text understanding) that you don’t run into if you’re just using numeric data. The downside of using just numeric data is that it is largely bereft of context. One way to enhance future story generation engines could be to better integrate text generated by numeric data together with text (collected from clusters of human-written articles) that provides additional context.

The last opportunity I’d like to touch on here relates to the journalistic ideal of transparency. I think we have a chance to embed this ideal into algorithms that produce news stories, which often articulate a communicative intent combined with rules or templates that help achieve that intent. It is largely feasible to link any bit of generated text back to the data that gave rise to that statement – in fact it’s already done by Narrative Science in order to debug their algorithms. But this linking of data to statement should be exposed publicly. In much the same way that journalists often label their graphics and visualizations with the source of their data, text generated from data should source each statement. Another dimension of transparency practiced by journalists is to be up-front about the journalist’s relationship to the story (e.g. if they’re reporting on a company that they’re involved with). This raises an interesting and challenging question of self-awareness for algorithms that produce stories. Take for instance this Forbes article produced by Narrative Science about New York Times Co. earnings. The article contains a section on “competitors”, but the NS algorithm isn’t smart enough or self-aware enough to know that it itself is an obvious competitor. How can algorithms be taught to be transparent about their own relationships to stories?
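As a sketch of what statement-level sourcing might look like, here is a hypothetical generator (the company, metric, and source citation are all invented) that attaches provenance to each sentence it emits:

```python
# Hypothetical sketch: each generated sentence carries provenance back to the
# data that produced it, so the link could be exposed to readers, not just
# used internally for debugging.
def generate_statement(company, metric, value, source):
    """Fill a simple template and record where the underlying number came from."""
    text = f"{company}'s {metric} came in at {value}."
    return {"text": text,
            "provenance": {"field": metric, "value": value, "source": source}}

stmt = generate_statement("Example Corp", "quarterly revenue", "$1.2M",
                          "10-Q filing, line 4")
print(stmt["text"])
print(stmt["provenance"]["source"])
```

Rendering the provenance as a footnote or hover annotation on each sentence would give generated stories the same kind of sourcing that journalists already attach to their graphics.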

There are tons of exciting opportunities in the space of media synthesis. Challenges like exploring different story structures and schemas, providing and integrating context, and embedding journalistic ideals such as transparency will keep us more than busy in the years and, likely, decades to come.

Cultivating the Landscape of Innovation in Computational Journalism

For the last several months I’ve been working on a whitepaper for the CUNY Tow-Knight Center for Entrepreneurial Journalism. It’s about cultivating more technical innovation in journalism and involves systematically mapping out what’s been done (in terms of research) as well as outlining a method for people to generate new ideas in computational journalism. I’m happy to say that the paper was published by the Tow-Knight Center today. You can get Jeff Jarvis’ take on it on the Tow-Knight blog, or for more coverage you can see the Nieman Lab write-up. Or go straight for the paper itself.

Systematic Technical Innovation in Journalism

The idea that innovation can be an organized, systematic search for change is not new — Peter Drucker wrote about it over 25 years ago in his book Innovation and Entrepreneurship — and I’m fairly certain he wasn’t the first. Systematic innovation is about methodically surveying a landscape of potential innovation while also analyzing the potential economic or social value of innovations. For the last several months I’ve been working with the CUNY Graduate School for Journalism on developing a process to systematically explore the potential for technical innovation in journalism. My hope is that this can spur new ideas and growth in Computational Journalism. In the rest of this post I’ll describe how the process is developing and provide some initial feedback we’ve gotten on how it’s working.

One way to look at innovation is in terms of problem solving: (1) what’s the problem or what’s needed, and (2) how do you reify the solution. Sure, technical innovation is not the only kind of innovation, but here my focus of “how to make it happen” will be computing. The problems and needs that I’m focused on are further constrained by the domain, journalism, and include aspects of what news consumers need and want, what news producers (e.g. professional journalists, but also others) need and want, and how value is added to information during the production process.

My basic premise is that if we can identify and enumerate concrete concepts related to needs/wants and technical solutions, then we can systematically combine different concepts to arrive at new ideas for innovation. This is the core idea of combinatorial creativity:  mashing up concepts in novel juxtapositions often sparks new ideas. Drawing on lots of research and, when possible, theory, I developed a concept space which includes 27 computing and technology concepts (e.g. natural user interfaces, computer vision, data mining, etc.), 15 needs and goals that journalists or news consumers typically have with information / media (e.g. storytelling, sensemaking, staying informed, etc.), and 14 information processes that are used to increase the value of information (e.g. filtering, ordering, summarization, etc.). That amounts to 56 concepts across four main categories (computing and technology, news consumer needs, journalism goals, and information processes).

To make the creative combination of ideas more engaging I produced and printed concept cards using Moo, which were color-coded based on their main category. Each card has a concept and brief description; here’s what they look like:

Brainstorming could happen in a lot of different ways, but for a start I decided to have groups of three people with each person randomly picking a card, one card from computing and technology and two cards from the other main categories. Then the goal is to generate as many different ideas as possible for products or services that combine those three concepts in some time-frame (say 5 minutes). A recorder in the group keeps track of the concept cards drawn and all of the ideas generated so that they can be discussed later.
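The draw described above is easy to sketch in code; the concept lists below are abbreviated stand-ins for the full 56-concept deck, not the complete card set:

```python
# A quick sketch of the card draw: one technology card plus one card from
# each of two randomly chosen other categories. Lists are abbreviated.
import random

decks = {
    "computing": ["natural user interfaces", "computer vision", "data mining"],
    "consumer needs": ["staying informed", "sensemaking"],
    "journalism goals": ["storytelling", "verification"],
    "information processes": ["filtering", "ordering", "summarization"],
}

def draw_hand(rng=random):
    """Draw a 3-concept prompt for one brainstorming round."""
    tech = rng.choice(decks["computing"])
    other_cats = rng.sample([c for c in decks if c != "computing"], 2)
    return [tech] + [rng.choice(decks[c]) for c in other_cats]

print(draw_hand())  # e.g. a 3-concept prompt for a 5-minute brainstorm
```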

The process seems to be working. Earlier this week in Jeff Jarvis’ entrepreneurial journalism class I spent some time lecturing on the different concepts and then had students break into 5 groups of 3 to play the brainstorming “game”, which looked something like this:

The reaction was largely positive, with at least one student exclaiming that she really liked the exercise, and another acknowledging that there were some good ideas coming out of having to think about (and apply) combinations of concepts that they hadn’t necessarily thought of before.

In a series of three 5-minute rounds of brainstorming, the five groups generated 54 ideas in total, for an average of 3.6 ideas per group per round. Of course there was some variability between groups and most groups needed a round to warm up, but there were definitely some 5-star ideas generated. Some of the ideas were for general products or services, but some were also about how technologies could enable new kinds of stories to be told — editorial creativity. For instance, an idea for a general platform was to produce 3D virtual recreations of accident spots to help viewers get a better experience of why that spot could be dangerous. Another idea was to develop an app where citizen journalists could sign up and be automatically alerted when an incident occurs near their location. On the editorial creativity side of things, some ideas included using motion capture technology to recreate crime scenes or analyses, or to illustrate workplace injuries from repetitive stress. Not all of these things would make tons of money or generate millions of clicks, but that’s not the point — for now the point is to get people thinking in new directions.

We’re still thinking about ways to improve the process, like adding pressure, constraints, or context. And generating lots of ideas is good, but step two is to think about winnowing and how to assess feasibility and quality of ideas. Stay tuned as this continues to evolve…

Finding News Sources in Social Media

Whether it’s terrorist attacks in Mumbai, a plane crash landing on the Hudson River, or videos and reactions from a recently capsized cruise ship in Italy, social media has proven itself again and again to be a huge boon to journalists covering breaking news events. But at the same time, the prodigious amount of social media content posted around news events creates a challenge for journalists trying to find interesting and trustworthy sources in the din. A few recent efforts have looked at automatically identifying misinformation on Twitter, or automatically assessing credibility, though pure automation carries the risk of cutting human decision makers completely out of the loop. There aren’t many general purpose (or accessible) solutions out there for this problem either; services like Klout help identify topical authorities, and Storify and Storyful help in assembling social media content, but don’t offer additional cues for assessing credibility or trustworthiness.

Some research I’ve been doing (with collaborators at Microsoft and Rutgers) has been looking into this problem of developing cues and filters to enable journalists to better tap into social media. In the rest of this post I’ll preview this forthcoming research, but for all the details you’ll want to see the CHI paper appearing in May and the CSCW paper appearing next month.

With my collaborators I built an application called SRSR (standing for “Seriously Rapid Source Review”) which incorporates a number of advanced aggregations, computations, and cues that we thought would be helpful for journalists to find and assess sources in Twitter around breaking news events. And we didn’t just build the system, we also evaluated it on two breaking news scenarios with seven super-star social media editors at leading local, national, and international news outlets.

The features we built into SRSR were informed by talking with many journalists and include facilities to filter and find eyewitnesses and archetypical user-types, as well as to characterize sources according to their implicit location, network, and past content. The SRSR interface allows the user to quickly scan through potential sources and get a feeling for whether they’re more or less credible and if they might make good sources for a story. Here’s a snapshot showing some content we collected and processed around the Tottenham riots.

Automatically Identifying Eyewitnesses
A core feature we built into SRSR was the ability to filter sources based on whether or not they were likely to be eyewitnesses. To determine if someone was an eyewitness we built an automatic classifier that looks at the text content shared by a user and compares it to a dictionary of over 700 key terms relating to perception, seeing, hearing, and feeling – the kind of language you would expect from eyewitnesses. If a source uses one of the key terms then we label them as a likely eyewitness. Even using this relatively simple classifier we got fairly accurate results: precision was 0.89 and recall was 0.32. This means that if a source uses one of these words it’s highly likely they really are an eyewitness to the event, but that there were also a number of eyewitnesses who didn’t use any of these key words (hence the lower recall score). Being able to rapidly find eyewitnesses with first-hand information was one of the most liked features in our evaluation. In the future there’s lots we want to do to make the eyewitness classifier even more accurate.
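A dictionary-based classifier of this kind can be sketched in a few lines; the term list here is a tiny invented stand-in for the ~700-term dictionary, not the actual SRSR lexicon:

```python
# Minimal sketch of a dictionary-based eyewitness classifier. The term set
# is an invented stand-in for the ~700-term perception dictionary.
EYEWITNESS_TERMS = {"saw", "see", "heard", "hear", "felt", "feel",
                    "smoke", "explosion"}

def likely_eyewitness(tweets):
    """Label a source as a likely eyewitness if any tweet uses a key term."""
    words = {w.strip(".,!?").lower() for t in tweets for w in t.split()}
    return bool(words & EYEWITNESS_TERMS)

print(likely_eyewitness(["I heard a loud bang and saw smoke over the high street"]))  # True
print(likely_eyewitness(["Reading news coverage of the riots tonight"]))              # False
```

The precision/recall tradeoff reported above follows directly from this design: matching a perception term is strong positive evidence, but eyewitnesses who never use those words are missed.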

Automatically Identifying User Archetypes
Since different types of users on Twitter may produce different kinds of information we also sought to segment users according to some sensible archetypes: journalists/bloggers, organizations, and “ordinary” people. For instance, around a natural hazard news event, organizations might share information about marshaling public resources or have links to humanitarian efforts, whereas “ordinary” people are more likely to have more eyewitness information. We thought it could be helpful to journalists to be able to rapidly classify sources according to these information archetypes and so we built an automatic classifier for these categories. All of the details are in the CSCW paper, but we basically got quite good accuracy with the classifier across these three categories: 90-95%. Feedback in our evaluation indicated that rapidly identifying organizations and journalists was quite helpful.

Visually Cueing Location, Network, Entities
We also developed visual cues that were designed to help journalists assess the potential verity and credibility of a source based on their profile. In addition to showing the location of the source, we normalized and aggregated locations within a source’s network. In particular we looked at the “friends” of a source (i.e. people the source follows who also follow the source back) and show the top three most frequent locations in that network. This gives a sense of where this source knows people and has their social network. So even if I don’t live in London, if I know 50 people there it suggests I have a stake in that location or may have friends or other connections to that area that make me knowledgeable about it. Participants in our evaluation really liked this cue as it gives a sense of implicit or social location.
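The aggregation behind this cue amounts to a frequency count over normalized friend locations; here is a sketch with invented data:

```python
# Hypothetical sketch of the "implicit location" cue: surface the three most
# common locations among a source's friends. The friend locations are invented.
from collections import Counter

friend_locations = ["London", "London", "Tottenham", "Manchester", "London",
                    "Tottenham", "New York"]

top_locations = [place for place, _ in Counter(friend_locations).most_common(3)]
print(top_locations)
```

In practice the hard part is the normalization step (free-text profile locations like "londontown" or "N17" need to be resolved to comparable place names) before the counting is meaningful.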

We also show a small sketch of the network of a source indicating who has shared relevant event content and is also following the source. This gives a sense of whether many people talking about the news event are related to the source. Journalists in our evaluation indicated that this was a nice credibility cue. For instance, if the Red Cross is following a source that’s a nice positive indicator.

Finally, we aggregated the top five most frequent entities (i.e. references to corporations, people, or places) that a source mentioned in their Twitter history (we were able to capture about 1000 historical messages for each person). The idea was that this could be useful to show what a source talks about, but in reality our participants didn’t find this feature that useful for the breaking news scenarios they were presented with. Perhaps in other scenarios it could still be useful?

What’s Next
While SRSR is a nice step forward there’s still plenty to do. For one, our prototype was not built for real-time events and was tested with pre-collected and processed data due to limitations of the Twitter API (hey Twitter, give me a call!!). And there’s plenty more to think about in terms of enhancing the eyewitness classifier, thinking about different ways to use network information to spider out in search of sources, and to experiment with how such a tool can be used to cover different kinds of events.

Again, for all the gory details on how these features were built and tested you can read our research papers. Here are the full references:

  • N. Diakopoulos, M. De Choudhury, M. Naaman. Finding and Assessing Social Media Information Sources in the Context of Journalism. Proc. Conference on Human Factors in Computing Systems (CHI). May, 2012. [PDF]
  • M. De Choudhury, N. Diakopoulos, M. Naaman. Unfolding the Event Landscape on Twitter: Classification and Exploration of User Categories. Proc. Conference on Computer Supported Cooperative Work (CSCW). February, 2012. [PDF]


News Headlines and Retweets

How do you maximize the reach and engagement of your tweets? This is a hugely important question for companies who want to maximize the value of their content. There are even start-ups, like Social Flow, that specialize in optimizing the “engagement” of tweets by helping to time them appropriately. A growing body of research is also looking at what factors, both of the social network and of the content of tweets, impact how often tweets get retweeted. For instance, some of this research has indicated that tweets are more retweeted when they contain URLs and hashtags, when they contain negative or exciting and intense sentiments, and when the user has more followers. Clearly time is important too and different times of day or days of week can also impact the amount of attention people are paying to social media (and hence the likelihood that something will get retweeted).

But aside from the obvious thing of growing their follower base, what can content creators like news organizations do to increase the retweetability of their tweets? Most news organizations basically tweet out headlines and links to their stories. And that delicate choice of words in writing a headline has always been a bit of a skill and an art. But with lots of data now we can start being a bit more scientific by looking at what textual and linguistic features of headlines tend to be associated with higher levels of retweets. In the rest of this post I’ll present some data that starts to scratch at the surface of this.

I collected all tweets from the @nytimes Twitter account between July 1st, 2011 and Sept. 30th, 2011 using the Topsy API. I wanted to analyze somewhat older tweets to make sure that retweeting had run its natural course and that I wasn’t truncating the retweeting behavior. Using data from only one news account has the advantage that it controls for the network and audience and allows me to focus purely on textual features. In all I collected 5101 tweets, including how many times each tweet was retweeted (1) using the built-in retweet button and (2) using the old syntax of “RT @username”. Of these tweets, 93.7% contained links to NYT content, 1.0% contained links to other content (e.g. yfrog, instagram, or government information), and 0.7% were retweets themselves. The remaining 4.6% of tweets in my sample had no link.

The first thing I looked at was what the average number of retweets was for the tweets in each group (links to NYT content, links to other content, and no links).

  • Average # of RTs for tweets with links to NYT content: 48.0
  • Average # of RTs for tweets with links to other content: 48.1
  • Average # of RTs for tweets with no links: 83.8

This is interesting because some of the best research out there suggests that tweets WITH links get more RTs. But I found just the opposite: tweets with NO LINKS got more RTs (1.74 times as many on average). I read through the tweets with no links (there are only 234) and they were mostly breaking news alerts like “Qaddafi Son Arrested…“, “Dow drops more than 400 points…“, or “Obama and Boehner Close to Major Budget Deal…“. So from the prior research we know that for any old tweet source, URLs are a signal that is correlated with RTs, but for news organizations, the most “newsy” or retweetable information comes in a brief snippet, without a link. The implication is not that news organizations should stop linking their content to get more RTs, but rather that the kind of information shared without links from news organizations (the NYT in particular) is highly retweetable.

For the textual analysis, though, I wanted to look just at tweets with links back to NYT content. So the rest of the analysis was done on the 4780 tweets with links back to NYT content. These tweets basically take the form &lt;story headline&gt; + &lt;link&gt;. I broke the dataset up into the top and bottom 10% of tweets (deciles) as ranked by their total number of RTs, which includes RTs using the built-in RT button as well as old-style RTs. The overall average # of RTs was 48.3, but in the top 10% of tweets it was 173 and in the bottom 10% it was 7.4. Here’s part of the distribution:
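The decile split described above can be sketched as follows (a hypothetical helper, assuming each tweet is stored as a (text, retweet_count) pair):

```python
def split_deciles(tweets):
    """Return (top 10%, bottom 10%) of tweets ranked by retweet count."""
    ranked = sorted(tweets, key=lambda t: t[1], reverse=True)
    n = max(1, len(ranked) // 10)  # size of one decile
    return ranked[:n], ranked[-n:]
```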

Is length of a tweet related to how often it gets retweeted? I looked at the average length of the tweets (in characters) in the top and bottom 10%.

  • Top 10%: 75.8 characters
  • Bottom 10%: 82.8 characters

This difference is statistically significant using a t-test (t=5.23, p < .0001). So tweets that are in the top decile of RTs are shorter, on average, by about 7 characters. This isn’t prescriptive, but it does suggest an interesting correlation that headline / tweet writers for news organizations might consider exploring.
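For reference, the t statistic for two independent samples can be computed like this. I’m showing Welch’s variant (which doesn’t assume equal variances); the post doesn’t specify which variant was used, so treat this as one reasonable choice rather than the original computation:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)  # sample variance of b
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
```

Applied to the character lengths of the bottom-decile tweets versus the top-decile tweets, a statistic of this form underlies the t = 5.23 reported above.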

I also wanted to get a feel for which words were used more frequently in either the top or bottom decile. To do this I computed the frequency distribution of words for each dataset (i.e. how many times each unique word was used across all the tweets in that decile). Then for each word I computed a ratio indicating how frequent it was in one decile versus the other. A ratio above 1 indicates that the word is more likely to occur in one decile than in the other. I’ve embedded the data at the end of this post in case you want to see the top 50 words ranked by their ratio for both the top and bottom deciles.
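A minimal version of that ratio computation looks like this. The whitespace tokenization and the handling of words absent from one decile (they’re simply skipped to avoid division by zero) are simplifying assumptions; the original analysis may have tokenized and smoothed differently:

```python
from collections import Counter

def word_ratios(top_tweets, bottom_tweets):
    """For each word appearing in both sets, return the ratio of its
    relative frequency in the top decile to that in the bottom decile."""
    top = Counter(w.lower() for t in top_tweets for w in t.split())
    bot = Counter(w.lower() for t in bottom_tweets for w in t.split())
    n_top, n_bot = sum(top.values()), sum(bot.values())
    return {w: (top[w] / n_top) / (bot[w] / n_bot)
            for w in top if w in bot}
```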

From scanning the word lists you can see that pronouns (e.g. “I, you, my, her, his, he” etc.) are used more frequently in tweets from the bottom decile of RTs. Tweets that were in the top decile of RTs were more likely to use words relating to crime (e.g. “police”, “dead”, “arrest”), natural hazards (“irene”, “hurricane”, “earthquake”), sports (“soccer”, “sox”), or politically contentious issues (e.g. “marriage” likely referring to the legalization of gay marriage in NY). I thought it was particularly interesting that “China” was much more frequent in highly RTed tweets. To be clear, this is just scratching the surface and I think there’s a lot more interesting research to do around this, especially relating to theories of attention and newsworthiness.

The last bit of data analysis I did was to look at whether certain parts of speech (e.g. nouns, verbs, adjectives) were used differently in the top and bottom RT deciles. More specifically I wanted to know: Are different parts of speech used more frequently in one group than the other? To do this, I used a natural language processing toolkit (NLTK) to compute the parts of speech (POS) of all of the words in the tweets. Of course this isn’t a perfect procedure and sometimes the POS tagger makes mistakes, so I consider this analysis preliminary. I used a chi-square test to see if there was a statistical difference in the frequency of nouns, adverbs, conjunctions (e.g. “and”, “but”, etc.), determiners (e.g. “a”, “some”, “the”, etc.), pronouns, and verbs used in either the top or bottom 10% of RTs. What I found is that there is a strong statistically significant difference for adverbs (p < .02), determiners (p < .001), and verbs (p < .003), and somewhat of a difference for conjunctions (p = .06). There was no difference in usage for adjectives, nouns, or pronouns. Basically what this boils down to is that, in tweets that get lots of RTs, adverbs and determiners (and conjunctions, somewhat) are used substantially less, while verbs are used substantially more. Perhaps it’s the less frequent use of determiners and adverbs that (as described above) makes these tweets shorter on average. Again, this isn’t prescriptive, but there may be something here in terms of how headlines are written. More use of verbs, and less use of “empty” determiners and conjunctions, is correlated with higher levels of retweeting. Could it be the case that action words (i.e. verbs) somehow spur people to retweet the headline? Pinning down the causality of this is something I’ll be working on next!
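The tagging itself came from NLTK, but the chi-square comparison for any single part of speech reduces to a 2x2 contingency test. Here’s a sketch using the shortcut formula for 2x2 tables (without Yates’ continuity correction, which the post doesn’t mention either way):

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]],
    e.g. a = verb tokens in the top decile, b = non-verb tokens in the top
    decile, and c, d = the same counts for the bottom decile."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den
```

With 1 degree of freedom, a statistic above about 3.84 corresponds to p < .05, which is the kind of threshold behind the p-values reported above.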

Here are the lists of words I promised. If you find anything else notable, please leave a comment!

Journalism as Information Science

The core activity of journalism basically boils down to this: knowledge production. It’s presented in various guises: stories, maps, graphics, interviews, and more recently even things like newsgames, but it all essentially entails the same basic components of gathering, organizing, and synthesizing information, and publishing new (sometimes just new-to-you) knowledge. To be sure, the particular flavor of knowledge is colored by the cultural milieu, ethics, and temporal constraints through which journalism extrudes information into knowledge. Journalists add value to information and news by making sense of it, making it more accessible and memorable, and putting it in context.

Many of the practices followed by journalists in the process of knowledge production map quite neatly to corresponding ideas in information science. Thankfully, information science studies knowledge production in a much more structured fashion, and in the rest of this post I’d like to surface some of that structure as a way of reflecting on what journalists do, and of thinking about how technology could enhance such processes.

Much of what journalists are engaged with on a day-to-day basis is adding value to information. Raw data and information is harvested from the world, and as the journalist gathers it, makes sense of it, puts it in context, increases its quality, and frames it for decision making, it gets more and more valuable to the end-user. And by “value” I don’t necessarily mean monetary value, but rather usefulness in meeting a user need. This point is important because it implies that the value of information is perceived and driven by user needs in context. And the process is cyclical or recursive: the output of someone else, be it an article, tweet, or comment, can be fed into the process for the next output.

Robert S. Taylor, one of the fathers of information studies at Syracuse University, wrote an entire book on value-added processes in information systems. Below I examine the processes that he described. There may be some information processes that journalists could learn to do more effectively, with or without new tools. Taylor organized the processes into four broad categories:

  • Ease of Use: This includes information usability such as information architecture (i.e. how to order information), design (i.e. how to format and present information), and browseability. When journalists take a table of numbers and present them as a map or graph they are making that data far more accessible and usable; when they write a compelling story which incorporates those numbers it is also increasing value through usability. Physical accessibility is also important to ease of use, and there’s no doubt that the physical accessibility of information on a mobile or tablet is different than on a desktop.
  • Noise Reduction: This involves the processes of inclusion and exclusion with an understanding of relevance that may be informed by context or end-user needs. Journalists constantly engage in noise reduction as they assemble a story and decide what is relevant to include and what is not, and even through their very judgement of what is considered newsworthy. Summarization is another dimension of this, as is linking, which provides access to other relevant information.
  • Quality: A lot of value is added to information by enhancing its quality. Quality decisions depend on quality information: garbage in, garbage out. Quality includes aspects of accuracy, comprehensiveness (i.e. completeness of coverage), currency, reliability (i.e. consistency and dependability), and validity. Journalists (sometimes) engage in fact-checking to enhance accuracy, as well as corroboration of sources as a method to increase validity. Different end-user contexts and needs make different demands on quality: non-breaking news doesn’t make the same demands on currency, for instance. Seeing as quality (i.e. a commitment to truth) is a central value of journalism, it stands to reason that tools built for journalism might consider new ways of enhancing quality.
  • Adaptability: The idea of adaptability is that information is most valuable when it meets the specific needs of a person with a particular problem. This involves knowing what users’ information needs are. Another dimension is flexibility: providing a variety of ways to work with information. Oftentimes I think adaptability is addressed in journalism through nichification – that is, one outlet specializes in a particular information need, as Consumer Reports does, for example.

You can’t really argue that any of these processes aren’t important to the knowledge produced by journalists, and many (all?) of them are also important to others who produce knowledge. There are people out there who specialize in some of these activities. For instance, my alma mater, Georgia Tech, pumps out master’s degrees in Human Computer Interaction, a program that teaches you a whole lot about that first category above – ease of use. Journalism could benefit from more cross-functional teams with such specialists.

The question moving forward is: How can technology inform the design of new tools that enable journalists to add the above values to information? Quality seems like a likely target since it is so important in journalism. But aspects of noise reduction (summarization) and adaptability may also be well-suited to augmenting technologies. Moreover, newer forms of information (e.g. social media) are in need of new processes that can add value.

Modeling Computing and Journalism (Part I)

Recently I’ve been thinking more about modeling the intersection of computing and journalism, and in particular thinking about ways that aspects of computing might impact or allow for innovation in journalism. It struck me that I needed a more precise definition of computing and its purview (I’ll come back to the journalism side of the equation in a later post). What, exactly, is computing? I’ll try to answer that in this post…

Definitions of computing and computer science abound online, but the most canonical comes perhaps from Peter Denning, an elder in the field of Computer Science. In a CACM article from 2005 he writes, “Computing is the systematic study of algorithmic processes that describe and transform information”. Two key words there: “algorithmic” and “information”. Computing is about information, about describing and transforming it, but also about acquiring, representing, structuring, storing, accessing, managing, processing, manipulating, communicating, and presenting it. And computing is about algorithms: their theory, feasibility, analysis, structure, expression, and implementation. The fundamental question of computing concerns what information processes can be effectively automated.

In modern CS there is a huge body of knowledge that stems from this core notion of computing. For instance, the Computer Science Curriculum published in 2008 defines 14 different areas of knowledge (see list below). The Georgia Tech College of Computing delineates some of these areas as belonging to core computer science, and others as belonging to interactive computing. Roughly, core computer science deals with the conceptual (i.e. mathematical) and operational (i.e. nuts and bolts of how a modern computer works) aspects of computing. Interactive computing, on the other hand, mostly deals with information input, modeling, and output. There are aspects of professional practice, engineering, and design that apply in both.

Core Computer Science

  • Discrete Structures, Programming Fundamentals, Software Engineering, Algorithms and Complexity, Architecture and Organization, Operating Systems, Programming Languages, Net Centric Computing, Information Management, Computational Science

Interactive Computing

  • Human Computer Interaction, Graphics and Visual Computing, Intelligent Systems

In terms of modeling the intersection of computing and journalism, it’s the interactive side of things that’s most interesting. How information is moved around inside a computer is less important for journalists to understand than the interactive capabilities of information input, modeling, and output afforded by computing. That is, how does computing interface with the rest of the world? Of course many of the capabilities of computers studied in interactive computing rest on solid foundations of core computer science (e.g. you couldn’t get much done without an operating system to schedule processes and manage data). Core areas with particular relevance to interactive computing are technologies in networking/communications, information management, and, to a lesser extent, computational science. Below I list more detailed sub-areas for each of the interactive computing and related core areas.

  • Human Computer Interaction (HCI) includes sub-areas such as interaction design, user-centered design, multimedia systems, collaboration, online communities, human-robot interaction, natural interaction, tangible interaction, mobile and ubiquitous computing, wearable computing, and information visualization
  • Graphics and Visual Computing includes sub-areas such as geometric modeling, materials modeling and simulation, rendering, image synthesis, non-photorealistic rendering, volumetric rendering, animation, motion capture, scientific visualization, virtual environments, computer vision, image processing and editing, game engines, and computational photography
  • Intelligent Systems includes sub-areas such as general AI including search and planning, cognitive science, knowledge-based reasoning, agents, autonomous robotics, computational perception, machine learning, natural language processing and understanding, machine translation, speech recognition, and activity recognition
  • Net Centric Computing includes aspects of networking, web architecture, compression, and mobile computing.
  • Information Management includes aspects of database systems, information architecture, query languages, distributed data, data mining, information storage and retrieval, hypermedia, and multimedia databases.
  • Computational Science includes aspects of modeling, simulation, optimization, and parallel computing often oriented towards big data sets.

So what can we do with this detailed typology of interactive computing technology?

In a 2004 CACM article Paul Rosenbloom developed a notation for describing how computing interacts with other fields. In his typology, he articulated ways in which computing could implement, interact with, and embed with other disciplines, namely the physical, life, and social sciences. These different relationships between fields lead to different kinds of ideas for technology (e.g. an embedding relationship of computing in the life sciences suggests the notion of cyborgs; an interaction between computing and the physical sciences suggests robotics). In this spirit, later on in this blog series I’ll look more specifically at how some of the computing technologies articulated above map to aspects of journalism practice, with an eye toward innovation in journalism by applying computing in new or under-explored ways.

Newsgame Platforms

So this past weekend I had the opportunity (and pleasure) to attend a newsgames workshop at the University of Minnesota. The purpose of the gathering, which brought in academics, game designers, and journalists, was to brainstorm around the topic of newsgames. What are some of the questions that we need to address in order to make progress in this domain?

While there were discussions on everything from the business end of monetizing games to organizational / cultural clashes, here I’m going to summarize some of the thinking we did on the medium of newsgames itself, including issues of building platforms for newsgames. Platforms are, incidentally, one of the areas discussed in Newsgames: Journalism at Play by Bogost, Ferrari, and Schweizer.

At the top of our list was the question of how news organizations could repurpose their existing content (including text, video, audio, or data) into newsgames. There’s a huge investment in the content that’s already being produced by newsrooms. Can this form a platform for newsgames? Can we come up with new ways to take content that’s already produced and create compelling, playable experiences from it? Once we figure out effective mappings, can we generate these content games automatically, or with minimal human involvement? Some examples of games already touching on this space are Hangman RSS and Scoop, both of which use news headlines to produce word puzzle games. Some of my own work on Salubrious Nation has looked at how to take data sets from the likes of data.gov and turn them into playable experiences.

A recurring tension that we identified was the timeliness issue. What’s the scale and speed with which newsgames need to be developed? Certainly, there are many different types of stories that could be told with newsgames; do we need them for breaking news, or does it make more sense to make newsgames for ongoing issues and debates? Programming is simply time-consuming, and combined with editorial development, newsgames can be pretty slow to develop. But if we were to think of a platform or templates for newsgames that make use of recurring streams of information, this could alleviate the time strain. We brainstormed some content streams that we thought would fit this model: sports data, budgets, economic indicators, natural disasters, weather, conflict / war, births / deaths, business / financial statements, movie releases, book/restaurant/other reviews, traffic, crime, comments and other user generated content, travel … and the list could go on. If we have cyclic data streams, why not create game templates that can be quickly generated based on the latest dump of that stream?

Running counter to the idea of developing a platform for newsgames was the tension between abstraction and specificity. If you build a framework (abstracting the process), what does this mean for the kinds of stories you can tell? Typically games are rich, semantically laden experiences, so if we platformatize the newsgame production process we might lose some of that nuance and richness. Let’s draw an analogy to Google Maps as a platform for developing geo-stories. When those first came out they were relatively limited and you pretty much just had pink pins to indicate locations: certainly constraining the types and richness of stories you could tell. But now you can do a lot more with Google Maps: it’s more customizable, you can embed Google Charts, and the flexibility built into the framework allows for many different types of stories to be told. This makes me optimistic that we might yet find platforms for newsgames that vastly simplify the authoring process but still allow for a certain flexibility and nuance in the story.

These are really just a sampling of the issues and questions that were discussed at the workshop, but some that I personally thought were the most interesting. There’s a lot of work to do in this space, both designing and studying what works and what doesn’t. It’s great to have participated in the brainstorming; now it’s time to get to work.