Category Archives: journalism

Review: The Functional Art

I don’t often write reviews of books. But I can’t resist offering some thoughts on The Functional Art, a new book by Alberto Cairo aimed at teaching the basics of information graphics and visualization, mostly because I think it’s fantastic, but also because I think there are a few areas where I’d like to see a future edition expound.

Basically I see this as the new default book for teaching journalists how to do infographics and visualization. If you’re a student of journalism, or just interested in developing better visual communication skills, I think this book has a ton to offer and is very accessible. But what’s really amazing is that the book also offers a lot to people already in the field (e.g. designers or computer scientists) who want to learn more about the journalistic perspective on visual storytelling. There are nuggets of wisdom sprinkled throughout the book, informed by Cairo’s years of journalism experience. And the diagrams and models for thinking about things like designer-user relationships or the dimensions along which graphics vary add some much-needed structure, forming a framework for thinking about and characterizing information graphics.

Probably the most interesting aspect of the book for someone already doing or studying visualization is the last set of chapters, which detail, through a series of interviews with practitioners, how “the sausage is made.” Exposing process in this way is extremely valuable for learning how these things get put together. This exposition continues on the included DVD, in which additional production artifacts, sketches, and mockups form a show-and-tell. And it’s not just about artifacts; the interviews also explore things like how teams are composed in order to facilitate collaborative production.

One of the things I appreciated most about the book is that, in light of its predominant focus on practice, Cairo fearlessly reads into and then translates research results into practical advice, offering an evidence-based rationale for design decisions. We need more of that kind of thinking, for all sorts of practices.

I have only a few critiques of the book. The first is straightforward: I wish that the book were printed in a larger format because some of the examples shown in the book are screaming for more breathing space. I would also have liked to see the computer science perspective represented a bit more thoroughly in the book – this could, for instance, enhance and add depth to the discussion about interactivity with visualizations. My only other critique of the book is about critique itself. What I mean is that the idea of critique is sprinkled throughout the book, but I’d almost like to see it elevated to the status of having its own chapter. Learning the skills of critique and the thought process involved is an essential aspect of learning to be a graphics communication intellectual and thoughtful practitioner. And it can and should be taught so that students learn a systematic way of thinking about and analyzing benefits and tradeoffs. Cairo has the raw material to do this in the book, but I wish it were formalized in some way that lent it the attention it deserves. Such a method could even be illustrated using some of the interviewees’ many examples.

 

Does Local Journalism Need to Be Locally Sustainable?

The last couple of weeks have seen the rallying cries of journalists echo online as they call for support of the Homicide Watch Kickstarter campaign. The tweets “hit the fan” so to speak, Clay Shirky implored us to not let the project die, and David Carr may have finally tipped the campaign with his editorial questioning foundations’ support for Big News at the expense of funding more nimble start-ups like Homicide Watch.

It seems like a good idea too – providing more coverage of a civically important issue – and one that’s underserved to boot. But is it sustainable? As Jeff Sonderman at Poynter wrote about the successful Kickstarter campaign, “The $40,000 is not a sustainable endowment, just a stopgap to fund intern staffing for one year.”

For Homicide Watch to be successful at franchising to other cities (i.e. by selling a platform) each of those franchises itself needs to be sustained. This implies that, on a local level, either enough advertising buy-in, local media support, or crowdfunding (a la Kickstarter) would need to be generated to pay those pesky labor costs, the most expensive cost in most content businesses.

Here’s the thing. Even though Homicide Watch was funded, it struggled to get there, mostly surviving on the good-natured altruism of the media elite. I doubt that local franchises will be able to repeat that trick. Here’s why: most of the donors who gave to Homicide Watch were from elsewhere in the U.S. (68%) or from other countries (10%). Only 22% of donors were from DC, Virginia, or Maryland (see below for details on where the numbers come from). This means that people local to Washington, DC, those who ostensibly would have the most to gain from a project like this, barely made up more than a fifth of the donors. Other local franchises probably couldn’t count on the kind of national attention that the media elite brought to the Homicide Watch funding campaign, nor could they count on the national interest afforded to the nation’s capital.

You might argue that for something like this to flourish it needs local support, from the people who would get the real utility of the innovation. At least Homicide Watch got a chance to prove itself out, but we’ll have to wait to see if it can make a sustainable business and provide real information utility at a local level. The numbers at this stage would seem to suggest it’s got an uphill battle ahead of it.

Stats
Here’s how I got the stats I quoted above. I made a ScraperWiki script to collect all of the donors on the Homicide Watch Kickstarter page (there were 1,102 as of about noon on 9/12). Of those 1,102, 270 donors had geographic information (city, state, country). The stats quoted above are based on those 270 geotagged donors. Of course, that’s only about 25% of the total donors, so an assumption that I make above is that the other 75%, the non-geotagged donors, follow a similar geographic distribution (and donation magnitude distribution) as the geotagged ones. I can’t think of a reason that assumption might not be true. For kicks I put the data up on Google Fusion Tables (it’s so awful, please, someone fix that!) so here’s a map of what states donors come from.
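To get a rough sense of how much the 22% local share could wobble given that only 270 donors were geotagged, here is a minimal sketch of a normal-approximation (Wald) 95% interval. It assumes the geotagged donors behave like a random sample of all donors, which is the same assumption made above. The function name is illustrative and is not part of the original scraper.

```python
import math

def proportion_interval(p, n, z=1.96):
    """Normal-approximation (Wald) interval for a sample proportion."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Figures quoted above: 22% of the 270 geotagged donors were local.
low, high = proportion_interval(0.22, 270)
print(f"local share: 22% (95% interval roughly {low:.1%} to {high:.1%})")
```

Under that assumption the local share lands somewhere around 17% to 27%, so the "about a fifth" reading holds up either way.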

Fact-Checking at Scale

Note: this is cross-posted on the CUNY Tow-Knight Center for Entrepreneurial Journalism site. 

Over the last decade there’s been substantial growth in the use of fact-checking to correct misinformation in the public sphere. Outlets like Factcheck.org and Politifact tirelessly research and assess the accuracy of all kinds of information and statements from politicians or think-tanks. But a casual perusal of these sites shows that there are usually only 1 or 2 fact-checks per day from any given outlet. Fact-checking is an intensive research process that demands considerable skilled labor and careful consideration of potentially conflicting evidence. In a task that’s so labor intensive, how can we scale it so that the truth is spread far and wide?

Of late, Politifact has expanded by franchising its operations to states – essentially increasing the pool of trained professionals participating in fact-checking. It’s a good strategy, but I can think of at least a few others that would also grow the fact-checking pie: (1) sharpen the scope of what’s fact-checked so that attention is where it’s most impactful, (2) make use of volunteer, non-professional labor via crowdsourcing, and (3) automate certain aspects of the task so that professionals can work more quickly. In the rest of this post, I’ll flesh out each of these approaches in a bit more detail.

Reduce Fact-Checking Scope
“I don’t get to decide which facts are stupid … although it would certainly save me a lot of time with this essay if I were allowed to make that distinction,” argues Jim Fingal in his epic fact-check struggle with artist-writer John D’Agata in The Lifespan of a Fact. Indeed, some of the things Jim checks are really absurd: did the subject take the stairs or the elevator, did he eat “potatoes” or “french fries”; these things don’t matter to the point of that essay, nor, frankly, to me as the reader.

Fact-checkers, particularly the über-thorough kind employed by magazines, are tasked with assessing the accuracy of every claim or factoid written in an article (see the Fact Checker’s Bible for more). This includes hard facts like names, stats, geography, and physical properties as well as what sources claim via a quotation, or what the author writes from notes. Depending on the nature of the claim some of it may be subjective, opinion-based, or anecdotal. All of this checking is meant to protect the reputation of the publication and of the writers, and to maintain trust with the public. But it’s a lot to check and the imbalance between content volume and critical attention will only grow.

To economize their attention fact-checkers might better focus on overall quality; who cares if they’re “potatoes” or “french fries”? In information science studies, the notion of quality can be defined as the “value or ‘fitness’ of the information to a specific purpose or use.” If quality is really what we’re after then fact-checking would be well-served and more efficacious if it focused the precious attention of fact-checkers on claims that have some utility. These are the claims that if they were false could impact the outcome of some event or an important decision. I’m not saying accuracy doesn’t matter, it does, but fact-checkers might focus more energy on information that impacts decisions. For health information this might involve spending more time researching claims that impact health-care options and choices; for finance it would involve checking information informing decisions about portfolios and investments. And for politics this involves checking information that is important for people’s voting decisions – something that the likes of Politifact already focus on.

Increased Use of Volunteer Labor
Another approach to scaling fact-checking is to incorporate more non-professionals, the crowd, in the truth-seeking endeavor. This is something often championed by social media journalists like Andy Carvin, who see truth-seeking as an open process that can involve asking for (and then vetting) information from social media participants. Mathew Ingram has written about how platforms like Twitter and Reddit can act as crowdsourced fact-checking platforms. And there have been several efforts toward systematizing this, notably the TruthSquad, which invited readers to post links to factual evidence that supports or opposes a single statement. A professional journalist would then write an in-depth report based on their own research plus whatever research the crowd contributed. I will say I’m impressed with the kind of engagement they got, though sadly it’s not being actively run anymore.

But it’s important to step back and think about what the limitations of the crowd in this (or any) context really are. Graves and Glaisyer remind us that we still don’t really know how much an audience can contribute via crowdsourced fact-checking. Recent information quality research by Arazy and Kopak gives us some clues about what dimensions of quality may be more amenable to crowd contributions. In their study they looked at how consistent ratings of various Wikipedia articles were along dimensions of accuracy, completeness, clarity, and objectivity. They found that, while none of these dimensions had particularly consistent ratings, completeness and clarity were more reliable than objectivity or accuracy. This is probably because it’s easier to use a heuristic or shortcut to assess completeness, whereas rating accuracy requires specialized knowledge or research skill. So, if we’re thinking about scaling fact-checking with a pro-am model we might have the crowd focus on aspects of completeness and clarity, but leave the difficult accuracy work to the professionals.

#Winning with Automation
I’m not going to fool anyone by claiming that automation or aggregation will fully solve the fact-checking scalability problem. But there may be bits of it that can be automated, at least to a degree where it would make the life of a professional fact-checker easier or make their work go faster. An automated system could allow any page online to be quickly checked for misinformation. Violations could be flagged and highlighted, either for lack of corroboration or for controversy, or the algorithm could be run before publication so that a professional fact-checker could take a further crack at it.

Hypothetical statements, opinions and matters of taste, or statements resting on complex assumptions may be too hairy for computers to deal with. But we should be able to automatically both identify and check hard facts and other things that are easily found in reference materials. The basic mechanic would be one of corroboration, a method often used by journalists and social scientists in truth-seeking. If we can find two (or more) independent sources that reinforce each other, and that are credible, we gain confidence in the truth-value of a claim. Independence is key, since political, monetary, legal, or other connections can taint or at least place contingencies on the value of corroborated information.

There have already been a handful of efforts in the computing research literature that have looked at how to do algorithmic corroboration. But there is still work to do to define adequate operationalizations so that computers can do this effectively. First of all, we need to define, identify, and extract the units that are to be corroborated. Computers need to be able to differentiate a factually stated claim from a speculative or hypothetical one, since only factual claims can really be meaningfully corroborated. In order to aggregate statements we then need to be able to match two claims together while taking into account different ways of saying similar things. This includes the challenge of context, the tiniest change in which can alter the meaning of a statement and make it difficult for a computer to assess the equivalence of statements. Then, the simplest aggregation strategy might consider the frequency of a statement as a proxy for its truth-value (the more sources that agree with statement X, the more we should believe it) but this doesn’t take into account the credibility of the source or their other relationships, which also need to be enumerated and factored in. We might want algorithms to consider other dimensions such as the relevance and expertise of the source to the claim, the source’s originality (or lack thereof), the prominence of the claim in the source, and the source’s spatial or temporal proximity to the information. There are many challenges here!
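To make the gap between raw frequency and credibility-aware corroboration concrete, here is a minimal sketch. It assumes claim matching has already been done, that each source carries a hand-assigned credibility weight, and that shared ownership is a good enough stand-in for non-independence. All names, weights, and the scoring rule itself are hypothetical illustrations, not a method taken from the research literature mentioned above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    name: str
    owner: str          # crude independence key (assumption)
    credibility: float  # 0.0 to 1.0, hypothetical hand-assigned weight

def corroboration_score(supporting):
    """Sum credibility over supporting sources, counting each owner once.

    Sources that share an owner are treated as non-independent, so only
    the most credible one per owner contributes to the score.
    """
    best_per_owner = {}
    for src in supporting:
        current = best_per_owner.get(src.owner, 0.0)
        best_per_owner[src.owner] = max(current, src.credibility)
    return sum(best_per_owner.values())

# Three sources agree with a claim, but two of them share an owner.
sources = [
    Source("Paper A", owner="Group 1", credibility=0.9),
    Source("Blog B", owner="Group 1", credibility=0.4),
    Source("Wire C", owner="Group 2", credibility=0.8),
]
naive_count = len(sources)             # frequency-only proxy
score = corroboration_score(sources)   # independence- and credibility-aware
print(naive_count, round(score, 2))
```

A frequency count treats this claim as backed three times over, while the weighted score only credits the two independent owners. The other dimensions listed above (expertise, originality, prominence, proximity) would need their own operationalizations before they could be folded into a score like this.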

Any automated corroboration method would rely on a corpus of information that acts as the basis for corroboration. Previous work like DisputeFinder has looked at scraping or accessing known repositories such as Politifact or Snopes to jump-start a claims database, and other work like Videolyzer has tried to leverage engaged people to provide structured annotations of claims. Others have proceeded by using the internet as a massive corpus. But there could also be an opportunity here for news organizations, who already produce and have archives of lots of credible and trustworthy text (e.g. rigorously fact-checked magazines), to provide a corroboration service based on all of the claims embedded in those texts. Could news organizations even make money by syndicating their archives like this?

There are of course other challenges to fact-checking that also need to be surmounted, such as the user-interface for presentation or how to effectively syndicate fact-checks across different media. In this essay I’ve argued that scale is one of the key challenges to fact-checking. How can we balance scope with professional, non-professional, and computerized labor to get closer to the truth that really matters?

 

Tweaking Your Credibility on Twitter

You want to be credible on social media, right? Well, a paper to be published at the Conference on Computer Supported Cooperative Work (CSCW) in early 2012 from researchers at Microsoft and Carnegie Mellon suggests at least a few actionable methods to help you do so. The basic motivation for the research is that when people see your tweet via a search (rather than following you) they have fewer cues to assess credibility. With a better understanding of what factors influence tweet credibility, new search interfaces can be designed to highlight the most relevant credibility cues (now you see why Microsoft is interested).

First off, five people were interviewed by the researchers to collect a range of issues that might be relevant to credibility perception. They came up with a list of 26 possible credibility cues and then ran a survey with 256 respondents in which they asked how much each feature impacted credibility perception. You can see the paper for the full results, but, for instance, things like keeping your tweets on a similar topic, using a personal photo, having a username related to the topic, having a location near a topic, having a bio that suggests relevant topical expertise, and frequent tweeting were all perceived by participants to positively impact credibility to some extent. Things like using non-standard grammar and punctuation or using the default user image were seen to detract from credibility.

Based on their first survey, the researchers then focused on three specific credibility cues for a follow-on study: (1) topic of tweets (politics, science, or entertainment), (2) user name style (first_last, internet – “tenacious27”, and topical – “AllPolitics”), and finally (3) user image (male / female photo, topical icon, generic icon, and default). For the study, each participant (there were 266) saw some combination of the above cues for a tweet, and rated both tweet credibility and author credibility. Unsurprisingly, tweets about the science topic were rated as more credible than those on politics or entertainment. The most surprising result to me was that topically relevant user names were more credible than traditional names (or internet style names, though that’s not surprising). In a final follow-up experiment the researchers found that the user image doesn’t impact credibility perceptions, except when the image is the default image, in which case it significantly (in the statistical sense) lowers perceptions of tweet credibility.

So here are the main actionable take-aways:

  • Don’t use non-standard grammar and punctuation (no “lol speak”).
  • Don’t use the default image.
  • Tweet about topics like science, which seem to carry an aura of credibility.
  • Find a user name that is topically aligned with those you want to reach.

That last point of finding a topically aligned user name might be an excellent strategy for large news organizations to build a more credible presence across a range of topics. For instance, right now the NY Times has a mix of accounts that have topical user names, as well as reporters using their real names. In addition to each reporter having their own “real name” account, individual tweets of theirs that were topically relevant could be routed to the appropriate topically named account. So for instance, let’s say Andy Revkin tweets something about the environment. That tweet should also show up via the Environment account, since the tweet may be perceived as having higher credibility from a topically related user name. For people who search and find that tweet, of course if they know who Andy Revkin is, then they’ll find his tweet quite credible since he’s known for having that topical expertise. But for someone else who doesn’t know who Andy Revkin is, the results of the above study suggest that that person would find the same content more credible coming from the topically related Environment account. Maybe the Times or others are already doing this. But if not, it seems like there’s an opportunity to systematically increase credibility by adopting such an approach.

Designing Tools for Journalism

Whether you’re designing for professionals or amateurs, for people seeking to reinvigorate institutions or to invent new ones, there are still core cultural values ensconced in journalism that can inspire and guide the design of new tools, technologies, and algorithms for committing acts of journalism. How can we preserve the best of such values in new technologies? One approach is known as value sensitive design and attempts to account for human values in a comprehensive manner throughout the design process by identifying stakeholders, benefits, values, and value conflicts to help designers prioritize features and capabilities.

“Value” is defined as “what a person or group of people consider important in life”. Values could include things like privacy, property rights, autonomy, and accountability among other things. What does journalism value? If we can answer that question, then we should be able to design tools for professional journalists that are more easily adopted (“This tool makes it easy to do the things I find important and worthwhile!”), and we should be able to design tools that more easily facilitate acts of journalism by non-professionals (“This tool makes it easy to participate in a meaningful and valuable way with a larger news process!”). Value sensitive design espouses consideration of all stakeholders (both direct and indirect) when designing technology. I’ve covered some of those stakeholders in a previous post on what news consumers want, but another set of stakeholders would be those relating to the business model (e.g. advertisers). In any case, mismatches between the values and needs of different stakeholders will lead to conflicts that need to be resolved by identifying benefits and prioritizing features.

When we turn to normative descriptions of journalism, such as Kovach and Rosenstiel’s The Elements of Journalism and Blur, Schudson’s The Sociology of News, or descriptions of ethics principles from the AP or ASNE, we find both core values and valued activities. It’s easiest to understand these as ideals which are not always met in practice. Some core values include:

  • Truth: including a commitment to accuracy, verification, transparency, and putting things in context
  • Independence: from influence by those they cover, from politics, from corporations, or from others they seek to monitor
  • Citizen-first: on the side of the citizen rather than for corporations or political factions
  • Impartiality: except when opinion has been clearly marked
  • Relevance: to provide engaging and enlightening information

Core values also inform valued activities or roles, such as:

  • Informer: giving people the information they need or want about contemporary affairs of public interest
  • Watchdog: making sure powerful institutions or individuals are held to account (also called “accountability journalism”)
  • Authenticator: assessing the truth-value of claims (“factchecking”); also relates to watchdogging
  • Forum Organizer: orchestrating a public conversation, identifying and consolidating community
  • Aggregator: collecting and curating information to make it accessible
  • Sensemaker: connecting the dots and making relationships salient

Many of these values and valued activities can be seen from an information science perspective as contributing to information quality, or the degree of excellence in communicating knowledge. I’ll revisit the parallels to information science in a future post.

Besides core values and valued activities, there are other, perhaps more abstract, processes which are essential to producing journalism, like information gathering, organization and sensemaking, communication and presentation, and dissemination. Because they’re more abstract these processes have a fair amount of variability as they are adapted for different milieus (e.g. information gathering on social media) or media (e.g. text, image, video, games). Often valued activities are already the composition of several of these underlying information processes that have been infused with core values. We should be on the lookout for “new” valued activities waiting for products to emerge around them, for instance, by considering more specific value-added information processes in conjunction with core values.

There’s a lot of potential for technology to re-invent and re-imagine valued activities and abstract information processes in light of core values: to make them more effective, efficient, satisfying, productive, and usable. Knowing the core values also helps designers understand what would not be acceptable to design for professionals (e.g. a platform to facilitate the acquisition of paid sources would probably not be adopted in the U.S.). I would argue that it’s the function that is served by the above valued activities, and not the institutionalized practices that are currently used to accomplish them, that is fundamentally important to consider for designers. While we should by all means consider designs that adhere to core values and to an understanding of the outputs of valued activities, we should also be open to allowing technology to enhance the processes and methods which get us there. Depending on whether you’re innovating in an institutional setting or in an unencumbered non-institutional environment you have different constraints, but regardless, I maintain that value sensitive design is a good way forward to ensure that future tools for journalism will be more trustworthy, have more impact, and resonate more with the public.

Unpacking Visualization Rhetoric

Note: An edited version of the following also appears on the Chart.io blog. 

Visualization can be useful both for more exploratory purposes (e.g. generating analyses and insights based on data) and for more communicative ends (e.g. helping other people understand and be persuaded or informed by the insights that you’ve uncovered). Oftentimes more general visualization techniques are used in the exploratory phase, whereas more specific, tailored, and hand-crafted techniques (like infographics) tend to be preferred for maximal persuasive potential in the communicative phase.

In the middle ground is a class of visualizations termed “narrative visualization” – often used in journalism contexts – which tend to include aspects of both exploratory and communicative visualization. This blending of techniques makes for an interesting domain of study and it’s here where Jessica Hullman and I began investigating how different rhetorical (persuasive) techniques are employed in visualization. We were particularly interested in how different rhetorical techniques can be used to affect the interpretation of a visualization – valuable knowledge for visualization designers hoping to influence and mold the interpretation of their audience. (Here we defer the sticky ethical question of whether someone should use these techniques, since in general they can be used for both good and ill.)

We carefully analyzed 51 narrative visualizations and constructed a taxonomy of rhetorical techniques we found being used. We observed rhetorical techniques being employed at four different editorial layers of a visualization: data, visual representation, annotations, and interactivity. Choices at any of these layers can have important implications for the ultimate interpretation of a visualization (e.g. the design of available interactivity can direct or divert attention). The five main classes of rhetoric we found being used include: information access (e.g. how data is omitted or aggregated), provenance (e.g. how data sources are explained and how uncertainty is shown), mapping (e.g. the use of visual metaphor), linguistic techniques (e.g. irony or apostrophe), and procedural rhetoric (e.g. how default views anchor interpretation).

The maxim “know thy audience” points to another dimension by which a visualization creator can influence the interpretation of a visualization. While most visualizations concentrate on the denotative level of communication, the most effective visualization communicators also make use of the connotative level of communication to unlock a whole other plane of interpretation. For instance, various cultural codes (e.g. what colors mean), or conventions (e.g. line graphs suggest you’re looking at temporal data even if you’re not) can suggest alternate or preferred interpretations.

While the full explanation of the taxonomy and use of codes and connotation for communication in visualization is beyond this blog post, you can see a more complete discussion in a pre-print of our forthcoming InfoVis paper. At the very least though I’ll leave you with an example which illustrates some of these concepts.

Take the following recent example from the New York Times where various aspects of the visualization rhetoric framework apply.

The choice of labeling on the dimensions of the chart “reduce spending” vs. “don’t reduce spending” leaves out another option, “increase spending”. The choice of the color green for “willing to compromise” connotes a certain value judgement (i.e. “go, or move ahead”) as read from an American perspective. The way individual squares are aggregated to arrive at an overall color is unclear, leading to questions that could be clarified through better use of provenance rhetoric. Moreover, squares cannot be disaggregated or understood as individual data, making it difficult for users to interpret either the magnitude of the response or the specific data reported in any one square. While the visualization is compelling, applying the visualization rhetoric framework during its design could have suggested other ways to make its interpretation clearer.

Ultimately visualization rhetoric is a framework that can be useful for designers hoping to maximize the communicative potential of a visualization. Exploratory visualization platforms (like Tableau or Chart.io) could also be enhanced with an awareness of visualization rhetoric, by, for instance, allowing users to make salient use of certain rhetorical techniques when the time comes to share a visualization.

Those particularly interested in this space should consider participating in an upcoming workshop I am co-organizing on “Telling Stories with Data” at InfoVis 2011 in Providence, RI in late October.

Visualization, Data, and Social Media Response

I’ve been looking into how people comment on data and visualization recently and one aspect of that has been studying the Guardian’s Datablog. The Datablog publishes stories of and about data, oftentimes including visualizations such as charts, graphs, or maps. It also has a fairly vibrant commenting community.

So I set out to gather some of my own data. I scraped 803 articles from the Datablog including all of their comments. From this data I wanted to know if articles which contained embedded data tables or embedded visualizations produced more of a social media response. That is, do people talk more about the article if it contains data and/or visualization? The answer is yes, and the details are below.

While the number of comments could be scraped off of the Datablog site itself I turned to Mechanical Turk to crowdsource some other elements of metadata collection: (1) the number of tweets per article, (2) whether the article has an embedded data table, and (3) whether the article has an embedded visualization. I did a spot check on 3% of the results from Turk in order to assess the Turkers’ accuracy on collecting these other pieces of metadata: it was about 96% overall, which I thought was clean enough to start doing some further analysis.

So next I wanted to look at how the “has visualization” and “has table” features affect (1) tweet volume, and (2) comment volume. There are four possibilities: the article has (1) a visualization and a table, (2) a visualization and no table, (3) no visualization and a table, (4) no visualization and no table. Since neither tweet volume nor comment volume is normally distributed, I log transformed both to bring them closer to normal (normality is an assumption of the statistical tests that follow). Moreover, there were a few outliers in the data, so any article more than 3 standard deviations from the mean of a log transformed variable was excluded.
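A minimal sketch of this preprocessing in Python, using only the standard library. The tweet counts are made up for illustration, and `log1p` (the log of 1 + x) is an assumption so that articles with zero tweets stay defined; the exact transform used for the analysis is not stated in this post.

```python
import math
import statistics

def log_transform(counts):
    """Return log(1 + x) for each count (assumed transform; keeps zeros defined)."""
    return [math.log1p(c) for c in counts]

def drop_outliers(values, max_sd=3.0):
    """Keep only values within max_sd standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= max_sd * sd]

# Hypothetical tweet counts for 21 articles; the last one is an extreme outlier.
tweet_counts = [5, 8, 10, 12, 15, 7, 9, 11, 13, 6,
                10, 14, 8, 9, 12, 11, 7, 10, 13, 9, 1000000]

logged = log_transform(tweet_counts)
kept = drop_outliers(logged)
print(f"{len(tweet_counts)} articles before filtering, {len(kept)} after")
```

Note that the cutoff is applied to the log transformed values, not the raw counts, which matches the order of steps described above.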

For number of tweets per article:

  1. Articles with both a visualization and a table produced the largest response with an average of 46 tweets per article (N=212, SD=103.24);
  2. Articles with a visualization and no table produced an average of 23.6 tweets per article (N=143, SD=85.05);
  3. Articles with no visualization and a table produced an average of 13.82 tweets per article (N=213, SD=42.7);
  4. And finally articles with neither visualization nor table produced an average of 19.56 tweets per article (N=117, SD=86.19).

I ran an ANOVA with post-hoc Bonferroni tests to see if the differences between these means were statistically significant. Articles with both a visualization and a table (case 1) have a significantly higher number of tweets than cases 3 (p < .01) and 4 (p < .05). Articles with just a visualization and no data table (case 2) have a higher average number of tweets per article than cases 3 and 4, but this difference was not statistically significant. The takeaway is that the combination of a visualization and a data table seems to drive a significantly higher Twitter response.
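For readers unfamiliar with the procedure, here is a rough sketch of its two ingredients: the one-way ANOVA F statistic and a Bonferroni adjustment for pairwise comparisons. It is not the analysis code behind the numbers above; the group values and raw p-values are hypothetical, and turning F into a p-value (which needs the F distribution from a statistics package) is left out.

```python
from itertools import combinations
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        g_mean = mean(g)
        ss_between += len(g) * (g_mean - grand_mean) ** 2
        ss_within += sum((v - g_mean) ** 2 for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# The four article types give six pairwise comparisons.
cases = ["viz+table", "viz only", "table only", "neither"]
pairs = list(combinations(cases, 2))

# Hypothetical log transformed values for three small groups.
f_stat = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Hypothetical raw p-values, one per pairwise comparison.
adjusted = bonferroni([0.004, 0.20, 0.001, 0.50, 0.03, 0.60])
```

The Bonferroni step is deliberately conservative: with six comparisons, a raw p-value has to be below roughly 0.008 to stay under 0.05 after adjustment.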

Results for number of comments per article are similar:

  1. Articles with both a visualization and a table produced the largest response with an average of 17.40 comments per article (SD=24.10);
  2. Articles with a visualization and no table produced an average of 12.58 comments per article (SD=17.08);
  3. Articles with no visualization and a table produced an average of 13.78 comments per article (SD=26.15);
  4. And finally articles with neither visualization nor table produced an average of 11.62 comments per article (SD=17.52).

Again I ran an ANOVA with post-hoc Bonferroni tests to assess statistically significant differences between means. This time there was only one statistically significant difference: articles with both a visualization and a table (case 1) have a higher number of comments than articles with neither a visualization nor a table (case 4), with a p value of 0.04. Again, the combination of visualization and data table drove more of an audience response in terms of commenting behavior.

The overall take-away here is that people like to talk about articles (at least in the context of the audience of the Guardian Datablog) when both data and visualization are used to tell the story. Articles which used both had more than twice the number of tweets and about 1.5 times the number of comments versus articles which had neither. If getting people talking about your reporting is your goal, use more data and visualization, which, in retrospect, I probably also should have done for this blog post.
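The ratios quoted above follow directly from the group means for case 1 (both) and case 4 (neither). A quick check, using only the figures reported in this post:

```python
# Group means reported above for case 1 (both) and case 4 (neither).
tweets_both, tweets_neither = 46.0, 19.56
comments_both, comments_neither = 17.40, 11.62

tweet_ratio = tweets_both / tweets_neither
comment_ratio = comments_both / comments_neither

print(f"tweets: {tweet_ratio:.2f}x, comments: {comment_ratio:.2f}x")
# prints "tweets: 2.35x, comments: 1.50x"
```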

As a final thought I should note there are potential confounds in these results. For one, articles with data in them may stay “green” for longer thus slowly accreting a larger and larger social media response. One area to look at would be the acceleration of commenting in addition to volume. Another thing that I had no control over is whether some stories are promoted more than others: if the editors at the Guardian had a bias to promote articles with both visualizations and data then this would drive the audience response numbers up on those stories too. In other words, it’s still interesting and worthwhile to consider various explanations for these results.

A Functional Roadmap for Innovation in Computational Journalism

By: Nicholas Diakopoulos, Ph.D.
School of Communication and Information, Rutgers University
Original Version January, 2010; Updated April 2011. A PDF is also available.

Overview

Journalism in all of its senses spans a spectrum of meaning ranging from social purpose (e.g. watchdogging), to professionalized practice (e.g. ethics and professional standards), to the functional processes that journalists employ. Innovation in journalism can happen within or across this hierarchy of meanings, but in this paper, in particular, I will explore the role that computing can play in the process aspects of journalism. My intent is to lay a foundation of computational thinking for journalistic processes upon which updated journalistic practices and reinvigorated journalistic purposes can be built.

From a process perspective, Computational Journalism is the application of computing and computational thinking to the activities of journalism including information gathering, organization and sensemaking, communication and presentation, and dissemination and public response to news information, all while upholding core values of journalism such as accuracy and verifiability. It is inclusive of CAR (Computer-Assisted Reporting) but distinctive in its focus on the processing capabilities (e.g. aggregating, relating, correlating, abstracting) of the computer in comparison to mundane aspects of storage or access. The field draws on technical sub-fields of computer science including information retrieval, artificial intelligence, content analysis, visualization, personalization, and recommender systems as well as aspects of social computing and information science.

While Computational Journalism is unlikely to ever replace journalists with computers it does promise a future where the goals of human journalists are greatly enabled and augmented through computing. Moreover, its pursuit may also inform developments in Computer Science, by, for example, driving research in visual analytics and visualization, time-critical information processing, trustworthy computing, and user interfaces.

In the remainder of this paper I will discuss opportunities for innovation along the lines of the process aspects of journalism identified above. My goal is to stimulate new research and applications of these processes in the context of journalism and explore the challenges and opportunities in this space.

Information Gathering

The adoption of cheap and ubiquitous devices with photo and video capability has already had a substantial impact on how stories are reported, both in the mainstream media and through citizen journalism. While sensing hardware has gotten cheaper and more pervasive, social networking systems (e.g. Facebook) and social awareness streams (e.g. Twitter) have explicitly connected the what of sensing with the who is sensing or reporting.

The process of information gathering and reporting largely hinges on finding and verifying sources of information. Some of the best (and most difficult) journalism hinges on cultivating relationships over time with a personal network of sources. What’s different about the sources that are available from social networks is that, although they are by and large public, they may not be familiar to the journalist. Finding the desired sources while characterizing the expertise and veracity of those sources represents a barrier to fully realizing the journalistic value from these networks.

There are at least four aspects of information gathering from social networks and awareness streams that can be enhanced computationally: (1) source expertise finding, (2) source characterization (e.g. historical biases), (3) cross-referencing and assessing the independence of breaking eye-witness reports, and (4) determining the originating source of information. For instance, a computational process could automatically compute the sentiment (i.e. pro / con) of a source with respect to a range of topics or issues based on their history of Twitter messages. Such scores could then be used to inform journalists about the background of a potential interviewee. Or, consider a breaking news scenario where a journalist is attempting to cross-reference messages for validity. Algorithms can be developed to estimate the independence of those sources or to trace information back to a likely originating source. These are just a few examples of the potential areas for technical innovation in the area of information gathering.
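To make the source characterization idea concrete, here is a toy sketch that scores a source's pro / con leaning per topic from a message history. Everything in it is an assumption for illustration: the tiny word lists, the messages, and the scoring rule (positive minus negative word count, averaged per topic). A real system would need a proper sentiment model and far more careful topic matching.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny sentiment lexicons.
POSITIVE = {"support", "good", "great", "benefit"}
NEGATIVE = {"oppose", "bad", "harmful", "fail"}

def topic_sentiment(messages, topics):
    """Average (positive - negative) word count over messages mentioning each topic."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for text in messages:
        words = set(re.findall(r"[a-z']+", text.lower()))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        for topic in topics:
            if topic in words:
                totals[topic] += score
                counts[topic] += 1
    # Topics the source never mentioned are left out of the result.
    return {t: totals[t] / counts[t] for t in topics if counts[t]}

# Hypothetical message history for one source.
history = [
    "I support the new transit plan, great idea",
    "Transit delays are bad",
    "Taxes are bad",
]
profile = topic_sentiment(history, ["transit", "taxes", "schools"])
print(profile)  # prints {'transit': 0.5, 'taxes': -1.0}
```

A journalist-facing tool might show such a profile next to a potential interviewee, together with the underlying messages so the scores can be verified.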

Organization and Sensemaking

With a growth in information gathering capabilities comes the difficulty journalists face in organizing and making sense of all of that information. This is a process where computers have already had a significant impact, namely through Computer-Assisted Reporting (CAR). CAR tools are usually generic in the sense that they are widely applicable to different stories, though many tools are designed for specific data types such as geographic, temporal, or network.

While many CAR tools succeed in enabling journalists to organize their information there is still considerable room for improvement in the area of sensemaking. In particular, computational perception and content analysis enable computers to convert signals about the world (including everything from sensor values to Twitter messages) into semantically and contextually laden symbols (e.g. names of people or places) or aggregate and derivative values (e.g. the sentiment or emotion of a message, the novelty or unusualness of a message with respect to an event).

Together with interactive and visual ways of presenting these computed, “semantic” facets of information there is a huge potential space for innovation in journalism tools. Some of this innovation is happening in other domains that draw on a similar process of sensemaking, such as intelligence analysis. These tools can be evaluated to better understand how they do or do not work in the context of journalism, and, in general, computational tools developed to enable sensemaking will need rigorous attention to the evaluation of their utility in real situations. Finally, sensemaking tools not only have potential for helping journalists but also for helping “readers” make sense of growing online repositories of newsworthy content and data.

Communication and Presentation

Once a story has been organized and made sense of, the next process entails communicating and presenting it in a relevant and interesting way. And while I won’t argue that every story demands it, there will be some stories that benefit from computationally infused presentations of content. A journalist might use computation in such a story by making models or data interactive in a way that informs the user more than reading a static story would.

User interfaces need to innovate more generic paradigms to compellingly communicate complex stories via models, data, simulation, and games. For instance, recent research into playable data graphics has looked at how to add game elements such as goals, scores, and advancement to the way users interact with online visualized data. Other types of newsgames explore editorial simulation or decision making processes. One thing to consider as we invent these new experiences is how journalistic norms and values play out in interactive media. There are certain notions of interactive rhetoric and literacy that need to be taken into account when training computational journalists.

As governmental data becomes emancipated from closed databases (as is the current executive order in the U.S.) the opportunities for telling stories through models, data, simulation, and games will only grow. There is a range of potential new (and not yet invented) storytelling forms that combine elements of interactivity and computing with games, data, and news content. This will be an area ripe for alternative methods of communicating complex information in engaging and interactive formats.

Dissemination and Public Response

From a business perspective, one of the most disruptive shifts in journalism has been the process of digitization and dissemination of content online. This transition took content that was once constrained by a fixed medium and brought the variable costs of publishing space close to zero. The implication of this shift is that there is much more content out there and, practically speaking, many more ways to compete for attention for content. With unlimited space come the issues of information overload and scale.

Computation can improve the process of dissemination by addressing information overload and scale issues through, for instance, personalization and content adaptation systems as well as recommender systems. Many of the methods developed will also be applicable to monetization strategies since the fundamental scale issue revolves around matching a paucity of attention with the right content in order to drive higher advertising revenue.

Another implication of unlimited publishing space is that instead of being constrained to a narrow “letters to the editor” page, public response can instead expand to whatever the community needs dictate. In managing the process of interaction with the public response, journalists are encountering this scale problem in terms of interacting with and moderating users’ content in online commenting systems.

In particular there is a lot that computation can offer to improve online commenting systems, both from the perspective of a journalist dealing with moderation as well as for users of the commenting system. Content analysis, such as natural language processing, computational linguistics, and standard information retrieval techniques, can help with both the scale and the quality of the discourse by introducing new ways of filtering and organizing comments. For instance, content analysis could be used to rank comments by (1) relevance to the story, (2) subjectivity or objectivity, or (3) degree of politeness. This could aid the process of journalists interacting with readers as well as readers interacting with readers by making it easier to find high quality contributions.
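As one illustration of ranking by relevance, the sketch below scores each comment by the cosine similarity between its bag-of-words vector and the story's. This is a standard information retrieval baseline rather than a method proposed in this paper, and the story text and comments are invented for the example.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_relevance(story, comments):
    """Return comments sorted from most to least similar to the story text."""
    story_vec = Counter(tokens(story))
    return sorted(
        comments,
        key=lambda c: cosine(story_vec, Counter(tokens(c))),
        reverse=True,
    )

# Hypothetical story text and comments.
story = "Government spending on health rose last year"
comments = [
    "First!",
    "Health spending rose because of an ageing population",
    "Nice chart",
]
ranked = rank_by_relevance(story, comments)
```

Raw word overlap favors comments that echo the article's vocabulary, so in practice it would likely be combined with the other signals mentioned above, such as subjectivity or politeness.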

Looking Ahead

Technology is rapidly changing the landscape of how news information is gathered, made sense of, communicated, and disseminated. To pave the way to the future, journalism schools need to train more computationally literate journalists who develop a deep understanding of notions of abstraction, modeling, parameterization, aggregation, scalability, and programming. And while industry grapples with the culture clash between engineers and journalists as well as the classic innovator’s dilemma, there will be plenty of opportunities for the new computational journalists to reinvent the way news information is gathered, organized, presented, and disseminated.

Histrionic Visualization: The Rise of Theatrical Visual Presentation of Data

Earlier this year, while preparing for a workshop on Telling Stories with Data, I coined the term “Histrionic Visualization” to account for certain theatrical presentations of information visualizations that I had seen. I wanted to expand on this idea a bit here.

Perhaps the best way to define what I mean by histrionic visualization is to cite some examples. For instance, Al Gore explains climate change data and visualizations in his movie An Inconvenient Truth. Gore combines a linear (sometimes animated) slide deck together with his voice over (and occasional sound effects) to present his data to the audience.

Another example of this idea came about when CNN started using Perceptive Pixel’s touch screen technology to allow on-air journalists the ability to manipulate data on the display using touch and gestures while they were broadcasting. This led to the likes of John King dynamically manipulating election visualizations while on air.

From these examples, hopefully it’s a bit clearer now what I’m talking about when I say “histrionic visualization”. These are embodied presentations of information where the physicality of the presentation itself becomes the defining factor. What new forms of tangible interaction or interfaces could enable further development here?

I think this idea could actually go a lot deeper than the examples I’ve seen too. It seems to me that acting out the presentation of visualizations is an area ripe for study. Does the physicality of the presentation help people learn or crystallize knowledge from the visualization? Are these presentations more engaging? How could you incorporate the audience into the interaction?

Moreover, could this become a new form of art, where talented storytellers weave data and visualization together with acting to engage an audience in the performance? How would the 2010 U.S. census look when presented on stage?

Content Specific Computational Journalism

Much of my prior work in the field of computational journalism has focused on building tools that could either be used by journalists or readers in their respective capacities as information producers or consumers. And the recent Duke CJ Report heavily emphasized the role of computation in informing discovery tools to help journalists uncover new stories in vast corpora of data. With the recent push toward civic data transparency by the US Government, computational accountability tools will be essential to uncovering malfeasance.

But here I’m going to suggest something a bit different by setting up a spectrum of computational journalism artifacts along the dimension of content specificity. On one end you have the things I just talked about: tools that help journalists uncover stories and make sense of information. These tools are practically independent of any semantics associated with information but can be customized for different data types (e.g. geographic, time, network etc.). They’re also geared toward insight generation and designed for the kinds of work processes and tasks that journalists engage in on a daily basis.

On the other end of the spectrum there are computationally infused presentations of stories. A computational journalist might use computation in such a story by making models or data interactive. For example, one interactive graphic I worked on for SacBee.com is based on an evaporative water model together with scraped hourly Sacramento weather conditions. The goal was to paint a picture of the model and help people understand when best to water their lawns.

Another example comes from editorial simulations such as September 12th. In that interactive, an editorial model describes the relationship between terrorists and anti-terrorist bombing in the Middle East. But while the model and mechanic are of course described abstractly, the semantics of the graphics and interactions are what is essential to the presentation.

Content specific presentations rely heavily on the semantics of the information to convey meaning. Rather than being generic information tools, they intertwine computation with the story itself. Interaction, information, and visual design become essential to communicating a semantically laden model. And in comparison to generic tools, content specific CJ needs to be designed with a “reader” in mind, since its purpose is to disseminate insights (or opinions) to the public.

There’s value to both kinds of computational journalism: tools to help uncover stories and develop models, and specific presentations to effectively communicate those models.