Note: this is cross-posted on the CUNY Tow-Knight Center for Entrepreneurial Journalism site.
Recently there’s been a surge of interest in automatically generating news stories. The poster child is a start-up called Narrative Science, which has earned coverage by the likes of the New York Times, Wired, and numerous blogs for its ability to automatically produce actual, readable stories about things like sports games or companies’ financial reports based on nothing more than numeric data. It’s impressive stuff, but it doesn’t stop me from thinking: What’s next? In the rest of this post I’ll talk about some challenges, such as story schema and modality, data context, and text transparency, that, if addressed, could improve future story generation engines.
Without inside information we can’t say for sure exactly how Narrative Science (NS) works, though there are some academic systems out there that provide a suitable analogue for description. There are two main phases that have to be automated in order to produce a story this way: the analysis phase and the generative phase. In the analysis phase, numeric data is statistically analyzed for things like trends, clusters, patterns, and outliers or exceptions. The analysis phase also includes the challenging aspect of condensing or selecting the most interesting things to include in the story (see Ramesh Jain’s “Extreme Stories” for more on this).
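To make the analysis phase a bit more concrete, here is a minimal Python sketch of the two steps described above: finding candidate facts (a trend and outliers) in a numeric series, then selecting the most interesting ones. This is not how Narrative Science works, which we don't know; the function names, the z-score threshold, the scoring rule, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, stdev

def analyze(series, z_threshold=2.0):
    """Return candidate findings from a numeric series, each with an interest score.

    Needs at least two data points. The scoring is an illustrative assumption.
    """
    mu, sigma = mean(series), stdev(series)
    findings = []

    # Trend: difference between the mean of the second half and the first half.
    half = len(series) // 2
    change = mean(series[half:]) - mean(series[:half])
    findings.append({"kind": "trend", "change": change,
                     "score": abs(change) / sigma if sigma else 0.0})

    # Outliers: points whose z-score exceeds the (assumed) threshold.
    for i, x in enumerate(series):
        z = (x - mu) / sigma if sigma else 0.0
        if abs(z) > z_threshold:
            findings.append({"kind": "outlier", "index": i, "value": x,
                             "score": abs(z)})
    return findings

def select(findings, k=1):
    """Keep the k highest-scoring findings for the story."""
    return sorted(findings, key=lambda f: f["score"], reverse=True)[:k]

points_per_game = [10, 11, 9, 10, 12, 10, 11, 40]  # made-up data
top = select(analyze(points_per_game))
```

With this made-up series, the single selected finding is the unusually high last value rather than the overall upward drift. A real system would need a far richer notion of "interesting" than a z-score.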
Following analysis and selection comes the task of figuring out an interesting structure for ordering the information in the story: a schema. Narrative Science differentiates itself primarily, I think, by paying close attention to the structure of the stories it generates. Many of the precursors to NS were stuck in the mode of presenting generated text in a chronological schema, which, as we know, is quite boring for most stories. Storytelling is really all about structure: providing the connections between aspects of the story, its actors and setting, using some rhetorical ordering that makes sense for and engages the reader. There are whole books written on how to effectively structure stories to explore different dramatic arcs or genres. Many of these different story structures have yet to be encoded in algorithms that generate text from data, so there’s lots of room for future story generation engines to explore diverse text styles, genres, and dramatic arcs.
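As a rough illustration of what it means to encode a schema in an algorithm, the Python sketch below orders the same set of facts in two ways: chronologically, and by a news-style "inverted pyramid" that leads with the outcome. The facts, the role labels, and the two schema definitions are hypothetical; they only show that the ordering rule can be swapped independently of the facts themselves.

```python
# Hypothetical facts, roughly as an analysis phase might hand them over.
facts = [
    {"role": "background", "time": 1, "text": "The teams met for the first time this season."},
    {"role": "climax", "time": 3, "text": "A late goal decided the game."},
    {"role": "outcome", "time": 4, "text": "The home side won 2-1."},
    {"role": "buildup", "time": 2, "text": "The visitors led at halftime."},
]

# Assumed priority of roles for a lead-with-the-result structure.
PYRAMID_ORDER = ["outcome", "climax", "buildup", "background"]

# Each schema is just a sort key over facts.
SCHEMAS = {
    "chronological": lambda fact: fact["time"],
    "inverted_pyramid": lambda fact: PYRAMID_ORDER.index(fact["role"]),
}

def tell(facts, schema):
    """Order the facts according to the named schema and join them into text."""
    ordered = sorted(facts, key=SCHEMAS[schema])
    return " ".join(fact["text"] for fact in ordered)

chronological_story = tell(facts, "chronological")
pyramid_story = tell(facts, "inverted_pyramid")
```

Richer dramatic arcs would need more than a sort key, for example connective phrases between facts, but the separation between what is said and the order in which it is said is the same.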
It’s also important to remember that text has limitations on the structures and schemas it supports well. A textual narrative schema might draw readers in, but, depending on the data, a network schema or a temporal schema might expose different aspects of a story that aren’t apparent, easy, or engaging to represent in text. This leads us to another opportunity for advancement in media synthesis: better integration of textual schemas with visualization schemas (e.g. temporal, hierarchical, network). For instance, there may be complementary stories (e.g. change over time, comparison of entities) that are more effectively conveyed through dynamic visualizations than through text. Combining these two modalities has been explored in some research, but there is much work to do in thinking about how best to combine textual schemas with different visual schemas to effectively convey a story.
There has also been recent work looking into how data can be used to generate stories in the medium of video. This brings with it a whole slew of challenges different from those of text generation, such as the role of audio, and how to crop and edit existing video into a coherent presentation. So, in addition to better incorporating visualization into data-driven stories, I think there are opportunities to think about automatically composing stories from such varied modalities as video, photos, 3D, games, or even data-based simulations. If you have the necessary data for it, why not include an automatically produced simulation to help communicate the story?
It may be surprising to know that text generation from data has actually been around for some time now. The earliest reference that I found goes back 26 years to a paper that describes how to automatically create written weather reports based on data. And then ten years ago, in 2002, we saw the launch of Newsblaster, a complex news summarization engine developed at Columbia University that took articles as a data source and produced new text-based summaries using articles clustered around news events. It worked all right, though starting from text as the data has its own challenges (e.g. text understanding) that you don’t run into if you’re just using numeric data. The downside of using just numeric data is that it is largely bereft of context. One way to enhance future story generation engines could be to better integrate text generated from numeric data with text (collected from clusters of human-written articles) that provides additional context.
The last opportunity I’d like to touch on here relates to the journalistic ideal of transparency. I think we have a chance to embed this ideal into algorithms that produce news stories, which often articulate a communicative intent combined with rules or templates that help achieve that intent. It is largely feasible to link any bit of generated text back to the data that gave rise to that statement – in fact it’s already done by Narrative Science in order to debug their algorithms. But this linking of data to statement should be exposed publicly. In much the same way that journalists often label their graphics and visualizations with the source of their data, text generated from data should source each statement. Another dimension of transparency practiced by journalists is to be up-front about the journalist’s relationship to the story (e.g. if they’re reporting on a company that they’re involved with). This raises an interesting and challenging question of self-awareness for algorithms that produce stories. Take for instance this Forbes article produced by Narrative Science about New York Times Co. earnings. The article contains a section on “competitors”, but the NS algorithm isn’t smart enough or self-aware enough to know that it itself is an obvious competitor. How can algorithms be taught to be transparent about their own relationships to stories?
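Linking statements back to data is simple to sketch. In the Python example below, every generated sentence carries the names of the data fields it was computed from, so a publishing layer could expose that link to readers. The `Statement` class, the field names, and the revenue figures are all hypothetical; this is one possible shape for such provenance, not a description of how Narrative Science does its internal linking.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    text: str       # the generated sentence
    sources: list   # names of the data fields the sentence was derived from

def describe_revenue(data):
    """Generate one sentence about revenue and record which fields it used."""
    prev, curr = data["revenue_prev"], data["revenue_curr"]
    pct = (curr - prev) / prev * 100
    verb = "rose" if pct > 0 else "fell" if pct < 0 else "was unchanged at"
    if pct == 0:
        text = f"Revenue {verb} ${curr:.1f} million."
    else:
        text = f"Revenue {verb} {abs(pct):.1f}% to ${curr:.1f} million."
    return Statement(text, ["revenue_prev", "revenue_curr"])

def render_with_sources(statement, data):
    """Append the underlying data values to the sentence, like a graphic's source line."""
    cited = ", ".join(f"{name}={data[name]}" for name in statement.sources)
    return f"{statement.text} [source: {cited}]"

# Made-up figures, in millions of dollars.
data = {"revenue_prev": 200.0, "revenue_curr": 210.0}
statement = describe_revenue(data)
annotated = render_with_sources(statement, data)
```

In a web story the bracketed source note could just as well become a link or a hover annotation. The harder question raised above, an algorithm disclosing its own relationship to the story, is not something a sketch like this addresses.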
There are tons of exciting opportunities in the space of media synthesis. Challenges like exploring different story structures and schemas, providing and integrating context, and embedding journalistic ideals such as transparency will keep us more than busy in the years and, likely, decades to come.