Structured Data vs. Narrative

November 2nd, 2009 § 3

On Friday, Dan Conover published a post explaining his vision for the place of long-form narrative relative to other story packages, structured data in particular. His first point is that narrative, the go-to format for classically trained journalists, is often a poor fit for the information at hand:

I already understood that stories are the way people make sense of their lives, but it was during 2004-05 that I began to see how journalistic narrative was distorting the way we viewed the world.

Dan’s self-described “money quote” for the piece is this:

Today’s revolution isn’t about killing narrative, but about inventing box scores for actions that don’t take place in ballparks.

Narrative is more obviously capable of distorting our understanding of the world, but structured data has a similar effect. Sabermetrics can’t tell you how players’ personal lives affect their teams’ performances because someone judged that such details should not be included in box scores.

In comparison with prose, “structured” data really means a simpler, more granular structure. An essay is as much a data structure as any set of box scores or relational database. A simple, granular data structure is attractive compared to prose narrative for two reasons:

  1. Information in simple granular structures can afford to be more concise because the context of that information is provided by the structure, rather than by more content.
  2. It facilitates algorithmic automation — it’s easier to make a computer do the boring parts of interpretation, like searching or calculating confidence distributions.
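To make those two points concrete, here is a minimal Python sketch. The `BoxScoreLine` record, its field names and the sample numbers are all invented for illustration; none of them come from Dan's post or from a real game. The field names carry the context that prose would have to spell out, and a computer can total or search the records without reading a single sentence.

```python
from dataclasses import dataclass


@dataclass
class BoxScoreLine:
    # Hypothetical record: the field names supply the context
    # that a prose account would have to state in words.
    player: str
    at_bats: int
    hits: int


def batting_average(lines):
    """Total hits divided by total at-bats; 0.0 when there are no at-bats."""
    at_bats = sum(line.at_bats for line in lines)
    hits = sum(line.hits for line in lines)
    return hits / at_bats if at_bats else 0.0


def find_player(lines, name):
    """Return every line recorded for one player (the 'searching' part)."""
    return [line for line in lines if line.player == name]


# Made-up sample data, for illustration only.
lines = [
    BoxScoreLine("Player A", 4, 2),
    BoxScoreLine("Player B", 3, 0),
    BoxScoreLine("Player A", 5, 1),
]

print(batting_average(lines))
print(find_player(lines, "Player A"))
```

Notice what the record leaves out: there is no field for anything that happened off the field, so no amount of computation on these lines will ever surface it.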

The implication of Dan’s first thesis, or at least the way he worded it in his long-form version, is that structured data doesn’t distort the way we view the world. It does. I don’t think Dan actually believes in the objectivity of data, but it’s a conversation worth having. As Dan says, narrative tends to impose a conflict and a resolution on a story, but any data structure will impose something on the facts. Structure is as structure does, whether narrative, database or otherwise. The trick is to choose a structure that represents what we decide is valuable to know. Journalism itself has a built-in bias towards covering information meeting the criteria of newsworthiness:

  • Timing
  • Impact
  • Proximity
  • Prominence
  • Human Interest
  • Exception
  • Conflict
  • Whatever floats your editor’s boat

The difference I want to highlight is that, with structured data, the judgement of what is and isn’t important is no longer made on a case-by-case basis. A reporter writing a traditional article can decide that one eyewitness quote is more important than another. A database may as well store them all and paint a much bigger picture, but another database may only store the official police report. The judgements are in the structure, and are then inherited by any observation of reality you record in that structure.

I’m not calling that a good or a bad thing, only something to be mindful of.
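For the programmers in the audience, here is a rough Python sketch of how a structure inherits judgements. Everything in it is hypothetical: the `Incident` class, the two recording functions and the sample text are assumptions made up for this example, not a description of any real database. The same observation goes into two structures, and each structure decides in advance what survives.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    # Hypothetical raw observation gathered by a reporter.
    police_report: str
    eyewitness_quotes: list = field(default_factory=list)


def record_everything(incident):
    """A structure whose designer judged every quote worth keeping."""
    return {
        "police_report": incident.police_report,
        "eyewitness_quotes": list(incident.eyewitness_quotes),
    }


def record_official_only(incident):
    """A structure whose designer judged only the official account worth keeping."""
    return {"police_report": incident.police_report}


# Made-up sample incident, for illustration only.
incident = Incident(
    police_report="Official summary of the incident.",
    eyewitness_quotes=["First witness account.", "Second witness account."],
)

print(record_everything(incident))
print(record_official_only(incident))
```

Neither function weighs one quote against another the way a reporter would. The choice was made once, when the structure was designed, and every incident recorded afterwards lives with it.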

How do journalists verify data?

October 27th, 2009 § 2

In early September, I was arguing with Chicago journalist Robb Montgomery (@robbmontgomery) via Twitter over the journalistic value of automated data organization. For anyone who doesn’t know, EveryBlock is a Web site that scrapes information from dozens of different data sources and organizes it geographically. The result is an organized portal for local information. Robb claims that that isn’t journalism. I say it is. It was originally funded by a Knight News Challenge grant, after all, so at least it’s trying to be journalism.

At one point, Robb asked if EveryBlock fact-checks the data it uses. That caught me off-guard. Why would you want to fact-check data? Where do you begin if you decide you want to? It’s a perfectly relevant question, though. Fact-checking is a big deal for journalism. “If your mother tells you she loves you, get two other sources to confirm it,” the saying goes.

Robb was arguing that a journalist is better at fact-checking than a programmed bot, which makes sense if you’re only fact-checking a single crime, or a handful of crimes at most. Let’s see that hypothetical journalist follow up on the details of every reported crime in Chicago. Even just for yesterday. You might as well fact-check every pixel in a photo. Even if you can, what journo-economic justification might you have for actually doing it? Now is not the time for journalists to waste time, and therefore money, where they can avoid it.

I’m guessing that that isn’t what Robb had in mind. As far as I can tell, Robb thinks of each datum as a statement of fact. Each datum aspires to be a representation of some truth.

My formal education in math and my professional experience as a database application developer cultivated a very different understanding. Each datum represents anything from what is probable to what is plausible to what we hope is true, depending on the context. In the context of journalism, as with science, each datum is merely a piece of evidence, like a quote or a photo — an observation of truth (with a perspective, and therefore bias), not truth itself in the sense romanticized by “classical” journalism education.

Robb and I clearly have different expectations of any data-based product with the label “journalism” attached to it. The two possibilities I see, which could both be true, are:

1) Robb expects each datum to be a self-contained representation of truth requiring the redundancy of journalistic scrutiny, whereas I expect each datum to be mere evidence in support of some related truth;

OR

2) Robb and I share the same understanding of data as evidence, and Robb expects anything called “journalism” to synthesize the evidence into rigorous statements of fact, whereas I’m satisfied to present organized piles of evidence and expect audiences to connect the dots (or not) for themselves.

So what standards does a collection or presentation of data need to satisfy to be a self-contained journalism product?