When Literature and Big Data Combine

“Literature is the opposite of data,” wrote the novelist Stephen Marche. Such a statement made sense even a few decades back, but today? Let’s take a look.

Today, as Dana Mackenzie’s article puts it, “the scientific method is tiptoeing into the English department”. Huge amounts of literature have been digitized, and once a text is digitized, somebody is bound to start hurling algorithms at it to find…well, something. By 2011, Google’s N-gram server let you search Google Books for the frequency of words or word combinations in the books in its database. There are, of course, obvious limits to the significance of such raw counts (beyond perhaps tracking when words caught on or died out).
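To make the idea of raw counts concrete, here is a minimal sketch of an n-gram frequency lookup. It is not how Google's server works internally; the function names, the crude tokenization and the two-sentence "corpus" are all assumptions made up for illustration.

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count word n-grams in a text, ignoring case and edge punctuation."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    # zip over n staggered copies of the word list to form n-grams
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

def relative_frequency(corpus_by_year, phrase):
    """Share of all n-grams in each year's text that match the phrase."""
    n = len(phrase.split())
    result = {}
    for year, text in corpus_by_year.items():
        counts = ngram_counts(text, n)
        total = sum(counts.values())
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result

# Hypothetical toy corpus: one short "book" per year.
corpus = {
    1800: "The whale swam. The ship sailed on the sea.",
    1900: "The telephone rang and the telephone rang again.",
}
freq = relative_frequency(corpus, "telephone")
print(freq)
```

A count like this can show when “telephone” enters the language, but it says nothing about what the word meant to the people using it, which is exactly the limitation noted above.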

Enter topic modeling:
“A topic-modeling algorithm infers, for each word in a document, what topic that word refers to.”
Does the word “black” mean a color? A race? Something bad? The algorithm “produces ‘bags’ of words that belong together”, and leaves it to the human reader to work out the meaning from the other words in the bag. That is why Johanna Drucker says this about “digital humanities”:
“It is not a substitute for human reading, but a prosthetic extension of our capacity.”
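For the curious, here is a toy sketch of how such “bags” can be produced. The article does not say which algorithm is meant; this assumes collapsed Gibbs sampling for latent Dirichlet allocation (LDA), one common approach, and every name, parameter value and the three mini-documents below are invented for illustration.

```python
import random

def toy_topic_model(docs, num_topics=2, iterations=200,
                    alpha=0.1, beta=0.01, seed=0):
    """Assign a topic to every word token (toy LDA, collapsed Gibbs sampling)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                      # tokens per topic

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Start from a random topic for every token.
    assignments = []
    for d, doc in enumerate(docs):
        row = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, row):
            update(d, w, z, +1)
        assignments.append(row)

    # Repeatedly re-draw each token's topic given all the other tokens.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + len(vocab) * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                update(d, w, z, +1)
    return assignments, topic_word

# Hypothetical mini-documents in which "black" appears in different company.
docs = [
    "black paint white paint red paint".split(),
    "black night dark omen dark night".split(),
    "red paint black paint white canvas".split(),
]
assignments, topic_word = toy_topic_model(docs)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("bag", k, top)
```

The output is only lists of words that tend to occur together. Deciding that one bag is about painting and another about foreboding is still the reader's job, which is Drucker's point.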

And with data analysis comes that “indispensable element of the scientific method: falsifiability” (the demand that a theory make claims that could be disproven). Ted Underwood used it to test how the use of “old” Anglo-Saxon words changed between 1700 and 1900. His finding? They increased a lot in poetry, moderately in fiction, and remained unchanged in non-fiction. Breakdowns like this can then be used to infer finer details, such as class differences.

Of course, not all findings of such data crunching are mind-blowing. When Franco Moretti analyzed 7,000 British novels published between 1740 and 1850, for instance, he found that the titles got a lot shorter.

Nonetheless, Matthew Jockers believes that:
“We are reaching a tipping point. Today’s student of literature must be adept at gathering evidence from individual texts and equally adept at mining digital text repositories.”
(Teaching might change too: Jockers assigned more than 1,200 novels in one class. “Luckily for the students, they didn’t have to read them,” he says.)

Melissa Terras warns us that the Garbage In, Garbage Out principle applies:
“Even big data patterns need someone to understand them. And to understand the question to ask of the data requires insight into cultures and history.”
Can’t argue with that part.
