When Literature and Big Data Combine
“Literature is the opposite of data,” wrote the novelist Stephen Marche. Such a statement made sense even a few decades back, but today? Let’s take a look.
Today, Dana Mackenzie’s article says, “the scientific method is tiptoeing into the English department.” Huge amounts of literature have been digitized, and once text is digitized, somebody will surely start hurling algorithms at it to find…well, something.
In 2011, Google’s N-gram server allowed you to search Google Books for the frequency of words or word combinations in the books in its database. There are, of course, obvious limitations to the significance of such raw counts (beyond perhaps showing when words caught on or died out).
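To make the idea of counting word combinations concrete, here is a minimal sketch in Python. It is not Google's implementation: the tiny corpus, the function names and the choice to divide by the total number of n-grams per year are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(corpus_by_year, phrase):
    """Share of all n-grams in each year that match the given phrase."""
    n = len(phrase.split())
    result = {}
    for year, texts in sorted(corpus_by_year.items()):
        counts = Counter()
        for text in texts:
            # Naive tokenization: lowercase and split on whitespace.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[phrase.lower()] / total if total else 0.0
    return result


# Made-up sentences standing in for digitized books (hypothetical data).
corpus_by_year = {
    1800: ["the steam engine is new", "the old mill by the river"],
    1850: ["the steam engine changed the mill", "a steam engine for every mill"],
}

freq = relative_frequency(corpus_by_year, "steam engine")
for year, share in freq.items():
    print(year, round(share, 3))
```

Dividing by the yearly total keeps a year with more books from looking like a year in which the phrase was more popular. Even so, the number only says how often a phrase appears, not what it means.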
Enter topic modeling:
“A topic-modeling algorithm infers, for each word in a document, what topic that word refers to.”
Does the word “black” mean color? Race? Something bad? The algorithm “produces ‘bags’ of words that belong together,” and leaves it to the human reader to decide the meaning from the context of the other words in the group. That is why Johanna Drucker says this about “digital humanities”:
“It is not a substitute for human reading, but a prosthetic extension of our capacity.”
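For the curious, here is a rough sketch of how such “bags” can be produced. The article does not say which algorithm the researchers used; this example assumes one common choice, latent Dirichlet allocation fitted by collapsed Gibbs sampling. The toy documents, the function name and the parameter values are all made up for illustration.

```python
import random
from collections import Counter


def toy_topic_model(docs, num_topics=2, iterations=200,
                    alpha=0.1, beta=0.01, seed=0):
    """Assign a topic to every word via collapsed Gibbs sampling (LDA)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # words per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    assignments = []

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # A topic is likelier if this document uses it a lot
                # and if this word already appears in it a lot.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, topic_word


# Made-up mini documents in which "black" appears in two different contexts.
docs = [
    "black ink on white paper".split(),
    "black paint and white paint".split(),
    "black night dark fear".split(),
    "dark night and cold fear".split(),
]

assignments, topic_word = toy_topic_model(docs)
for k, counts in enumerate(topic_word):
    bag = [w for w, c in counts.most_common(5) if c > 0]
    print(f"bag {k}: {bag}")
```

Note that the output is just numbered bags of words. Nothing in the code knows that one bag is “about” color and another “about” fear; that labeling is exactly the part left to the human reader.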
And with data analysis comes that “indispensable element of the scientific method: falsifiability” (the demand that a theory make claims that can be disproven). Ted Underwood put this to work when he tested how the use of “old” Anglo-Saxon words changed between 1700 and 1900. His finding? It increased a lot in poetry, increased moderately in fiction and remained unchanged in non-fiction. Such breakdowns can then be used to dig into finer details, class differences and so on.
Of course, not all findings of such data crunching are mind-blowing. When Franco Moretti analyzed 7,000 British novels published between 1740 and 1850, he found that the titles got a lot shorter.
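A measurement like this is simple to sketch. The snippet below is not Moretti's method or data: it assumes title length is measured in words, groups titles by decade, and uses a handful of familiar titles picked purely for illustration.

```python
from collections import defaultdict

# A few well-known titles chosen for illustration, not a real sample.
titles = [
    (1740, "Pamela; or, Virtue Rewarded"),
    (1749, "The History of Tom Jones, a Foundling"),
    (1813, "Pride and Prejudice"),
    (1847, "Jane Eyre"),
    (1847, "Wuthering Heights"),
]


def mean_title_length_by_decade(records):
    """Average number of words per title, grouped by decade of publication."""
    lengths = defaultdict(list)
    for year, title in records:
        decade = year // 10 * 10
        lengths[decade].append(len(title.split()))
    return {decade: sum(v) / len(v) for decade, v in sorted(lengths.items())}


averages = mean_title_length_by_decade(titles)
for decade, avg in averages.items():
    print(f"{decade}s: {avg:.1f} words")
```

Five hand-picked titles prove nothing, of course. The point is only that once thousands of titles are in a table, the question takes a few lines to ask.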
Nonetheless, Matthew Jockers believes that:
“We are reaching a tipping point. Today’s student of literature must be adept at gathering evidence from individual texts and equally adept at mining digital text repositories.”
(Teaching might change too: Jockers assigned more than 1,200 novels in one class. “Luckily for the students, they didn’t have to read them,” he says.)
Melissa Terras warns us that the Garbage In, Garbage Out principle applies:
“Even big data patterns need someone to understand them. And to understand the question to ask of the data requires insight into cultures and history.”
Can’t argue with that part.