· Very not-close encounters with 18 million books. Plus: What does the Lit Lab have to do with lit?

By ERIC HAND [Nature] – As a reader with a finite amount of time, [Erez] Lieberman Aiden likes to say, you pretty much have two choices. You can read a small number of books very carefully. Or you can read lots of books “very, very not-carefully”. Most humanities scholars abide by the former approach. In a process known as close-reading, they seek out original sources in archives, where they underline, annotate and cross-reference the text in efforts to identify and interpret authors’ intentions, historical trends and linguistic evolution. It’s the approach Lieberman Aiden followed for a 2007 paper in Nature. Sifting through old grammar books, he and his colleagues identified 177 verbs that were irregular in the era of Old English (around AD 800) and studied their conjugation in Middle English (around AD 1200), then in the English used today. They found that less-commonly used verbs regularized much more quickly than commonly used ones: ‘wrought’ became ‘worked’, but ‘went’ has not become ‘goed’. The study gave Lieberman Aiden a first-hand lesson in how painstaking a traditional humanities approach could be.

But what if, Lieberman Aiden wondered, you could read every book ever written ‘not-carefully’? You could then show how verbs are conjugated not just at isolated moments in history, but continuously through time, as the culture evolves. Studies could take in more data, faster. As he began thinking about this question, Lieberman Aiden realized that ‘reading’ books in this way was precisely the ambition of the Google Books project, a digitization of some 18 million books, most of them published since 1800. In 2007, he ‘cold e-mailed’ members of the Google Books team, and was surprised to get a face-to-face meeting with Peter Norvig, Google’s director of research, just over a week later. “It went well,” Lieberman Aiden says, in an understatement.

Continued at Nature
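For readers curious what reading 'not-carefully' might look like in practice, here is a minimal sketch of tracking a verb's regularization year by year. It is not the method used in the Nature study or by Google Books. The function name `regular_share`, the `(year, word, count)` record layout, and the counts in `toy_records` are all assumptions invented for illustration.

```python
from collections import defaultdict

def regular_share(records, regular, irregular):
    """Return {year: share of the regular form} for one verb.

    `records` is an iterable of (year, word, count) tuples. This layout is
    a hypothetical one chosen for the sketch, not a real dataset format.
    """
    totals = defaultdict(lambda: {regular: 0, irregular: 0})
    for year, word, count in records:
        # Ignore every word except the two competing past-tense forms.
        if word in (regular, irregular):
            totals[year][word] += count

    shares = {}
    for year in sorted(totals):
        both = totals[year][regular] + totals[year][irregular]
        if both:  # skip years in which neither form was counted
            shares[year] = totals[year][regular] / both
    return shares

# Invented placeholder counts, NOT real corpus figures.
toy_records = [
    (1800, "wrought", 3),
    (1800, "worked", 1),
    (1900, "wrought", 1),
    (1900, "worked", 3),
    (1900, "castle", 5),  # unrelated word, ignored
]

shares = regular_share(toy_records, regular="worked", irregular="wrought")
```

With these made-up counts, the share of 'worked' rises from 0.25 in 1800 to 0.75 in 1900. Counts drawn from a real corpus would take the place of the toy list.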

…but a closer encounter with ‘distant reading’.

By KATHRYN SCHULZ [New York Times] – We need distant reading, [Franco] Moretti argues, because its opposite, close reading, can’t uncover the true scope and nature of literature. Let’s say you pick up a copy of “Jude the Obscure,” become obsessed with Victorian fiction and somehow manage to make your way through all 200-odd books generally considered part of that canon. Moretti would say: So what? As many as 60,000 other novels were published in 19th-century England — to mention nothing of other times and places. You might know your George Eliot from your George Meredith, but you won’t have learned anything meaningful about literature, because your sample size is absurdly small. Since no feasible amount of reading can fix that, what’s called for is a change not in scale but in strategy. To understand literature, Moretti argues, we must stop reading books.

The [Stanford] Lit Lab seeks to put this controversial theory into practice (or, more aptly, this practice into practice, since distant reading is less a theory than a method). In its January pamphlet, for instance, the team fed 30 novels identified by genre into two computer programs, which were then asked to recognize the genre of six additional works. Both programs succeeded — one using grammatical and semantic signals, the other using word frequency. At first glance, that’s only medium-interesting, since people can do this, too; computers pass the genre test, but fail the “So what?” test. It turns out, though, that people and computers identify genres via very different features. People recognize, say, Gothic literature based on castles, revenants, brooding atmospheres, and the greater frequency of words like “tremble” and “ruin.” Computers recognize Gothic literature based on the greater frequency of words like . . . “the.” Now, that’s interesting. It suggests that genres “possess distinctive features at every possible scale of analysis.” More important for the Lit Lab, it suggests that there are formal aspects of literature that people, unaided, cannot detect. [Appended 26 June 2011.]

Continued at The New York Times
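The word-frequency approach described above can be sketched in a few lines. This is not the Lit Lab's program. It is a rough illustration that assumes a hand-picked list of common words (`FUNCTION_WORDS`), a nearest-centroid rule for choosing the genre, and a handful of invented one-line "novels" standing in for the 30 training texts. All of the function names are hypothetical.

```python
from collections import Counter
import math
import re

# Assumed feature set: a few very common English words, including "the".
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "was", "her", "his", "i"]

def profile(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def train(labeled_texts):
    """Average the profiles of each genre into one centroid per genre."""
    by_genre = {}
    for genre, text in labeled_texts:
        by_genre.setdefault(genre, []).append(profile(text))
    return {
        genre: [sum(column) / len(column) for column in zip(*profiles)]
        for genre, profiles in by_genre.items()
    }

def classify(text, centroids):
    """Pick the genre whose centroid is closest in Euclidean distance."""
    p = profile(text)
    return min(centroids, key=lambda genre: math.dist(p, centroids[genre]))

# Invented placeholder texts, not excerpts from real novels.
training = [
    ("gothic", "the ruin of the castle and the shadow of the tower"),
    ("gothic", "the ghost in the hall of the abbey"),
    ("bildungsroman", "i was born and i went to school and i learned"),
    ("bildungsroman", "i grew and i left my home to find my way"),
]
centroids = train(training)

guess_one = classify("the door of the vault in the crypt", centroids)
guess_two = classify("i was young and i went to town", centroids)
```

On these toy inputs the first guess comes out as gothic and the second as bildungsroman, driven by how often words like "the" and "i" appear rather than by castles or revenants. A real experiment would need full-length texts and a far larger word list.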

From A Stanford Lit Lab pamphlet: Take my Copperfield, please.

By MATTHEW JOCKERS and FRANCO MORETTI – This paper is the report of a study conducted by five people – four at Stanford, and one at the University of Wisconsin – which tried to establish whether computer-generated algorithms could “recognize” literary genres. You take David Copperfield, run it through a program without any human input – “unsupervised”, as the expression goes – and … can the program figure out whether it’s a gothic novel or a Bildungsroman? The answer is, fundamentally, Yes: but a Yes with so many complications that make it necessary to look at the entire process of our study. These are new methods we are using, and with new methods the process is almost as important as the results. [Appended 26 June 2011.]

Continued at The Stanford Literary Lab