
What good are the 5-grams?

Dec 23 2010

Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There's just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there's no reason to use the ngrams data rather than just downloading the original books, because:

  1. Ngrams are not complete; and

  2. Were they complete, they wouldn't offer significant computing benefits over reading the whole corpus.

Edit: let me intervene after the fact and change this from a rhetorical to a real question. Am I missing some really important research applications of the 5-grams in what follows? Another way of putting it: has the dump that Google did for the non-historical ngrams in 2006 been useful in serious research? I don't know, but I suspect it might have been.

Google only includes a word or phrase in ngrams if it appears a certain number of times: 40 books, or something like that. If you plug a random phrase from a book with only a few editions into ngrams, it shows up blank. When you get up to five words, you're losing a lot of information. So say, using Cohen's example, you're looking for marriage; you can use the two-grams to find "loving marriage" and "happy marriage," but if you want to find words a little farther away ("marriage between two equals," "marriage of two equals," "marriage between equals") the engine breaks down. Of those three, two of the individual formulations are too rare to show up in the data. Maybe they don't show up at all, but maybe they show up twenty times. No way to know. You really want to be able to look for "marriage" within five words of "equal" or "equals." (Mark Davies' COHA lets you do that, but only through a web interface without access to the underlying data, and I'm talking about raw data processing here, not nice web interfaces.) But there's no way to do that at all, unless you're looking at phrases that you know to be stereotyped: "the united states is" vs. "the united states are," etc.
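
On the raw texts, that kind of query is a few lines of scripting. Here is a minimal sketch in Python, assuming a hypothetical texts/ directory of plain-text public-domain books (the directory name and the five-word window are placeholders, not anything Google or COHA provides): it tallies every phrasing in which "marriage" falls within five words of "equal" or "equals," including variants too rare to clear Google's threshold.

```python
import re
from collections import Counter
from pathlib import Path

# Count how often "marriage" appears within five words of "equal" or
# "equals" across a folder of plain-text books, keeping the exact phrasing
# so rare variants still show up. "texts/" is a placeholder directory.
WINDOW = 5
phrases = Counter()

for path in Path("texts").glob("*.txt"):
    words = re.findall(r"[a-z]+", path.read_text(errors="ignore").lower())
    for i, w in enumerate(words):
        if w != "marriage":
            continue
        nearby = words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]
        if "equal" in nearby or "equals" in nearby:
            phrases[" ".join(words[max(0, i - WINDOW):i + 1 + WINDOW])] += 1

for phrase, n in phrases.most_common(20):
    print(n, phrase)
```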

Why doesn't Google just release all the 5-grams? First off, it might create copyright problems: a computer could take each of the strings and stitch them together ("to be or not to" overlaps with "be or not to be," etc.) until it had a whole book. For once, the genomics analogy is accurate: that's how they reconstructed the genome out of DNA fragments.
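
To see why the overlaps matter, here's a toy sketch of the stitching: keep extending a text with any gram whose first four words match the current last four. It assumes a single unbranching chain; coping with repeats and branches is exactly the hard part of real genome assembly.

```python
# Toy reconstruction from overlapping 5-grams: follow four-word overlaps
# until no gram continues the text. Assumes a single unbranching chain.
def stitch(grams):
    grams = [g.split() for g in grams]
    by_prefix = {tuple(g[:4]): g for g in grams}
    suffixes = {tuple(g[1:]) for g in grams}
    # start from a gram whose opening four words don't continue any other gram
    text = next(g for g in grams if tuple(g[:4]) not in suffixes)[:]
    while tuple(text[-4:]) in by_prefix:
        text.append(by_prefix.pop(tuple(text[-4:]))[-1])
    return " ".join(text)

print(stitch([
    "be or not to be",
    "to be or not to",
    "or not to be that",
]))  # -> "to be or not to be that"
```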

But say Google got around that by chopping off grams at sentence boundaries, or introducing error, or something. For every edition of Dickens, you'd get the following five-grams for the first sentence alone (spaces just to make the overlaps clearer):

           best of times it was
                of times it was the
it was the best of
                         it was the worst of
       the best of times it
                   times it was the worst
   was the best of times
                            was the worst of times

And they wouldn't even be sorted alphabetically; they'd probably be scattered across six files. If you wanted to find words that appear near "times," you'd have to scan six grams and 30 words just to find the 8 words that appear near it; and then you'd have to do some more processing to see that "was" appears three times, not five times, near "times," since some of them are double-counted.
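
Concretely, here is a sketch of that scan over the eight grams above: the raw tally credits "was" with five appearances alongside "times," and you only get back to the true figure by deduplicating against positions in the text.

```python
from collections import Counter

# Scan the overlapping 5-grams (instead of the original sentence) and
# tally the words that share a gram with "times". Overlapping grams repeat
# the same co-occurrence, so the raw tally over-counts.
grams = [
    "it was the best of",
    "was the best of times",
    "the best of times it",
    "best of times it was",
    "of times it was the",
    "times it was the worst",
    "it was the worst of",
    "was the worst of times",
]

neighbors = Counter()
for gram in grams:
    words = gram.split()
    if "times" in words:
        neighbors.update(w for w in words if w != "times")

print(neighbors["was"])  # 5 raw hits, though "was" only occurs near "times" 3 times
```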

It's doable, but it's a monstrously inefficient way of storing that sentence for word-collocation purposes. It works for Google's computing power and for creating graphs like the ones they do, but I suspect there are better ways to do it (indexes of word locations followed by local processing of sentences, or something) if your goal is to find the answer to one or a dozen questions, rather than to allow thousands of people to hit your servers with various requests.
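
As one sketch of that "indexes of word locations" idea (a toy, not a worked-out design): record each word's positions once, and a collocation query becomes a lookup over positions rather than a scan over redundant grams, with no double-counting to undo.

```python
from collections import defaultdict

# Index each word's positions in the text once; collocation queries then
# compare positions directly, so nothing gets double-counted.
sentence = "it was the best of times it was the worst of times"
index = defaultdict(list)
for pos, word in enumerate(sentence.split()):
    index[word].append(pos)

def near(target, other, window=5):
    """Count pairs where `other` falls within `window` words of `target`."""
    return sum(
        1
        for t in index[target]
        for o in index[other]
        if 0 < abs(t - o) <= window
    )

print(near("times", "was"))  # 3
```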

The corpus of pre-copyright texts is not _so_ large (the Ngrams American books from 1830 to 1922 look to be about 180 GB) that we can't do it on our laptops. And there are significant benefits to allowing this sort of research to be done by individuals, rather than forcing humanists into grant-chasing access wars just to get at books in the public domain.

So what's my agenda, then? I think we should hope for a few things:
A) For all the academic parties clamoring for open access, it's only the Internet Archive that's actually released texts (not PDFs) into the wild. Google, Hathi, individual libraries: everyone else is holding back on this. That's a shame.
B) The next best thing to full text is word counts for books. I'm sure this is a completely quixotic request, but I think text word counts should be public domain even for books in copyright; it's just an assemblage of statistics, after all. I have to pay Major League Baseball to watch video of games, but I don't have to pay them for box scores. I'd love that for copyright-era books. Probably not going to happen, though.
C) Something about the potential of the Zotero Commons for academic crowd-sourced texts and cataloging. I'm fuzzy from my cold; I can't finish this thought, though.

Anyway, I need to stop blogging about ngrams. It doesn't really bring out the best, I don't think.