
State of the Art/Science

Dec 18 2010

As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you're pretty much guaranteed an explosion of theories and methods.

Some of the theories are deeply interesting. I really like the censorship stuff. That really does deal with books specifically, not culture, so it makes a lot of sense to do with this dataset. The stuff about half-lives for celebrity fame and particularly for years is cool, although without strict genre controls and a little more context I'm not sure what it actually says: it might be something as elegiac as the article's "We are forgetting our past faster with each passing year," but there are certainly more prosaic explanations. (Say: 1) footnotes are getting more and more common, and 2) footnotes generally cite more recent years than does main text. I think that might cover all the bases, too.) Yes, the big ideas, at least the ones I feel qualified to assess, are a little fuzzier: it's hard to tell what to do with the concluding description of wordcounts as "a great cache of bones from which to reconstruct the skeleton of a new science," aside from marveling at the Brooksian-Freedmanian tangle of metaphor. (Sciences once roamed the earth?) But although a lot of the language of a new world order (have you seen the "days since first light" counter on their web page?) will rankle humanists, that fuzziness about the goals is probably good. This isn't quite sociobiology redux, intent on forcing a particular understanding of humanity on the humanities. It's just a collection of data and tools that they find interesting uses for, and we can too.

But it's the methods that should be more exciting for people following this. Google remains ahead of the curve in terms of both metadata and OCR, which are the stuff of which digital humanities is made. What does the Science team get?

It shows how far we've come on some of the old problems of the digital humanities. Natalie asked in the comments and on her blog about whether OCR was good enough for this kind of work. Well, the OCR on this corpus seems _great_. It makes me want Google to start giving Internet Archive-style access to their full-text files on public-domain books even more. To take a simple example: my corpus of 25,000 books with Internet Archive OCR has about 0.3% as many occurrences of "tlie" as "the". Ngrams has more like 0.025% (division not included). Putting together all the common typos, that's something like one error in a thousand instead of one in a hundred. (Although I wonder why Google doesn't run some data cleanup on their internal OCR; it's pretty easy to contextually correct some common "tli"s to "th"s, etc.) That might already be at the level where natural language processing becomes feasible. They did filter the texts to keep only ones with good OCR. But that's OK: it makes it easier to target the bad ones now that this team apparently has a good algorithm for identifying what they are.
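
As a rough illustration of what that kind of spot check looks like, here is a minimal sketch in Python against a hypothetical directory of plain-text files (the directory name is mine; this is not the actual ngrams pipeline):

```python
import re
from pathlib import Path

def ocr_error_proxy(text_dir):
    """Rough OCR-quality proxy: occurrences of 'tlie' as a fraction of 'the'."""
    tlie = the = 0
    for path in Path(text_dir).glob("*.txt"):
        tokens = re.findall(r"[a-z]+", path.read_text(errors="ignore").lower())
        tlie += tokens.count("tlie")
        the += tokens.count("the")
    return tlie / the if the else float("nan")

# "ocr_texts/" is a hypothetical directory of OCR'd books. On Internet
# Archive OCR the ratio above comes out near 0.003; the ngrams figure
# I quote works out closer to 0.00025.
print(ocr_error_proxy("ocr_texts/"))
```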

They get some pretty impressive results, too, out of the Google metadata. (Or as I like to call it, the secret Google metadata.) The researchers purged a lot of entries using the awesomely named "serial killer" algorithm. Despite its name and the protestations of the methodological supplement (pdf, maybe firewalled?), it doesn't look like it just eliminates magazines; by dropping out entries with no author, for example, it's probably just clearing out a lot of entries with bad metadata. (The part of the algorithm that cuts based on authors takes out more than ten times as many entries as the one that looks for publication info in the title field. BTW, they claim to have an appendix that describes the algorithm, but I can't find it on the Science site; any help?) The net result is that the metadata filtering seems to have worked quite well; the ngrams results for "Soviet Union" in the 19th century look better than the Google Books results. Using errors in Google Books is not a fair way to criticize ngrams. There are ways to break it, of course, and maybe I'll play some more with that later. But it's not bad.
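
To make the shape of that filtering concrete, here's a minimal sketch of the kind of rule I'm describing; the field names, patterns, and records are mine, not Google's or the supplement's:

```python
import re

# Toy metadata records; field names are mine, not Google's.
records = [
    {"title": "Middlemarch", "author": "George Eliot", "year": 1871},
    {"title": "Harper's Magazine, vol. 23 (1861)", "author": "", "year": 1861},
    {"title": "Annual Report 1895", "author": None, "year": 1895},
]

def looks_serial(rec):
    """Flag records the way I read the 'serial killer' step: no author listed,
    or publication info (volume/year patterns) showing up in the title field."""
    if not rec.get("author"):
        return True
    return bool(re.search(r"\bvol\.?\s*\d+|\b(1[5-9]\d{2}|20\d{2})\b",
                          rec["title"], re.I))

kept = [r for r in records if not looks_serial(r)]
print([r["title"] for r in kept])  # -> ['Middlemarch']
```

The point of the sketch is just that a no-author rule sweeps out far more than periodicals, which is consistent with the numbers in the supplement.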

There are certainly problems with their catalogue. They estimate that either 5.8% or 6.2% of their books (pages 5-6 of the supplement are internally inconsistent) are off by five or more years, and don't provide the figure for the percentage of books off by at least one year. The fact that language was miscategorized in 8% of books makes it questionable how good genre data would be, were Google to include it. That and the BISAC problem make me wonder if serious large-scale humanities research isn't really going to use HathiTrust, even if their corpus is significantly smaller.

But my mantra about digital history has always been: No one is in a position to be holier-than-thou about metadata. We all live in a sub-development of glass houses. That applies to pen-and-paper historians as much as it does to digital ones: if you've ever spent any time looking at archival files or oral histories, you've seen dozens of casual misdatings of letters and events that can be impossible to resolve. And if you've spent more than a day, you've certainly been tricked by one or two. The paper is perhaps not forthcoming enough about the failings (although Googler Orwant certainly has been elsewhere), but compared to what I'm working with, at least, they can be somewhat proud of themselves.

And the math is pretty neat. It's great to see the law of large numbers in effect on these textual corpora. I'm against the three-year smoothing, which most people don't seem to realize is going on, but the effort to apply similar patterns to large lists of people from Wikipedia is great. Things like standard half-life curves for fame are good as a way of testing claims of remarkableness, although using them to produce lists for future research is probably premature.
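
For anyone who hasn't noticed the smoothing: it amounts to a centered moving average over the yearly values, something like the sketch below (the window width here is illustrative, not necessarily what the viewer applies by default):

```python
def smooth(series, k=1):
    """Centered moving average: each year averaged with k neighbors on each side."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - k): i + k + 1]
        out.append(sum(window) / len(window))
    return out

yearly = [0.8, 1.2, 0.9, 3.0, 1.1, 1.0, 0.7]  # made-up relative frequencies
print(smooth(yearly, k=1))  # the spike at index 3 gets spread across its neighbors
```

That spreading is exactly what worries me: a one-year event starts to look like a three-year trend.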

Also: one thing I've found in looking at my own data is that for a lot of things we're interested in, the percentage of books that contain a word we want is at least as illustrative as the percentage of words. They capture different aspects of the use/mention spectrum, in some way. Example (using my data, not ngrams), with wordcounts:

And with percentage (per mille, technically) of books using the word:

The ngrams data dump has that information: it will be the next valuable thing they release.
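
For concreteness, here's a minimal sketch of the two measures side by side, assuming a made-up per-book count structure (the data, words, and names are mine, not from the ngrams release):

```python
from collections import Counter

# Hypothetical corpus: one Counter of token counts per book, grouped by year.
books_by_year = {
    1900: [Counter({"the": 5000, "evolution": 12}), Counter({"the": 4000})],
    1901: [Counter({"the": 4500, "evolution": 1}), Counter({"the": 3800, "evolution": 3})],
}

def word_share(word, books):
    """Occurrences of `word` as a share of all tokens (the wordcount measure)."""
    total = sum(sum(b.values()) for b in books)
    return sum(b[word] for b in books) / total if total else 0.0

def books_per_mille(word, books):
    """Books containing `word` at least once, per thousand books."""
    return 1000 * sum(1 for b in books if b[word] > 0) / len(books)

for year, books in books_by_year.items():
    print(year, word_share("evolution", books), books_per_mille("evolution", books))
```

The second measure damps the effect of a single book that uses a word obsessively, which is part of why I find it at least as illustrative.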

The difference, though, illustrates the most important thing about this data: there's no one right way to read it. Wordcounts give us millions of data points in hundreds of thousands of dimensions to compare against each other. It's up to researchers to figure out how to get those abstract forms projected into a two-dimensional image that actually tells us something. But it will never tell us everything, not even close. Looking at the computer screens, we're all in Plato's cave. Except, maybe, when we actually read the books.