You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Not included in ngrams: Tom Sawyer

Dec 19 2010

I wrote yesterday about how well the filters that remove some books from ngrams work: compared to Google Books, they improve the quality of both the year information and the OCR.

But we have no idea what books are in there. There's no connection from the data back to the texts.

I'm particularly interested in how they deal with subsequent editions of books. Their methodology (pdf) talks about multiple editions of Tom Sawyer. I think it says that they eliminate multiple copies of the same edition but keep editions from different years.

I thought I'd check this. There are about 5 occasions in Tom Sawyer where the phrase "Huck said" appears with separating quotes, and 11 for "said Huck". Both are phrases that basically appear only in Tom Sawyer in the 19th century (the latter also has a tiny life in legal contracts involving huckaback, and a few other places), so we can use them as a fair proxy for different editions. The first edition of Tom Sawyer was 1876: there are loads of later ones, obviously. Here's what you get from ngrams:

Three big spikes around 1900, and nothing before. Until about 1940, the ratio is roughly consistent with the internal usage in the book, 11 to 5, although "said Huck" is a little overrepresented, as we might expect. Note:

  • No edition of _Tom Sawyer_ shows up until 20 years after its first publication;

  • There's probably one edition apiece in 1899 and 1901, and two or three in 1903. Those are all around the authorized edition of Twain's works. Either they're catching multiple copies of that edition, or Tom Sawyer was just coming into the public domain (which, for those of you who don't know, is something that used to happen, like LPs or smallpox), which led them to rush out the collected edition. Mark Twain and copyright is such a popular issue that I can't find the answer right away. I talked at the end of this post about how hard it is to tell that Collected Works of Mark Twain, vol. 1 and Innocents Abroad are the same book. I find it a little reassuring that even Google seems to have the same problem. I've had some success using clustering based on patterns of word use.
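The edition-counting reasoning above is just division, and can be sketched in a few lines of Python. This is a rough illustration, not anything ngrams itself provides: it assumes you have raw per-year match counts for each phrase (the viewer plots relative frequencies, so the counts would have to come from the downloadable data), and the function names and the 25% tolerance are made up for the example.

```python
# Occurrences of each phrase in a single copy of Tom Sawyer,
# using the approximate counts given in the post.
PER_COPY = {"Huck said": 5, "said Huck": 11}


def estimate_copies(match_counts):
    """Estimate how many copies of the book a year contains.

    match_counts maps each phrase to its raw match count for one year.
    Each phrase gives its own estimate: matches / occurrences per copy.
    """
    return {phrase: match_counts[phrase] / PER_COPY[phrase] for phrase in PER_COPY}


def ratio_consistent(match_counts, tolerance=0.25):
    """Check whether the observed phrase ratio is near the book's own 11:5.

    tolerance is an arbitrary relative margin chosen for illustration.
    """
    expected = PER_COPY["said Huck"] / PER_COPY["Huck said"]
    observed = match_counts["said Huck"] / match_counts["Huck said"]
    return abs(observed - expected) / expected <= tolerance


# Hypothetical counts for one year, NOT real ngrams data.
hypothetical_year = {"Huck said": 10, "said Huck": 22}
print(estimate_copies(hypothetical_year))
print(ratio_consistent(hypothetical_year))
```

If the two per-phrase estimates agree and the ratio check passes, the matches in that year are plausibly whole copies of the book rather than stray uses of the phrase elsewhere.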

So what's the point? I know I said we shouldn't criticize based on metadata, but I'm equally irate at the idea that ngrams truly takes the temperature of American culture. Maybe not including Tom Sawyer as part of English, or English One Million, or English Novels is a good example of the shortcomings of this approach.

And then: Tom Sawyer _does_ show up in their American English sample before 1899.

American English is supposedly a subset of the English sample, but clearly that's not the case. Something's wrong here with the data they're presenting. It doesn't match their own description of it. That's always a bad thing.
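The subset claim can be tested mechanically: if American English really is a subset of English, no phrase should have more matches in the subset than in the full sample for any year. A minimal sketch, assuming per-year raw match counts for one phrase in each corpus (the function name and the numbers are hypothetical, not taken from the ngrams data):

```python
def subset_violations(subset_counts, superset_counts):
    """Return the years where the 'subset' corpus has more matches
    than the corpus that supposedly contains it.

    Both arguments map year -> raw match count for the same phrase.
    A year missing from the superset counts as zero matches.
    """
    return sorted(
        year
        for year, count in subset_counts.items()
        if count > superset_counts.get(year, 0)
    )


# Hypothetical counts for "said Huck", NOT real ngrams data.
american_english = {1890: 11, 1899: 11}
english = {1899: 11}
print(subset_violations(american_english, english))
```

Any year this returns is one where the data contradicts the description of the corpora.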

Any ideas what it is?

~~~~~~~~~~~~~

For the record, this works for other books with distinctive character names: Pilgrim's Progress, for example, is a little noisier:

Comments:

Ben - Dec 1, 2010

This is unrelated, but I wanted to post somewhere the ngram for 02138, which, before the invention of zip codes, shows what percentage of books have Harvard library bookplates in them. It falls off completely right in 1922.

Clipping Path - May 5, 2014

Very very good post. You've included all the great information in this post. Thanks a million for that. Cheers!