
Picking texts, again

Jan 28 2011

I'm trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I've been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I've avoided blogging the really boring stuff, but I'm going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

So. Say, for the reasons I gave in my rant about copyright, that you want to look at books from the period 1830 to 1922 (I chose those years as bounded on the one side by the full emergence of a domestic printing industry, and on the other by the copyright barrier). Say, as well, that you want books printed in the United States. And you want to do some text analysis. What are your options?

You'll start with Ngrams. Google Books has about 365,000 books in the American corpus, or about 36 billion words, for the period we're talking about. We all know the problems at this point: the web interface is simplistic, the underlying corpus is opaque, the capitalization treatment means that nothing is as simple as it seems. The interface is fast enough and clean enough that you'll forgive it that at first. But sooner or later, you're going to bump up against the corners of that world and want some more. You might try downloading some ngrams data, but a) it's _huge_ for the multi-grams, and b) you want to know what you're doing.
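
(To give a flavor of what "knowing what you're doing" means with the raw downloads: the 1-gram files are big tab-separated tables of word, year, and counts, and you have to fold case together yourself. A minimal sketch, assuming that column layout and a placeholder file name; check both against the release you actually download.)

```python
import gzip
from collections import defaultdict

# Placeholder file name: substitute whichever 1-gram file you actually download.
NGRAM_FILE = "googlebooks-eng-us-all-1gram-a.gz"

counts = defaultdict(int)   # (lowercased word, year) -> summed match count

with gzip.open(NGRAM_FILE, "rt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: ngram, year, match count, ...; verify against the release you use.
        word, year, match_count = fields[0], int(fields[1]), int(fields[2])
        # Fold "Providence", "PROVIDENCE", and "providence" together: the sort of
        # case decision the web interface quietly makes for you.
        counts[(word.lower(), year)] += match_count

# Example: case-folded counts for one word across the period.
for year in range(1830, 1923):
    print(year, counts.get(("providence", year), 0))
```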

Then you'll want to move on to Mark Davies' Corpus of Historical American English. COHA has 400 million words from 1810-2009, or about 190 million words for my period (assuming they're evenly distributed, which Davies says they are). That works out to about the equivalent of 2,000 books. If you want more detail, it's perfect. It's got semantic tagging and generally much better metadata than Google (although not without flaws: if people were half as interested in picking that apart as they are in ngrams, they'd find all sorts of funny examples comparable to the _Producers_ script taking the stage direction "to Bialystock" as a verb).
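
(The 190 million is just a proportional estimate, for what it's worth; here's the arithmetic, with an assumed round 100,000 words per book for the book-equivalent.)

```python
# COHA: ~400 million words spread evenly (per Davies) over 1810-2009.
total_words = 400_000_000
total_years = 2009 - 1810 + 1        # 200 years
period_years = 1922 - 1830 + 1       # 93 years

period_words = total_words * period_years / total_years
print(f"~{period_words / 1e6:.0f} million words in 1830-1922")   # ~186 million

# At an assumed ~100,000 words per book, that's on the order of 2,000 books.
print(f"~{period_words / 100_000:.0f} book-equivalents")         # ~1,860
```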

That is to say, ngrams and COHA are about two orders of magnitude apart from each other. ("All the books ever written," whatever that means, is probably about an order of magnitude or two larger than ngrams, based on their numbers.) Conveniently, the corpus I've been using is exactly in the middle, logarithmically speaking: it's about 27,000 books, and 2,718,572,631 words. (Better described as "about 3 billion words"; all of these counts depend on a bunch of caveats and assumptions.) To put it graphically:

[Chart comparing the sizes of COHA, my corpus, and the Google Books American corpus.]

(FTR this is only size, not a Venn diagram: COHA includes a number of sources that ngrams doesn't, such as periodicals. Incidentally, MONK, which I haven't played around with at all because the texts don't seem to be ones I'm interested in on initial inspection, seems to be roughly on the COHA order of magnitude and with the same quality of metadata.)
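
(A quick check on "exactly in the middle, logarithmically speaking": the geometric mean of COHA's share for the period and the ngrams American corpus lands almost exactly on my word count.)

```python
import math

coha = 190e6           # COHA words for 1830-1922, estimated above
ngrams = 36e9          # Google Books American corpus words for the same period
mine = 2_718_572_631   # my corpus

print(f"geometric mean: {math.sqrt(coha * ngrams):,.0f}")   # ~2.6 billion
print(f"my corpus:      {mine:,}")                          # ~2.7 billion
```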

I could get bigger than that if I wanted, based on the archive.org texts. There are a lot there; it might be possible to build a base set even larger than ngrams. If someone buys me a nicer computer, maybe I will. Or to keep this in the second person: you could. The major constraint with all of this is just time, since lots of queries scale linearly with size (and I routinely run processes that take several hours to run) and some, like pairwise comparison, scale quadratically. MySQL drags its feet when returning tens of thousands of results from a table with hundreds of millions of records; some of my most basic queries currently take about 2 seconds to run, and I don't want them to get any slower.
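
(To put numbers on that scaling point: anything done book-by-book grows with the size of the corpus, but anything done pair-by-pair grows with its square, which is why even 27,000 books gets uncomfortable.)

```python
import math

for n_books in (2_000, 27_000, 365_000):
    pairs = math.comb(n_books, 2)    # number of distinct pairs of books
    print(f"{n_books:>8,} books: {n_books:>8,} per-book passes, {pairs:>15,} pairwise comparisons")
```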

[Incidentally, just to keep my copyright kick going for a second: one unexpected side-effect of wanting to deal with all these public-domain books is that if, hypothetically, I had had a large number of non-public-domain files on my hard drive, I would have had to throw them out to make room for the new, public-domain files. Another argument that widening copyright laws will help the RIAA!]

But back to my project. I think there's actually a lot to be said for slightly smaller corpuses (i.e., on the COHA/MONK/me scale), if we know why we're choosing them. There's something to be said for samples if, and this is important, only if they're wisely chosen. The funniest part of the culturomics FAQ is, pretty obviously, this:

10. Why didn't you account for how many copies of each book were printed?

Because that is totally impossible.

They're right. It's a total pipe dream, never going to happen. But there might be other ways to get closer. After all, it's a question because behind it is an important point: not all books are created equal. Some are more popular than others. Whether that's because they shaped the culture or because they represented it particularly well isn't really the point. (Although I do admit, I've been thinking about finding what books from 1880 or so best match the typical lexical profile of a book from 1900, just to see what turns up.) What matters is that a perfect system (which we'll never have) wouldn't be based around equally weighting every edition of every book.

I mentioned before the bar charts Bob Darnton collects and criticizes in _The Literary Underground of the Old Regime_. They are based on private library records for individuals, to try to get at what people owned, not just what's survived. I think that public/academic library copies can be an OK, if not a good, proxy for original publication runs in certain fields (like the psychology texts I'm interested in) as well. In the age of the Internet, we can get a decent idea of publication runs from existing copies found on the web. (I spent rather too much time reading Jürgen Espenhorst's book about 19th-century German atlases last year, and I'm pretty sure that extant copies do seem to correlate a bit with original runs where he has data on both.) If you could cut a deal with OCLC, there might be a case right now for weighting publications by number of surviving library copies, or limiting your corpus to books that exist in more than four research libraries, or something like that, just to filter out all the self-published nonsense in the nineteenth century.
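
(If OCLC ever did share holdings data, the filter itself would be trivial. A sketch, assuming a purely hypothetical CSV with an identifier and a count of research libraries holding each edition; the file name and column names are made up.)

```python
import csv

MIN_LIBRARIES = 4   # keep books held by more than this many research libraries

# Hypothetical input file: one row per edition, columns "identifier", "title", "holdings".
with open("holdings.csv", newline="", encoding="utf-8") as f:
    keep = [row for row in csv.DictReader(f) if int(row["holdings"]) > MIN_LIBRARIES]

print(f"{len(keep)} editions survive the holdings filter")
```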

In any case: I'm not planning on doing that. But I do want to keep limiting my sets to about 30,000 books, mostly because I don't think my computer can handle much more. I want to create a few special ones: one with everything I can find in psychology, education, and advertising, for instance, since that's what I study. But there's still a good use for a general-purpose corpus like Ngrams or COHA. I could do that just by randomly sampling out of all the English-language, US-published texts I have, but since books aren't created equal, I don't see a compelling reason to just take them randomly. (One good option would be to segment for better OCR quality, though, which is part of what Culturomics did.)
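
(I don't know exactly how Culturomics did its OCR filtering, but one crude heuristic of the sort I have in mind is to score each text by the share of its tokens that show up in a wordlist and drop the worst; the wordlist and directory names here are placeholders.)

```python
import re
from pathlib import Path

# Placeholder wordlist: any big list of English words, one per line.
DICTIONARY = set(Path("wordlist.txt").read_text(encoding="utf-8").lower().split())
TOKEN = re.compile(r"[A-Za-z]+")

def ocr_score(path):
    """Share of alphabetic tokens that are dictionary words: a rough proxy for OCR quality."""
    tokens = TOKEN.findall(Path(path).read_text(encoding="utf-8", errors="ignore"))
    return sum(t.lower() in DICTIONARY for t in tokens) / len(tokens) if tokens else 0.0

# Keep texts where at least 80% of tokens look like real words (the threshold is arbitrary).
texts = sorted(Path("texts").glob("*.txt"))
good = [p for p in texts if ocr_score(p) >= 0.80]
print(f"{len(good)} of {len(texts)} texts pass the OCR filter")
```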

Instead, I'm deeply attached to the stopgap I started this whole project with: limiting my result sets by publisher. I should say right now: I love this idea. It was only by thinking to do it that I became convinced the Internet Archive stores were navigable at all. I think using literary publishers, or popular publishers, does a great job of selecting the books that editors thought would sell well, or thought were important, and it excludes certain classes of books (government reports, some later academic texts) that were published for reasons entirely different from the belief that anyone would read them. But what's the best way to do it going forward?

My plan is to switch over to Open Library metadata. Open Library is supposed to be the Wikipedia of libraries, sponsored as a sort of front-end to the Internet Archive. But no one's really interested in writing library entries the way they are in writing encyclopedia entries. Do you remember early Wikipedia, when most articles were either verbatim from the 1911 Britannica or written by Perl scripts culling census data? That's basically what Open Library is, with library catalog information. I'm going to assume they're better aggregators than I would be. And the data itself is much better than I initially thought.
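
(For the record, the bulk editions dump Open Library publishes is, as far as I can tell, a tab-separated file with the full record as JSON in the last column; the field names below are ones I've seen in edition records, but treat the exact layout and file name as things to verify against the dump you download.)

```python
import gzip
import json

# Placeholder file name: whichever ol_dump_editions_*.txt.gz you fetch from Open Library.
DUMP = "ol_dump_editions.txt.gz"

editions = []
with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        # Assumed layout: type, key, revision, last_modified, then the record as JSON.
        record = json.loads(line.rstrip("\n").split("\t")[-1])
        editions.append({
            "key": record.get("key"),
            "title": record.get("title"),
            "publishers": record.get("publishers", []),
            "date": record.get("publish_date"),
            "call_number": (record.get("lc_classifications") or [None])[0],
            "ia_id": record.get("ocaid"),   # Internet Archive identifier, if a scan exists
        })

print(f"{len(editions)} edition records parsed")
```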

Open Library data has some problems, though. So I'm going to occasionally supplement it with metadata from a) HathiTrust, which has, presumably, better catalog data from very good libraries, and b) the Internet Archive, which does have some metadata fields Open Library seems to lack, including, critically, the volume number (Open Library makes a total mess of multi-volume works) and the original library contributor for a scanned text.
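
(The Internet Archive fields I mean live in each item's _meta.xml file. A sketch of pulling volume and contributor for one identifier; the element names are what I've seen on real items, but not every item has them, and the identifier in the usage line is made up.)

```python
import urllib.request
import xml.etree.ElementTree as ET

def ia_metadata(identifier):
    """Fetch an Internet Archive item's _meta.xml and pull out a couple of fields."""
    url = f"https://archive.org/download/{identifier}/{identifier}_meta.xml"
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())
    # "volume" and "contributor" are elements I've seen; not every item has them.
    return {"volume": root.findtext("volume"), "contributor": root.findtext("contributor")}

# Usage, with a made-up identifier (it will 404 unless you substitute a real one):
# print(ia_metadata("someinternetarchivebook00auth"))
```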

Given that, I'm trying to build two corpuses at the moment. I built up a master catalog of every book in the Open Library catalog that a) either has a Library of Congress-style call number in Open Library or has one I can match from Hathi records, b) has a text file on the Internet Archive, and c) doesn't seem to be a duplicate of another book I'm already getting. A is important because I find I really care about genre. B is necessary because neither Google nor Hathi currently makes it possible to download thousands of text files. C is a huge problem, and can't be solved except by actually comparing the texts of books computationally to see if they are duplicates. I suspect, actually, that ngrams didn't do enough of this; the example I've given before is that one librarian lists a book as "Works of Mark Twain, vol. 1" while another lists it as "Innocents Abroad." Only extreme metadata scrubbing is going to catch that problem. (Although I think the Ngrams method might actually exclude the first version, because it regards anything with volumes as a serial.) Anyway, duplicate books are the problem I most worry about skewing the data: they really mess with correlations, they create frequency outliers, and they generally make the data less trustworthy.
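
(For the duplicate problem, the comparison itself doesn't have to be fancy. One standard approach, sketched here, is Jaccard overlap on word five-gram "shingles," which would flag the Twain example as a near-duplicate no matter what the catalog says; the file names are placeholders.)

```python
import re
from pathlib import Path

TOKEN = re.compile(r"[a-z]+")

def shingles(path, n=5):
    """Set of word n-grams ("shingles") from a plain-text file."""
    words = TOKEN.findall(Path(path).read_text(encoding="utf-8", errors="ignore").lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets: 1.0 means identical, 0.0 means disjoint."""
    return len(a & b) / len(a | b) if a and b else 0.0

# Placeholder file names for two candidate duplicates.
sim = jaccard(shingles("works_of_mark_twain_vol1.txt"), shingles("innocents_abroad.txt"))
print(f"shingle overlap: {sim:.2f}")   # anything above, say, 0.5 deserves a closer look
```

Running that over every pair of 27,000 books is, of course, exactly the quadratic job I was complaining about above, so in practice you would hash the shingles (minhash or the like) rather than compare full sets.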

I want to take that catalog to build, first, a general-purpose list using a select set of publishers like I did before. This will be a larger list of publishers than before, so that a) I get a few houses west of the Susquehanna and b) I get enough books from the more limited set of Internet Archive books that have Open Library pages. Once I settle on a list, I'll do a post breaking down a little bit of that.
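
(Once the editions are parsed, the publisher cut is just a whitelist match on the publisher strings; the names and records below are illustrative stand-ins, not my actual list, and the matching is deliberately crude.)

```python
# Illustrative stand-ins only; the real list will be longer and chosen more carefully.
PUBLISHERS = {"harper", "appleton", "houghton", "scribner", "lippincott"}

# Fake edition records in the shape produced by the dump-parsing sketch above.
editions = [
    {"title": "Example A", "publishers": ["Harper & Brothers"], "ia_id": "examplea00smith"},
    {"title": "Example B", "publishers": ["Privately printed"], "ia_id": None},
]

def publisher_match(edition):
    """True if any of the edition's publisher strings contains a whitelisted name."""
    return any(name in publisher.lower()
               for publisher in edition.get("publishers", [])
               for name in PUBLISHERS)

general_corpus = [e for e in editions if publisher_match(e) and e["ia_id"]]
print(f"{len(general_corpus)} editions from the selected publishers with Archive texts")
```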

Then, like I said, I also want to take all the books in English (American, British, Indian, whatever) in the fields of psychology, physiology, education, and advertising and put them in a corpus. Cross-linguistic comparisons are nearly impossible, but I might try French and German for these fields too, just to see if there's any way to get it working. That should let me run a couple of the finer-grained analyses I'm interested in: what characterized the emergence of psychology from philosophy and physiology in the late 19th century, where and when characteristic language from psychology permeated into education and vice versa, and so on. That's more for actually writing something for a presentation in March at MAW, but maybe a little will sneak onto this blog.
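
(This is where the call numbers earn their keep: the subject corpus is basically a prefix match on the Library of Congress classification, BF for psychology, QP for physiology, L for education, and HF for the commerce range that includes advertising. The records below are made-up examples in the same shape as the earlier sketches.)

```python
import re

# LC class prefixes for the fields I care about: psychology (BF), physiology (QP),
# education (L), and the commerce range that contains advertising (HF).
SUBJECT_PREFIXES = ("BF", "QP", "L", "HF")

def subject_match(call_number):
    """True if a Library of Congress call number starts with one of the target classes."""
    if not call_number:
        return False
    letters = re.match(r"[A-Z]+", call_number.strip().upper())
    return bool(letters) and letters.group(0).startswith(SUBJECT_PREFIXES)

# Made-up records in the same shape as the earlier sketches.
catalog = [
    {"title": "Principles of Psychology", "call_number": "BF121 .J2"},
    {"title": "A Latin Grammar", "call_number": "PA2087 .A5"},
]
subject_corpus = [b for b in catalog if subject_match(b["call_number"])]
print([b["title"] for b in subject_corpus])   # only the psychology title survives
```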