You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Collocation

Nov 07 2010

A collection as large as the Internet Archive's OCR database means I have to think through what I want well in advance of doing it. I'm only using a small subset of their 900,000 Google-scanned books, but that's still 16 gigabytes; it takes a couple of hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I'll still have to figure out what sort of relations are worth looking for. So what are some?

Collocation is the most obvious one: what words appear disproportionately with other words? The Corpus of Historical American English (COHA) has a good implementation of these, as well as simpler word counts, on a 400-million-word dataset (about a tenth as many words per year as what I'm using right now). I might want to think later about some of the differences between my set and that one, which was made for linguists. For now, I'll notice that the metadata isn't perfect (although, glass houses). For example, one of its demonstrations is about verbs that increase in use between the 1930s and 1970s, but in the top three new verbs is "to bialystock," apparently entirely based on stage directions involving Max Bialystock in the screenplay for The Producers. Maybe I'll use "Bialystocking" to mean drawing erroneous inferences from random changes in small percentages of a text corpus. A person would never Bialystock; a computer will do it all the time.

Anyhow. COHA uses collocation in terms of word distance: what appears within six words of a term? I think two other ways will be better for me:

  1. Sentence collocation: what words appear in the same sentence? I've already implemented this. It's for finding, say, varying verbs used with my noun, "attention."

  2. Text collocation: what words appear in the same book? I haven't implemented it yet, and it will take a _lot_ more time to run for any given word, but this might be better for looking at cases like the "evolution" one, which I'll go into a little more later. It lets us trace, say, whether Darwin is mentioned in the 1860s in books using scientific terms, political terms, etc. Maybe that fall-off in "evolution" is because it becomes less current in books that use the word "biology" a lot, but remains more used in books that use the word "economics." That would tell us something about diffusion through different fields. Of course, the defining of fields is a big problem in itself, on which more later.
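
To make the three flavors of collocation concrete (word distance, same sentence, same book), here is a minimal Python sketch. It is not the code I actually run: the function names, the crude tokenizer, and the naive sentence splitter are all assumptions made for illustration, and it only produces raw co-occurrence counts. Judging whether a word appears "disproportionately" would still mean comparing these counts against a baseline word count.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive splitter: break on whitespace that follows ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def window_collocates(tokens, target, span=6):
    """Word-distance collocation: count tokens within `span` positions
    on either side of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - span)
            counts.update(tokens[lo:i] + tokens[i + 1:i + 1 + span])
    return counts


def unit_collocates(units, target):
    """Sentence or text collocation: for each word, count how many units
    (token lists for sentences or whole books) contain it alongside `target`."""
    counts = Counter()
    for tokens in units:
        vocab = set(tokens)
        if target in vocab:
            vocab.discard(target)
            counts.update(vocab)
    return counts


# Toy stand-in for a corpus; the real one is thousands of OCR'd books.
books = {
    "book_a": "Darwin wrote on evolution. The evolution of species drew attention.",
    "book_b": "The economy grew. Trade drew little attention.",
}

sentences = [tokenize(s) for text in books.values() for s in split_sentences(text)]
whole_books = [tokenize(text) for text in books.values()]

near_evolution = window_collocates(tokenize(books["book_a"]), "evolution", span=2)
with_attention = unit_collocates(sentences, "attention")   # sentence level
with_evolution = unit_collocates(whole_books, "evolution")  # book level
```

The same `unit_collocates` function serves both sentence and book collocation; only the size of the unit changes, which is also why the book-level version is so much more expensive on a real corpus: every book containing the target contributes its entire vocabulary.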