
Measuring word collocation, part II

Nov 26 2010

This is the second post on ways to measure connections (or, more precisely, distances) between words by looking at how often they appear together in books. These posts are a little dry, and the payoff doesn't come for a while, so let me remind you of the payoff (after which you can bail on this post). I'm trying to create some simple methods that will work well with historical texts to see relations between words: what words are used in similar semantic contexts, what groups of words tend to appear together. First I'll apply them to the isms, and then we'll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence "How often, compared to what we would expect, does a given word appear with any other given word?" into different components. Now let's look at the central, and maybe most important, part of the question: how often do we expect words to appear together?

I don't think there's any one best way, to be honest, but I want to get this written down somewhere. Note that part of the challenge here is that I'm trying to use the information in how many times "Darwinism," say, appears in a book that also includes the word "Presbyterianism," not just in how many books the two appear together. This means wordcounts done on a limited sample of books, made possible by that database I've been talking so much about but that I don't find myself using as much as I'd like.

One way is to compare the frequency with which a word appears against a different sample: for example, the proportions I did looking at words that appear with "scientific method," or the comparisons that the CoHAE produces between decades. (I think that's how CoHAE works, at least.) This is good when comparing two samples, but you need to have a baseline to compare against. When starting with something like the isms in my dataset as a whole, there's no larger group. And it may be that in the future we'll want to look at a particular corpus (novels, say) without using the language of science and cookery and everything else I have now as the standard to judge it by.
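As a rough illustration of that kind of comparison, here is a minimal sketch of a frequency ratio between a target sample and a baseline sample. This is not the code behind the earlier posts; the dictionaries and totals are hypothetical stand-ins for whatever counts you have on hand.

```python
# A toy sketch, not the actual analysis code: compare how often a word
# appears in one sample against a baseline sample. `target_counts` and
# `baseline_counts` are hypothetical dicts of word -> raw count; the
# totals are the total number of words in each sample.
def frequency_ratio(word, target_counts, target_total,
                    baseline_counts, baseline_total):
    target_rate = target_counts.get(word, 0) / target_total
    baseline_rate = baseline_counts.get(word, 0) / baseline_total
    if baseline_rate == 0:
        return float("inf")  # word never appears in the baseline
    return target_rate / baseline_rate
```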

So we need to figure out, for each of our words, how often we expect it to appear in the same book as each other word. To answer this, we'll need to choose an assumption about the distribution of words. I tried two ways, and feel better about the second.

  1. We could assume an even distribution of our word across texts. That is, we would assume that each word appears randomly across texts, and then multiply the expected number of books containing "Presbyterianism" by the average number of times we'd expect "Darwinism" to appear per book. This assumes that words are not at all lumpy, as I described it last week, which we know is generally not true. (A rough version of this calculation, alongside option 2's, is sketched in code after this list.)

  2. We could take as given the lumpiness of any given word, and use what we know to adjust our expectations. This should strip information about which words are more lumpy from the final data, and focus it more on the actual links. This might be a bad idea: that lumpiness data, I was just arguing earlier, tells us something important about how restricted the scope of a word's meanings is. But I'm going to try it, because the expectations the computer comes out with for words of vastly different commonness seem skewed to me.

  3. An in-between way, which only occurs to me now, is to project the typical lumpiness at any given word count (that is, use that red loess regression line I plotted on the bookcounts vs. wordcounts graph) for a word, to get something in between these two. My blogging is a little behind my playing with the data right now, but I may try that if I come back to this method at some point.
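To make options 1 and 2 concrete, here is a minimal sketch of how the two baselines might be computed. This is one plausible reading of the descriptions above, not the actual code behind these posts: `wordcounts`, `bookcounts`, and `n_books` are hypothetical stand-ins for the counts in the database, and the Poisson approximation for the "even distribution" case is my own assumption.

```python
# A minimal sketch (not the post's actual code) of two baselines for the
# expected co-occurrence of two words. Assumes: `wordcounts`, a dict mapping
# each word to its total count across the corpus; `bookcounts`, a dict
# mapping each word to the number of books it appears in; and `n_books`,
# the total number of books in the sample. All names are hypothetical.
import math

def expected_books_if_even(total_count: int, n_books: int) -> float:
    """Expected number of books containing a word if its occurrences were
    scattered at random (Poisson approximation): option 1's assumption
    that the word is not lumpy at all."""
    return n_books * (1 - math.exp(-total_count / n_books))

def expected_cooccurrence(word_a: str, word_b: str,
                          wordcounts: dict, bookcounts: dict,
                          n_books: int, use_lumpiness: bool) -> float:
    """Expected count of word_b inside the books that contain word_a.

    use_lumpiness=False: option 1 -- pretend word_a is spread evenly,
    so estimate its bookcount from its total count alone.
    use_lumpiness=True: option 2 -- take word_a's observed bookcount
    as given, stripping its lumpiness out of the comparison.
    """
    if use_lumpiness:
        books_with_a = bookcounts[word_a]
    else:
        books_with_a = expected_books_if_even(wordcounts[word_a], n_books)
    # Average occurrences of word_b per book across the whole sample.
    b_per_book = wordcounts[word_b] / n_books
    return books_with_a * b_per_book
```

In this sketch the only difference between the two options is where the bookcount for the first word comes from: an estimate that ignores lumpiness, or the observed, lumpy figure.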