You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Links between words

Nov 23 2010

Ties between words are one of the most important things computers can tell us about language. I already looked at one way of measuring connections between words in talking about the phrase "scientific method": the percentage of occurrences of a word that occur with another phrase. I've been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated that I never posted anything from Russia, because I couldn't get the whole system in order in my time here. So instead, I want to take a couple of posts to break down a simple sentence and think about how we could statistically measure each component. Here's the sentence:

How often, compared to what we would expect, does a given word appear with any other given word?

In doing the math, we have to work from the back to the front, so this post is about the last part of the sentence: What does it mean to appear with another word?

There's actually not much new on this part of the sentence; the real meat comes with the earlier questions. I dealt with most of it in my post on collocation. But let me just review:

A word can appear with another word:

  • Within a certain word radius: I've used 6 before, which I think is what the Corpus of Historical American English uses, and seems reasonable. One would be a good number, too, particularly if we stripped out prepositions and articles; then we'd get mostly associated adjectives and verbs for any given noun. This data is very hard to store, and has to be recreated for each word.

  • Within the same sentence. I used this a bit with the Scientific Method stuff, and I anticipate using it more when I bring the focus back to attention a bit. It's good for common words and general questions about language. My perl parsing scripts aren't perfect, so they tend to chop up some initials into sentences, and the OCR probably misses some periods. Like word radius, it's hard to store for a large number of words.

  • Within the same book. This is what I'm mostly using now, because a) books are a better container for subject matter than sentences for rarer words, which include most of the isms; and b) it's just small enough to fit on my computer, although the queries take a while. It doesn't work as well for truly common words, and the fact that books are of different lengths creates some problems that I could compensate for but don't.

  • Within the same year. I put this in just to point out that semantic and historical categories aren't completely separate: when I analyzed the trend lines for the isms, I found a lot of semantic similarities as well. If I'd used year-by-year spikes, there probably would have been more. Any other wrapper could provide interesting information along these lines: genre, publisher, city of publishing, etc. My data is worse on those, though. Author would be a particularly interesting one to use.
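To make the difference between these containers concrete, here is a minimal sketch in Python (not the perl scripts mentioned above). The function names, the tokenizer, and the naive sentence splitter are all assumptions made for illustration; they will mis-split initials and abbreviations in the same way described in the sentence bullet. The word-radius version counts every neighboring token, while the container version counts each sentence, book, or year at most once per word.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (an assumed, crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def window_cooccurrences(tokens, target, radius=6):
    """Count tokens that fall within `radius` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - radius)
            # Words before the target, then words after it (the target itself is skipped).
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + radius])
    return counts

def container_cooccurrences(containers, target):
    """For each word, count how many containers hold both that word and `target`.

    `containers` is a list of token lists: one per sentence, per book, or per
    year, depending on which definition of "appearing with" is wanted.
    """
    counts = Counter()
    for tokens in containers:
        vocab = set(tokens)
        if target in vocab:
            counts.update(vocab - {target})
    return counts
```

For sentence-level counts, pass `[tokenize(s) for s in split_sentences(text)]` as `containers`; for book-level counts, pass one token list per book. Because the container version only records presence or absence, a long book and a short one each contribute a single count, which is one face of the book-length problem noted above.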

So that's all. I'm using "within the same book" right now, but there are other options. Next, I'll talk about how often we'd expect two words to appear together.