You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Posts with tag collocation


Dec 26 2010

Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and did a couple of examples using months and some abstract nouns. Two of the problems I've had with getting useful data out of this approach are:

Nov 28 2010

In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it's possible to create a dendrogram, which is a sort of family tree. Looking around the literary studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven't seen them used much to classify words, though, and it works fairly well. I thought it might help answer Hank's question about the difference between evolutionism and darwinism, but, as you'll see, that distinction seems to be a little too fine for now.
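As a rough illustration of how a dendrogram can be built from pairwise distances, here is a minimal sketch in plain Python. It uses average-linkage agglomerative clustering, which is one common choice and not necessarily the method behind the post. The function name `average_linkage`, the extra isms, and every distance value are made-up placeholders, not results from the data.

```python
from itertools import combinations


def average_linkage(distances, labels):
    """Repeatedly merge the two closest clusters.

    `distances` maps unordered label pairs (stored as tuples) to a distance.
    Returns the merges in order as (left_cluster, right_cluster, distance);
    read top to bottom, this is the dendrogram from leaves to root.
    """
    def dist(a, b):
        # Pairs are stored once, so look the pair up in either order.
        return distances[(a, b)] if (a, b) in distances else distances[(b, a)]

    def cluster_distance(pair):
        # Average distance over all cross-cluster word pairs.
        left, right = pair
        total = sum(dist(a, b) for a in left for b in right)
        return total / (len(left) * len(right))

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((left, right, cluster_distance((left, right))))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


# Hypothetical distances, invented purely for illustration.
labels = ["darwinism", "evolutionism", "socialism", "communism"]
distances = {
    ("darwinism", "evolutionism"): 0.1,
    ("socialism", "communism"): 0.2,
    ("darwinism", "socialism"): 0.9,
    ("darwinism", "communism"): 0.8,
    ("evolutionism", "socialism"): 0.85,
    ("evolutionism", "communism"): 0.95,
}

merges = average_linkage(distances, labels)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 3))
```

With these placeholder numbers the two closest words join first and the two resulting pairs join last, which is the family-tree shape a dendrogram draws. Two words whose distance is very small relative to the rest will merge at nearly the same height, which is one way a distinction can end up too fine to see.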

Nov 27 2010

What can we do with this information we've gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I've gathered. Hank asked earlier in the comments about the difference between Darwinism and evolutionism, so:

Nov 26 2010

Now to the final term in my sentence from earlier: "How often, compared to what we would expect, does a given word appear with any other given word?" Let's think about "how much more often." For a while I thought this was more complicated than it is, so this post will be short and not very important.
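One simple way to express "how much more often" is the ratio of the observed co-occurrence count to the expected one. The sketch below assumes that framing; the post excerpt does not say which measure it settles on, and the function names and counts are hypothetical.

```python
import math


def association_ratio(observed, expected):
    """How many times more often two words co-occur than expected."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


def log2_ratio(observed, expected):
    """Same comparison on a log scale: 0 means exactly as expected,
    positive means more often, negative means less often."""
    ratio = association_ratio(observed, expected)
    return math.log2(ratio) if ratio > 0 else float("-inf")


# Hypothetical counts, for illustration only.
ratio = association_ratio(observed=30, expected=10.0)
print(ratio)
print(log2_ratio(observed=40, expected=10.0))
```

The log version is often easier to compare across word pairs, since "twice as often" and "half as often" sit at equal distances from zero.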

Nov 26 2010

This is the second post on ways to measure connections (or, more precisely, distance) between words by looking at how often they appear together in books. These are a little dry, and the payoff doesn't come for a while, so let me remind you of the payoff (after which you can bail on this post). I'm trying to create some simple methods that will work well with historical texts to see relations between words: what words are used in similar semantic contexts, what groups of words tend to appear together. First I'll apply them to the isms, and then we'll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence "How often, compared to what we would expect, does a given word appear with any other given word?" into different components. Now let's look at the central, and maybe most important, part of the question: how often do we expect words to appear together?
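A common baseline for "how often do we expect" is to assume the two words are scattered across books independently of each other. The sketch below uses that assumption, which may differ from the expectation the full post arrives at; the function name and all counts are hypothetical.

```python
def expected_cooccurrence(books_with_a, books_with_b, total_books):
    """Expected number of books containing both words if the presence of
    one word told us nothing about the presence of the other."""
    if total_books <= 0:
        raise ValueError("total_books must be positive")
    # P(a) * P(b) * total_books, simplified.
    return books_with_a * books_with_b / total_books


# Hypothetical counts, for illustration only.
expected = expected_cooccurrence(books_with_a=200, books_with_b=50, total_books=1000)
print(expected)
```

Here a word found in 200 of 1,000 books and another found in 50 of them would be expected to share 10 books by chance alone, so any observed count is read against that figure.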

Nov 23 2010

Ties between words are one of the most important things computers can tell us about language. I already looked at one way of looking at connections between words in talking about the phrase "scientific method": the percentage of occurrences of a word that occur with another phrase. I've been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated that I never posted anything from Russia, because I couldn't get the whole system in order in my time here. So instead, I want to take a couple of posts to break down a simple sentence and think about how we could statistically measure each component. Here's the sentence:

Nov 07 2010

A collection as large as the Internet Archive's OCR database means I have to think through what I want well in advance of doing it. I'm only using a small subset of their 900,000 Google-scanned books, but that's still 16 gigabytes; it takes a couple of hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I'll still have to figure out what sort of relations are worth looking for. So what are some?