Posts with tag collocation
Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and did a couple of examples using months and some abstract nouns. Two of the problems I’ve had with getting useful data out of this approach are:
In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it’s possible to create a dendrogram, which is a sort of family tree. Looking around the literary studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven’t seen many that cluster words rather than works, though, and it works fairly well. I thought it might help answer Hank’s question about the difference between evolutionism and darwinism, but, as you’ll see, that distinction seems to be a little too fine for now.
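A minimal sketch of how a tree like this can be grown from pairwise distances, using average-linkage agglomerative clustering. The linkage rule is an assumption, not necessarily the one behind the dendrogram in this post, and the distance values below are invented purely to make the example run; they are not measurements from the corpus.

```python
from itertools import combinations


def build_tree(labels, dist):
    """Merge the two closest clusters repeatedly; return a nested tuple.

    dist maps frozenset({a, b}) to the distance between words a and b.
    The distance between two clusters is the average pairwise distance
    between their members (average linkage, an assumed choice).
    """
    # Each cluster is (subtree, members); start with one leaf per word.
    clusters = [(label, (label,)) for label in labels]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            total = sum(dist[frozenset((a, b))]
                        for a in members_i for b in members_j)
            d = total / (len(members_i) * len(members_j))
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy distances, made up for illustration only.
labels = ["darwinism", "evolutionism", "socialism", "communism"]
dist = {frozenset(pair): 0.9 for pair in combinations(labels, 2)}
dist[frozenset(("darwinism", "evolutionism"))] = 0.1
dist[frozenset(("socialism", "communism"))] = 0.2

print(build_tree(labels, dist))
```

Words that merge early sit close together on the tree. Two terms whose distance is small relative to everything else, as evolutionism and darwinism are in this toy input, pair off immediately, and the tree says nothing further about how they differ.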
What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between “Darwinism” and “evolutionism,” so:
Now to the final term in my sentence from earlier: “How often, compared to what we would expect, does a given word appear with any other given word?” Let’s think about how much more often. I thought this was more complicated than it is for a while, so this post will be short and not very important.
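One simple reading of “how much more often” is the ratio of the observed count to the expected count. The sketch below assumes the unit of counting is the book and that the expected count comes from treating the two words as independent; both are assumptions for illustration, and the function and parameter names are hypothetical.

```python
import math


def cooccurrence_ratio(observed_both, n_a, n_b, n_books):
    """Observed joint count divided by the count expected under independence.

    n_a and n_b are the numbers of books containing each word, and
    observed_both is the number of books containing both.
    """
    expected_both = n_a * n_b / n_books
    if expected_both == 0:
        raise ValueError("expected count is zero, so the ratio is undefined")
    return observed_both / expected_both


def log2_ratio(observed_both, n_a, n_b, n_books):
    """Same comparison on a log scale, so 'twice' and 'half' are symmetric."""
    ratio = cooccurrence_ratio(observed_both, n_a, n_b, n_books)
    return math.log2(ratio) if ratio > 0 else float("-inf")


# Hypothetical counts: 1000 books, word A in 200, word B in 50, both in 40.
print(cooccurrence_ratio(40, 200, 50, 1000))
print(log2_ratio(40, 200, 50, 1000))
```

A ratio above 1 means the pair turns up together more often than chance would suggest, and a ratio below 1 means less often. The log version is often easier to compare across pairs.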
This is the second post on ways to measure connections—or more precisely, distance—between words by looking at how often they appear together in books. These are a little dry, and the payoff doesn’t come for a while, so let me remind you of the payoff (after which you can bail on this post). I’m trying to create some simple methods that will work well with historical texts to see relations between words—what words are used in similar semantic contexts, what groups of words tend to appear together. First I’ll apply them to the isms, and then we’ll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence “How often, compared to what we would expect, does a given word appear with any other given word?” into different components. Now let’s look at the central, and maybe most important, part of the question—how often do we expect words to appear together?
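As a baseline for that question, here is a minimal sketch of the expectation under statistical independence: if two words were scattered across books with no relation to each other, the number of books containing both would be the product of their individual counts divided by the total number of books. Counting by book and assuming independence are both assumptions made for the example, and the names are hypothetical.

```python
def expected_cooccurrence(n_a: int, n_b: int, n_books: int) -> float:
    """Expected number of books containing both words if they were independent.

    P(A) = n_a / n_books and P(B) = n_b / n_books, so the expected joint
    count is n_books * P(A) * P(B) = n_a * n_b / n_books.
    """
    if n_books <= 0:
        raise ValueError("n_books must be positive")
    return n_a * n_b / n_books


# Hypothetical counts: word A in 200 of 1000 books, word B in 50 of them.
print(expected_cooccurrence(200, 50, 1000))
```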
Ties between words are one of the most important things computers can tell us about language. I already looked at one way of measuring connections between words in talking about the phrase “scientific method”--the percentage of occurrences of a word that occur with another phrase. I’ve been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated–I never posted anything from Russia because I couldn’t get the whole system in order in my time here. So instead, I want to take a couple posts to break down a simple sentence and think about how we could statistically measure each component. Here’s the sentence:
A collection as large as the Internet Archive’s OCR database means I have to think through what I want well in advance of doing it. I’m only using a small subset of their 900,000 Google-scanned books, but that’s still 16 gigabytes–it takes a couple hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I’ll still have to figure out what sort of relations are worth looking for. So what are some?
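For a sense of what the baseline count involves, here is a minimal sketch that streams text line by line and keeps a running tally, so the whole corpus never has to sit in memory. The tokenizer (lowercase runs of a to z) and the function name are assumptions for illustration, not the setup actually used here.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z]+")


def top_words(lines, n):
    """Return the n most common words from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Crude tokenizer: OCR noise and punctuation are simply dropped.
        counts.update(WORD.findall(line.lower()))
    return counts.most_common(n)


# Works on any iterable of strings, including an open file handle.
print(top_words(["the cat and the dog", "The end"], 1))
```

Since every relation worth testing means another pass like this over the full text, it pays to decide what to count before starting.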