
Comparing Corpuses by Word Use

Oct 06 2011

Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning's log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.

What are some interesting, large corpuses to compare? A lot of what we'll be interested in historically are subtle differences between closely related sets, so a good start might be the two Library of Congress subject classifications called "History of the Americas," letters E and F. The Bookworm database has over 20,000 books from each group. What's the difference between the two? The full descriptions could tell us; but as a test case, it should be informative to use only the texts themselves to see the difference.

That leads to a tricky question. Just what does it mean to compare usage frequencies across two corpuses? This is important, so let me take this quite slowly. (Feel free to skip down to Dunning if you just want the best answer I've got.) I'm comparing E and F: suppose I say my goal is to answer this question:

What words appear the most times more in E than in F, and vice versa?

There's already an ambiguity here: what does "times more" mean? In plain English, this can mean two completely different things. Say E and F are exactly the same overall length (e.g., each has 10,000 books of 100,000 words). Suppose further "presbygational" (to take a nice, rare, American history word) appears 6 times in E and 12 times in F. Do we want to say that it appears two times more (i.e., use multiplication), or six more times (use addition)?

It turns out that neither of these simple operations works all that well. In the abstract, multiplication probably sounds more appealing; but it turns out to catch only extremely rare words. In our example set, here are the top words that distinguish E from F by multiplication: occurrences in E divided by occurrences in F. For example, "gradualism" appears 61x more often in E than in F.

daimyo        aftre    exercitum intransitive      castris        1994a
   114.00000    103.00000    101.00000     82.33333     81.66667     77.00000
       sherd        infti   gradualism  imperforate      equitum       brynge
    71.71429     66.00000     61.00000     59.33333     57.00000     56.00000

(BTW, I simply omit the hundreds of words that appear in E but never appear in F; and I don't use capitalized words because they tend to be _very_ highly concentrated and, in fictional works in particular, can cause very strange results. Yes, that's not the best excuse.)
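For concreteness, here's a minimal R sketch of this multiplicative measure. The data frame `words`, with one row per word and columns `e` and `f` for the raw counts in each corpus, is a hypothetical stand-in for the Bookworm counts, not something the site exposes directly:

    # Hypothetical setup: one row per word, raw counts in each corpus.
    # words <- data.frame(word = ..., e = ..., f = ...)

    # The multiplicative measure: occurrences in E divided by occurrences in F.
    words$ratio <- words$e / words$f

    # Show the top of the list, dropping words that never appear in F
    # (their ratio would be infinite), as in the table above.
    finite <- words[words$f > 0, ]
    head(finite[order(-finite$ratio), c("word", "ratio")], 12)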

So what about addition? Compensating for different corpus sizes, it's also pretty easy to find the number of extra occurrences beyond what we'd expect based on the other corpus. (For example, "not" appears about 1.4 million more times than we'd expect in E, given the number of times it appears in F and the total number of words in E.)

to      that       the       not       had        it
3432895.3 2666614.4 2093465.8 1427220.8 1360559.0 1342948.2
       be   general        we       but       our     would
1208340.5  990988.4  974849.0  841842.6  819680.5  798426.0
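Continuing the same hypothetical R sketch, the additive measure is just observed-minus-expected, with the expectation scaled by the relative sizes of the two corpuses:

    # Total corpus sizes, to compensate for E and F not being the same length.
    total.e <- sum(words$e)
    total.f <- sum(words$f)

    # The additive measure: occurrences observed in E, minus the number we'd
    # expect if the word appeared at its F rate in a corpus the size of E.
    words$excess <- words$e - words$f * (total.e / total.f)
    head(words[order(-words$excess), c("word", "excess")], 12)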

Clearly, neither of these is working all that well. Basically, the first group are so rare they don't tell us much; and the second group, with the intriguing addition of "general," are so common as to be uninformative. (Except maybe for "our"; more on that later.) Is there any way to find words that are interesting on _both_ counts?

I find it helpful to do this visually. Suppose we make a graph. We'll put the addition score on the X axis, and the multiplication one on the Y axis, and make them both on a logarithmic scale. Every dot represents a word, as it scores on both of these things. Where do the words we're talking about fall?
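(In the running R sketch, that chart is roughly the following; since log axes only take positive values, this version shows just the words overrepresented in E.)

    # Each point is a word: additive score on the x axis, multiplicative
    # score on the y axis, both on logarithmic scales.
    overrep <- words[words$f > 0 & words$excess > 0, ]
    plot(overrep$excess, overrep$ratio, log = "xy", pch = ".",
         xlab = "extra occurrences in E beyond expectation (additive score)",
         ylab = "ratio of occurrences, E over F (multiplicative score)")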

This nicely captures our dilemma. The two groups are in opposite corners, and words don't ever score highly on both. For example, "general" appears about 1,000,000 occurrences more in class E than we'd expect from class F, but only about 1.8x as often; "sherd" appears about 60x more often in class E, but that adds up to only 500 extra words compared to expectations, since it's a much rarer word overall.

(BTW, log-scatter plots are fun. Those radiating spots and lines on the left side have to do with the discreteness of our set; a word can appear 1 time in a corpus or twice in a corpus, but it can't appear 1.5 times. So the leftmost line is words that appear just once in F: the single point farthest left, at about (1.1, 2.0), is words that appear twice in E and once in F; a little above it to the right is words that appear three times in E and once in F; directly to its right are words that appear four times in E and twice in F; etc.)

One possible solution would be to simply draw a line between "daimyo" and "that," and assume that words are interesting to the degree that they stick out beyond that line. That gives us the following word list, placed on that same chart:


which is a lot better. The words are specific enough to be useful, but common enough to be mostly recognizable. Still, though, the less frequent words seem less helpful. Are "sherd" and "peyote" and "daimyo" up there because they really characterize the difference between E and F, or because a few authors just happened to use them a lot? And why assume that "that" and "daimyo" are equally interesting? Maybe "that" actually _is_ more distinctive than "daimyo," or vice versa.
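(For reference, the "stick out beyond the line" score can be sketched in the same hypothetical R setup as distance above the line through the two anchor words in log-log space; this is my reconstruction of the idea, not the exact calculation behind the chart.)

    # Log coordinates for the words shown in the plot above.
    lx <- log(overrep$excess)
    ly <- log(overrep$ratio)

    # The two anchor words from the text.
    p1 <- which(overrep$word == "that")
    p2 <- which(overrep$word == "daimyo")
    slope <- (ly[p2] - ly[p1]) / (lx[p2] - lx[p1])

    # Signed distance above the line through the anchors: words that "stick
    # out beyond that line" get the highest scores.
    overrep$line.score <- (ly - (ly[p1] + slope * (lx - lx[p1]))) / sqrt(1 + slope^2)
    head(overrep[order(-overrep$line.score), "word"], 25)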

To put it more formally: words to the left tend to be rarer (for a word to have 100,000 more occurrences than we'd expect, it has to be quite common to begin with); and there are a lot more rare words than common words. So by random chance, we'd expect to have more outliers on the top of the graph than on the bottom. By using Bookworm to explore the actual texts, I can see that "daimyo" appears so often in large part because Open Library doesn't recognize that these two books are the same work. Conversely, that "our" appears 20% more often in E than in F is quite significant; looking at the chart, it seems to actually hold true across a long period of time. If this is a problem with 20,000 books in each set, it will be far worse when we're using smaller sets.

That would suggest we want a method that takes into account the possibility of random fluctuations for rarer words. One way to do this is a technique called, after its inventor, Dunning's log-likelihood statistic. I won't explain the details, except to say that, like our charts, it uses logarithms, and that it is much closer to our addition measure than to the multiplication one. On our E vs F comparison, it turns up the following word-positions (in green) as the 100 most significantly higher in E than F:

Dunning's log-likelihood uses probabilistic statistics to approximate a chi-square test; as a result, the words it identifies tend to come from the most additively over-represented, but it also gives some credit for multiplication. All of the common words from our initial set of 12 additive words, and none of the rare ones, are included. It includes about half of the words my naive straight-line method produced: even "skirmisher," which seemed to clump with the more common words, isn't frequent enough for Dunning to privilege it over a blander word like "movement."
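For readers who do want the innards: here is the usual two-corpus formulation of the statistic in the running R sketch. This follows the standard corpus-linguistics presentation of Dunning's G-squared; Bookworm's own implementation may differ in details.

    # Dunning's log-likelihood (G-squared) for one word:
    #   a, b: counts of the word in corpuses E and F;
    #   total.a, total.b: total words in each corpus.
    dunning <- function(a, b, total.a, total.b) {
      e1 <- total.a * (a + b) / (total.a + total.b)  # expected count in E
      e2 <- total.b * (a + b) / (total.a + total.b)  # expected count in F
      term <- function(obs, expected) ifelse(obs == 0, 0, obs * log(obs / expected))
      2 * (term(a, e1) + term(b, e2))
    }

    words$dunning <- dunning(words$e, words$f, total.e, total.f)

    # The score is high for words over- OR under-represented in E, so to get
    # words "significantly higher in E" we also require a positive excess.
    higher.in.e <- words[words$excess > 0, ]
    head(higher.in.e[order(-higher.in.e$dunning), "word"], 30)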

Is this satisfying? I should maybe dwell on this longer, because it really matters. Dunning's is the method that seems to be most frequently used by digital humanities types, but the innards aren't exactly what you might think. In MONK, for example, the words with the highest Dunning scores are represented as bigger, which may lead users to think Dunning gives a simple frequency count. It's not: it's fundamentally a probability measure. We can represent it like it has to do with frequency, but it's important to remember that it's not. (Whence the curve on our plot.)

Ultimately, what's useful is defined by results. And I think that the strong showing of common words can be quite interesting. This ties back to my point a few months ago that stopwords carry a lot of meaning in the aggregate. If I didn't actually find the stopwords useful, I'd be more inclined to put some serious effort into building my own log-difference comparison like the straight line above; as it is, I'm curious if anyone knows of some good ones.

As for results, here's what Dunning's test turns up in series E and in series F, limiting ourselves to uncapitalized words among the 200,000 most common in English:

Significantly overrepresented in E, in order:

[1] that general army enemy

[5] not slavery to you

[9] corps brigade had troops

[13] would our we men

[17] war be command if

[21] slave right it my

[25] could constitution force what

[29] wounded artillery division government

Significantly overrepresented in F, in order:

[1] county born married township

[5] town years children wife

[9] daughter son acres farm

[13] business in school is

[17] and building he died

[21] year has family father

[25] located parents land native

[29] built mill city member

At a first pass, that looks like local history versus military history.

At a second pass, we'll notice "constitution" and "government" in E and "he" and "parents" in F, and realize that F might include biographies as well as local histories, and that E probably includes a lot of legal and other forms of national histories as well. The national words might not have turned up in my straight-line test, which seemed intent on finding all sorts of rarer military words ("skirmishers," for example).

Looking at the official LC classification definition (pdf), that turns out to mostly be the case. (Except for biography: that ought to mostly be in E. That it isn't is actually quite interesting.) So this is reasonably good at giving us a sense of the differences between corpuses as objectively defined. So far, so good.

But these lists are a) not engaging, and b) don't use frequency data. How can we fix that? I never thought I'd say this, but: let's wordle! Wordle in general is a heavily overrated form of text analysis; Drew Conway has a nice post from a few months ago criticizing it because it doesn't use a meaningful baseline of comparison, and uses spatial arrangement arbitrarily. Still, it's super-engaging, and possibly useful. We can make use of the Dunning data here to solve the first problem, though not the second. Unlike in a normal Wordle, where size is frequency, here size is Dunning score: and the word clouds are paired, so each one represents two ends of a comparison. Here's a graphic representing class E:


And then Class F:

(We could also put them together and color-code like MONK does, but I think it's easier to get the categories straight by splitting them apart like this.) One nice thing about this is that the statistical overrepresentation of "county" in class F really comes through.
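(If you want to roll your own version of these clouds, one hypothetical route in the same R sketch is the wordcloud package, sizing words by Dunning score rather than frequency; that package choice is my assumption, not necessarily how the graphics above were made.)

    # Size words by Dunning score instead of raw frequency.
    library(wordcloud)
    top <- head(higher.in.e[order(-higher.in.e$dunning), ], 100)
    wordcloud(as.character(top$word), top$dunning)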

On some level, this is going to seem unremarkable: we're just confirming that the LC description does, in fact, apply. But a lot of interesting thoughts can come from the unlikely events in here. For example, "our" and "we" are both substantially overrepresented in the national histories as opposed to the local histories. (BTW, I should note somewhere that both E and F include a fair number of historical _documents_, speeches, etc., as well as histories themselves. Here, I'm lumping them all together.) There's no reason this should be so: local histories are often the most intensely insular.

Is there a historical pattern in the first-person plural? Bookworm says yes, emphatically--in a quite interesting way. Uses of "we" and "us" are similar across E and F in the early republican period, and then undergo some serious wiggling starting around the Civil War; that leads to a new equilibrium around 1880, with E around its previous height and F substantially lower.

Now, there doesn't have to be an interesting historical explanation for this. Maybe it's just about memoirs switching from F to E, say. But there might be: we could use this sort of data as a jumping-off point for some explorations of nation-building and sectionalism. For example, clicking on the E results around 1900 gives the books that use the words "we" and "our" the most. One thing I find particularly interesting there is the presence of many books that, by their titles at least, I'd categorize as African-American racial uplift literature. (That's a historian's category, of course, not a librarian's one.) If we were to generalize that, it might suggest the rise of several forms of authorial identification with national communities (class, race, international, industrial) in the late nineteenth century, and a corresponding tendency to not necessarily see local history as first-person history. Once we start to investigate the mechanics of that, we can get into some quite sophisticated historical questions about the relative priority of different movements in constructing group identities, connections between regionalism in the 1850s vs. (Northern?) nationalism in the 1860s, etc.

We aren't restricted to questions where the genres are predefined by the Library of Congress. There is a _lot_ to do with cross-corpus comparisons in a library as large as the Internet Archive collection. We can compare authors, for example: I'll post that bit tomorrow.

This isn't stuff that Martin and I could integrate into Bookworm right away, unfortunately. It simply takes too long. The database driving Bookworm can add up the counts for any individual word in about half a second; it takes more like two minutes to add up all the words in a given set of books. For a researcher, that's no time at all; but for a website, it's an eternity, both from the user end (no one will wait that long for data to load) and from the server end (we can't handle too many concurrent queries, and longer queries mean more concurrent ones).

But Wordle clouds and UI issues aside, the base idea has all sorts of applications I'll get into more later.

Comments:


Anonymous - Oct 5, 2011

This is brilliant stuff. I've been taking the value of Dunning's more or less on faith. By inspecting the formula I could see that it struck some kind of compromise between the measures of difference you're characterizing as multiplicative and additive, but graphing that as a log scatterplot is a fabulous idea. It makes visible exactly what kind of compromise the formula in practice strikes.

I'm very much of the opinion that, as you say, what's useful is defined by results, and I've sometimes felt that the Dunning results I get from MONK emphasize stopwords more than I want (perhaps just because I'm not good at interpreting them). I might be a tiny bit tempted to tinker with the Dunning algorithm by varying the base that gets used for the log function. It's supposed to be log base e, but by fiddling with the base I bet you could stretch the green region a bit toward the y axis.