You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Lumpy words

Nov 17 2010

p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px Helvetica; min-height: 14.0px}

What are the most distinctive words in the nineteenth century? Thats an impossible question, of course. But as I started to say in my first post about bookcounts, [link] we can find something usefulthe words that are most concentrated in specific texts. Some words appear at about the same rate in all books, while some are more highly concentrated in particular books. And historically, the words that are more highly concentrated may be more specific in their meaningsat the very least, they might help us to analyze genre or other forms of contextual distribution.

Because of computing power limitations, I cant use all of my 200,000 words to analyze genreI need to pick a subset that will do most of the heavy lifting. Im doing this by finding outliers on the curve of words against books. First, Ill show you that curve again. This time, Ive made both axes logarithmic, which makes it easier to fit a curve. And the box shows the subset Im going to zoom in on for classificationdropping a) the hundred most common words, and b) any words that appear in less than 5% of all books. The first are too common to be of great use (and confuse my curve fitting), and the second are so rare Id need too many of them to make for useful categorization.

(more after the break)

OK, so thats what weve had so far. Now lets look at that chunk. Im going to use loess to fit a line to it, and select the words that are the greatest outliers from that line. Red shows words who are distributed across fewer booksmore lumpilythan wed expect, and blue shows words that are distributed more smoothly than wed expect.

Just at youd think, theres a lot more variation on the side of lumpiness than of smoothness. Were only going to look at the lumpy words from now. Lets zoom in further so we can see the individual words.

Not bad. Its a good sign when words as closely related as plaintiff and defendant appear right next to each other. You can also books using archaic pronouns, latin, religious language, and a variety of scientific and medical terms. Plus some colloquialismsany work with a lot of aints is probably a popular novel, though the reverse isnt true you and dont fall into this category as well.

What are the top 20 lumpily-distributed words, you ask?

plaintiff  defendant defendants      urine        bie     mucous   testator       acid   membrane

3.778093   3.467491   3.362822   3.172946   2.852449   2.840315   2.659488   2.630043   2.607742

adv  abdominal        fig        tov     glands  diagnosis        que    bladder     kidney

2.537471   2.482434   2.457205   2.442361   2.432046   2.426817   2.379019   2.360535   2.320224

quae    statute

2.266063   2.262361

This raises two points:

  1. Fig? Whats special about figs?

  2. Theres way too much medical stuff in here. If this method will help find words than can be useful in classifying, its going to require a better way to see which words are different _from each other_, not just from the bulk of words in the language. If we could just load all the words in at once, that would be easy. But as it is now, its going to be trickywe might want to, say, take the 250 most lumpily distributed words, and then take the 50 of those that seem to sort on the most different axes If the database gets running, it could probably take care of itself running on the computer overnight.


Comments:

Looking at this again, I realize fig.

Ben - Dec 6, 2010

Looking at this again, I realize fig. is a figure in a book with illustrations. Im embarrassed it took me so long to figure that out.