You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com.

Bookcounts are in

Nov 11 2010

I now have counts for the number of books a word appears in, as well as the number of times it appeared. Just as I hoped, it gives a new perspective on a lot of the questions we looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I'm just going to call them bookcounts; I hope that's a clear enough phrase.)

Roughly, wordcounts are how many times a word is said in my library, and bookcounts are how many different authors are saying it, or how many different readers are reading it. (Given prolific authors, multiple authors, etc., that's not quite true, but it's still an OK way to think about it.) Since this is a quantitative blog, let's start with a chart. Here are the two different counts simply plotted against each other (please someone e-mail me if these image files don't come through as well as the earlier ones):

Each of the 200,000 points is a word: from "the," all the way up in the upper right-hand corner, to a whole morass of typos and words we've all forgotten, down in the lower left. The red line is the theoretical minimum: a word appearing in exactly as many books as its word count.* This is abstract, I know, so let's add some of the words we've already been analyzing to the chart to humanize it a little. (By the time we get through with this, my little linguistic studies of the last few entries will have taken us back to history, I promise.)

This is starting to tell us something. You can get a sense, looking at all the points, of what the overall curve is. (If you actually know the name of the distribution, please please e-mail me right away; I want to fit a line to it, and I'm at the limits of my knowledge. I might just do another loess, but there's definitely some logarithmic or polynomial or something function that will work better.)

Take railroad, evolution, and efficiency. All appear in roughly the same number of books: railroad in just over 10,000, evolution and efficiency in just under. But they appear dramatically different numbers of times: railroad appears 3.7 times as often as efficiency (note that the y axis is on a log scale). In the places where railroads are discussed, they get mentioned a lot. For the record, telegraph is at the same place as efficiency on the scale, so it's not about technologies vs. concepts or anything like that.
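To make the two counts concrete, here is a minimal Python sketch on a made-up three-book library. The book names and texts are invented for illustration, and the whitespace tokenizer is an assumption; it is not necessarily how the counts in this post were produced.

```python
from collections import Counter

# Hypothetical toy library: book id -> text (invented for illustration).
library = {
    "railway_manual": "the railroad and the railroad bridge and the railroad yard",
    "travel_diary": "the railroad took us to the coast",
    "essay": "the efficiency of the mind",
}

def count_words_and_books(books):
    """Return (wordcounts, bookcounts) for a dict of book id -> text."""
    wordcounts = Counter()   # total occurrences across the whole library
    bookcounts = Counter()   # number of distinct books containing the word
    for text in books.values():
        tokens = text.lower().split()   # naive whitespace tokenizer (assumption)
        wordcounts.update(tokens)
        bookcounts.update(set(tokens))  # each book counts at most once per word
    return wordcounts, bookcounts

wordcounts, bookcounts = count_words_and_books(library)

# Print wordcount, bookcount, and average uses per book containing the word.
for word in ("the", "railroad", "efficiency"):
    per_book = wordcounts[word] / bookcounts[word]
    print(word, wordcounts[word], bookcounts[word], round(per_book, 2))
```

Because every book that contains a word contributes at least one occurrence, a wordcount can never be smaller than the matching bookcount, which is the theoretical minimum the red line marks.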

So what is it? Well, here's another way of thinking about the difference between the two. There are seven words that appear between 181,500 and 182,000 times in my library; here's how many different books they appear in.

absolutely  rapid  occasionally  province  india  colonies  railroad
24081       23437  22217         16776     16665  12128     10676

The first three are simply common words. The last four, though, are more interesting. Would anyone, besides the most committed imperial historian, have been willing to bet that colonies or railroad is as common a word as occasionally? I doubt it. It's certainly not true anymore. And even back then, any given writer was more likely to pull absolutely out of his vocabulary than colonies. But for historians, the last four words are far more interesting. The imperial overtones are mostly a coincidence, I think, but it's clear that words may be, semantically, just more interesting as they occur in fewer books.
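One rough way to put a number on that concentration is to divide wordcount by bookcount, giving the average number of uses per book that contains the word. The sketch below does this for the seven words in the table. The exact wordcounts are not given, only the band of 181,500 to 182,000, so the midpoint 181,750 is used as a stand-in for all seven; treat the results as approximate.

```python
# Bookcounts copied from the table above.
bookcounts = {
    "absolutely": 24081,
    "rapid": 23437,
    "occasionally": 22217,
    "province": 16776,
    "india": 16665,
    "colonies": 12128,
    "railroad": 10676,
}

# Assumption: every word gets the midpoint of the 181,500-182,000 band,
# because the exact wordcounts are not listed in the post.
ASSUMED_WORDCOUNT = 181_750

# Average uses per book, counting only books that contain the word.
uses_per_book = {
    word: ASSUMED_WORDCOUNT / books for word, books in bookcounts.items()
}

# Most concentrated words first.
for word, rate in sorted(uses_per_book.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:<13}{rate:5.1f}")
```

Under that assumption, railroad comes out at roughly 17 uses per book that mentions it, against roughly 7.5 for absolutely.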

So why would words be concentrated that way? I can think of four reasons off the top of my head: please add some more in the comments.

  • Year. If you squint, you can see that most of the words we've been looking at are above and to the left of the main curve. That's partly because we've picked out historically interesting words, which means that there are a lot of years where they aren't used at all. I could make a chart like the above one, but only for the year 1910, say. If I did, I think much of the distinctiveness of a word like efficiency would be lost, since it was a common vocabulary word then, unlike in 1850.

  • Monograph-icity. I'm not sure what to call this one, but it's important. One of the things that makes a word like railroad stand out is that people actually write books about railroads, which use the term all the time. On the other hand, no one could write a book about rapid, and it's hard to write a book about provinces in general. Mentions of a word will cluster in particular books based on how intensively you can write about the concept it describes.

  • Genre. This is one that I'm really interested in. Some words are fairly specific to particular genres, and could even be used computationally to try to classify books according to genre. Science is one up there that's relevant, but see below for a few more.

  • Data Problems. See below: tlie and tbe, our two most common typos, stick out on this chart. I'd guess that's because they appear disproportionately in books printed in certain fonts. We could probably, in fact, use this data to reconstruct the font families of many of these books based on the OCR data. Maybe that's my next blog.

About that genre thing: let me just leave you with one more version of the big chart at the top. Let's zoom in on the top right corner, and only look at words that appear in over 18,000 books. And instead of representing each word with a point, we'll actually put the words in. You look at the words that stick out to the left and above the main line, and then I'll give my take.

The most common ones are interesting: "you," "her," "my," "she," "your." "You," for example, is one of the most common words in the language, but there are about 1000 books that don't use it even once. Clearly, this has to do with genre and stylistic conventions (certain authors never address their reader), and that's going to be disproportionately common in certain genres. Likewise, "she" and "her." There are a lot of books that never once have a feminine reference.

But the numbers on words like that are driven up _a lot_ somewhere, because the words are still very common. For "she" and "you" in particular, I'd wager that it's the fiction that really restores women and the second person to their proper place in the language. I bet we could use some words like that as markers to judge what books are probably novels. (Maybe I'll get more into how to do that, with some combination of clustering and principal components analysis, later.) I think this could contribute to discussions about femininity and the novel. It might even be possible (I won't do this, but maybe someone knows an English Ph.D.?) to extract everything within quotation marks in the presumptive novels, to analyze novelists' changing representations of demotic speech. Wouldn't that be fascinating?
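As a much simpler stand-in for the clustering or principal components approach, one could start by measuring how often a handful of marker words turn up in each book. This is only a sketch: the marker list, the sample texts, and the idea of ranking by rate per 1,000 tokens are all assumptions made for illustration, not a tested classifier.

```python
import re

# Assumed marker words, taken from the pronouns that stick out in the chart.
MARKERS = {"you", "your", "she", "her", "my"}

def marker_rate(text, markers=MARKERS):
    """Return marker-word occurrences per 1,000 tokens of text."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude letters-only tokenizer
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in markers)
    return 1000 * hits / len(tokens)

# Hypothetical snippets standing in for whole books.
books = {
    "novel": "She looked at her brother. You have my word, she said.",
    "treatise": "The government of the province taxed the colonies heavily.",
}

rates = {title: marker_rate(text) for title, text in books.items()}

# Under this crude rule, books with the highest marker rate are the likeliest novels.
for title, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(title, round(rate, 1))
```

On real books the rates would be far lower than in these tiny snippets, and any cutoff between "probably a novel" and "probably not" would have to be chosen by looking at the data.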

But most (well, all) of my readers are historians, and you probably really got interested in the words a little farther down the curve. Particularly government, hanging out there like the moon in space. That's almost certainly telling us that government is discussed fairly intensively in a number of books and not at all in some others. How is that distributed across the period? What are the other words in the books that use government disproportionately? I can see at least two other keywords from Dan Rodgers's second book, people, hanging out in our space temptingly, and of course there are a number of other political words out there as well: constitution, property, money. There are some semantic databases (even one right here at Princeton) that class words by category, I think. It might be interesting to study the differences in these.

And at the very left, about fourth down, is species. If I hadn't started this project thinking Darwinism was a productive area of inquiry, this chart would have suggested it to me. Pretty cool. Can you see anything else on here that suggests other inquiries worth starting?

*(footnote from the third paragraph) Yes, I actually have a few words that fall below that theoretical minimum. They're mostly typos for either high roman numerals, or for the word illustrated. It has to do with the way I do my total wordcounts: they ignore words that appear alone in a paragraph, which the bookcounts don't. Here, we're just catching some title page/table of contents errors. I don't think it's a big problem, although I should fix it.

There's also some messiness in the curve below wordcounts of about 5,000, created by tricks my counting algorithm does to save RAM. Since we're not dealing with those words ever, I'm not worried about that right now either.

Comments:

Jamie - Nov 5, 2010

What about the words that fall right in the thick of the curve? I'm thinking of attention, which in the second graph looks like it has an average wordcount-to-bookcount ratio. It might be interesting to do a year-by-year look at how it settled into its place in the curve: was it originally used a lot in a few books? Or sparingly by a lot of books? That's a basic how-does-it-change question.

Also, when a word is used about as often as you might guess, that's probably good news for the philological/cultural analysis, since the question can open up to look at the words that surround the original word. Or to put it differently, I wonder if this gives you more permission to see semiotic changes in attention as significant.