
digitizecr by ljooq ic

Nov 10 2010

Obviously, I like charts. But I've also periodically been presenting data as a number of random samples. It's a technique that can be important for digital humanities analysis. And it's one that draws more on the skills of humanistic training, so it might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own: it's just a set of coordinates. Even in the big education datasets I used to work with, the core facts I was aggregating up from were generally very dull (one university awarded three degrees in criminal science in 1984; one faculty member earned $55,000 a year). But with language, there's real meaning embodied in every point, meaning that we're far better equipped to understand than the computer is. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can't read everything ourselves, but it's good to check up periodically; that's why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample are.
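A minimal sketch of what that kind of spot-check amounts to, assuming a hypothetical catalog file with one title per line (the filename and layout here are illustrative, not my actual data):

```python
import random

# Hypothetical catalog: one book title per line.
with open("catalog_titles.txt", encoding="utf-8") as f:
    titles = [line.strip() for line in f if line.strip()]

# Pull 20 titles at random to read against the aggregate charts.
for title in random.sample(titles, min(20, len(titles))):
    print(title)
```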

So any good text-processing application will let us delve into the individual data as well as giving us the aggregate picture. I'm circling around something commenter Jamie said, though not addressing it directly:

But it seems like [graphic presentation] is also an opportunity to present in more compact form the interpretive difficulties and possibilities that historical research involves more generally. The problem of knowing what sources are responsible for word peaks is an efficient way to complicate the issues of source and structure; and the follow-up investigations into genre suggest why the business of complicating might be useful or interesting, in addition to being a responsibility.

This is very true. Although I've been selling the loess curves and so on, it's worth remembering that every interpretation of these charts is, ipso facto, a completely traditional act of historical thinking.

And if we program it right, drilling straight down can be a very good illustration, or better, simplification, of the historical process. I can't do it yet, but wouldn't it be great not to be stuck speculating about the reason for that 1869 telegraph spike, but instead to just skim fifty of the five hundred sentences that use the word? That's sort of what we use Google Books for right now, but the very randomness can both reacquaint us with the weirdness of our sources and open us up more to the vagaries of the real world. One of the good things about the acts of interpretation charts force on us is that they're disconnected from other historical discourses: they can structure our answers, but they can only make limited contributions to the questions we ask. That's actually one of the reasons I strongly urge everyone not to use decade counts for this sort of analysis; one of the things that can hold us back in interpretation is what we think we know about, say, the 1890s as the decade of turmoil and the 1900s as the decade of reform.
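A minimal sketch of that drill-down, assuming a hypothetical corpus stored as one plain-text file per book with the year in the filename (my actual storage is organized differently):

```python
import glob
import random
import re

def sample_sentences(word, year, k=50, corpus_glob="corpus/*_{year}.txt"):
    """Return up to k random sentences containing the word from books of a given year."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for path in glob.glob(corpus_glob.format(year=year)):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # Crude sentence split: good enough for skimming, not for parsing.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if pattern.search(sentence):
                hits.append(sentence.strip())
    return random.sample(hits, min(k, len(hits)))

# Skim fifty of the sentences that use "telegraph" in 1869.
for s in sample_sentences("telegraph", 1869):
    print(s)
```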

One of my other long-term history-data goals has been to figure out how to use historical census data to randomly pull out some number of individual Americans from a given year. Where are they from? What do they do for a living? Who do they live with? Sometimes in the classroom we assign students identities that we take to be representative: one is the SNCC activist, another the Goldwater voter, another the union laborer. But what if we actually pulled the names out of a hat at the beginning of a class? That would be an interestingly different perspective on what America looked like. It might work as a short writing exercise, too. It would certainly be an interesting challenge for the teacher to cope with whoever actually came out.

*****
Rather than make a post out of it, I'm going to stick some bookkeeping at the end.

I said yesterday that I'd have data today on how words were distributed across different books. I did, but it highlighted some more problems with the OCR: the most universally distributed word was "i," which is acceptable, but number 2 was "Google." Most of the reason was that I wasn't stripping their intro page as effectively as I thought, which I've changed. But there are also a lot of watermarks scattered through the books, which are always returned slightly differently, in phrases like the title of this post. So I had to go back and edit my pre-processing script, which does things like strip out whitespace, put words that were strung across a line break by the printers back together, and that sort of stuff. I added a couple of other important steps to that script too, based on what the texts look like: now I strip all possessives before processing, so they show up as singulars, not plurals; I added a few more cases of periods that don't indicate the end of a sentence (mostly following common abbreviations: mr, prof, rev, and so on); and I skip any line that's more than 50% capital letters, on the theory that it's probably a table of contents or, much worse, a chapter heading or book heading that's repeated at the top of every page. (There are probably some others of those in lower case; they would be a _real_ pain to eliminate, and part of my contention in doing this is that we can see interesting things even without perfect data, which is pretty far off.) (Footnotes are a mess too: they interrupt sentences in the middle, are full of periods that don't signify sentence ends, and their actual fact of referencing is hard to pull out. I wonder if Google's OCR is much better than this.)
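For the curious, here is a rough sketch of the kind of cleaning I mean; the function names and the exact rules are illustrative, not the actual script:

```python
import re

# Periods after these abbreviations don't end sentences.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "rev", "hon", "st"}

def clean_page(text):
    """Illustrative OCR cleanup of the kind described above, not the real script."""
    # Rejoin words the printers split across a line break: "tele-\ngraph" -> "telegraph".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

    kept = []
    for line in text.splitlines():
        letters = [c for c in line if c.isalpha()]
        # Skip mostly-capitalized lines: likely tables of contents or running headers.
        if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
            continue
        kept.append(line)
    text = " ".join(kept)

    # Strip possessives so "book's" counts with "book", not with "books".
    text = re.sub(r"'s\b", "", text)

    # Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Naive sentence splitter that ignores periods after common abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        preceding = text[start:match.start()].split()
        last_word = preceding[-1].lower() if preceding else ""
        if last_word in ABBREVIATIONS:
            continue  # "mr. lincoln" should not break the sentence
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences
```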