You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Dunning Statistics on authors

Oct 07 2011

As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses: two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well-curated TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote: interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles.)

As an example, let's compare all the books in my library by Charles Dickens and William Dean Howells, respectively. (I have a peculiar fascination with WDH, regular readers may notice: it's born out of a month-long fascination with Silas Lapham several years ago, and a complete inability to get more than 10 pages into anything else he's written.) We have about 150 books by each (they're among the most represented authors in the Open Library, which is why I chose them), which means lots of duplicate copies published in different years, perhaps some miscategorizations, certainly some OCR errors. Can Dunning scores act as a crutch to thinking even on such ugly data? Can they explain my Howells fixation?

I'll present the results in faux-wordle form, as discussed last time. That means I use wordle.com graphics, but with the size corresponding not to frequency but to Dunning scores comparing the two corpuses. What does that look like?
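For readers who want to see what a score like this involves, here is a minimal sketch of a signed Dunning log-likelihood for each word across two token lists. It uses the simplified two-term form of the statistic; that choice, along with the function names `dunning_g2` and `dunning_scores`, is an assumption for illustration, not a record of the code behind the graphics in this post.

```python
import math
from collections import Counter


def dunning_g2(a, b, total_a, total_b):
    """Signed log-likelihood for one word (simplified two-term form).

    a, b: occurrences of the word in corpus A and corpus B.
    total_a, total_b: total token counts of each corpus (must be > 0).
    Positive means overrepresented in A, negative means overrepresented in B.
    """
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    g2 *= 2
    # Attach a sign so the two directions of overrepresentation separate.
    return g2 if a / total_a >= b / total_b else -g2


def dunning_scores(tokens_a, tokens_b):
    """Score every word that appears in either (non-empty) token list."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocabulary = counts_a.keys() | counts_b.keys()
    return {
        word: dunning_g2(counts_a[word], counts_b[word], total_a, total_b)
        for word in vocabulary
    }
```

Sorting the resulting dictionary by score and keeping the top few hundred positive (or negative) words gives a word-to-weight list of the kind a word-cloud tool can size by.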

Words overrepresented in Dickens vs Howells:


We get a bit of the British orthography, but less than I'd fear; and we do get a number of insights into Dickens's style (appearance, merry, little, bright, eyes) as well as some interesting social distinctions probably more reflective of the US vs. Britain and the mid vs. the late 19th century (gentlemen, gentleman, wot, coach).

Words overrepresented in Howells vs. Dickens:


Howells is dominated by two things compared to Dickens. First, that enormous looming "she" that denotes a significantly larger proportion of female characters, with clusters around it of words like mother, girl, girls; second, a string of short, common words that reflect the more pedestrian American style compared to Dickensian specificity, including a great number of fragments from contractions ("don" from "don't", "isn" from "isn't", and probably "ve" from "could've", "should've", "would've"?). One gets Howells's literary interests as well with literature, literary, etc., although I'd have to refine further to see if they came from criticism or from the inclusion of novels themselves as plot points in books like Silas Lapham.
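The contraction fragments are what you would expect from a tokenizer that treats the apostrophe as a word boundary. The post doesn't describe the tokenizer actually used, so the letters-only regular expression and the helper name `naive_tokens` below are purely illustrative assumptions:

```python
import re


def naive_tokens(text):
    """Split text into lowercase runs of ASCII letters.

    Anything that is not a letter, including the apostrophe,
    ends the current token.
    """
    return re.findall(r"[a-z]+", text.lower())


print(naive_tokens("I don't think he isn't; they could've gone."))
# → ['i', 'don', 't', 'think', 'he', 'isn', 't', 'they', 'could', 've', 'gone']
```

Under that assumption, "don", "isn" and "ve" get counted as words in their own right, so an author who writes a lot of colloquial dialogue will score high on them.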

How is reading these texts the same as or different than comparing Dickens and Howells themselves? It's not quite what I'd have expected on the Howells side: he comes off with an Austenian directness (Jane, not J.L.) that doesn't match my expectations. In some ways, I'd say that the comparison tells us far more about Dickens than about Howells.

Just how becomes clear when we compare Howells to a more comparable figure, Henry James. Howells still overuses common words, but now appears overly masculine in his character choices, and fond of the conjunction "and" (while Dickens used "and" significantly more than Howells).

Words overrepresented in Howells vs. James:

Words overrepresented in James vs. Howells:

Again, I feel like this more closely captures the distinctive qualities of James (moment-companion-charming-extraordinary-view) than of Howells. Even Howells's distinctive points (lots of boys?) can be seen as reflections more of James's attributes (endless portraits of ladies). I'd use it as a sort of evidence for the phenomenal blankness of Howells; it's one of the reasons he's an interesting source. (Dan Rodgers once told me he read a lot of Howells one summer to get a better sense of the late 19th century, since James was just too good to portray it blankly.)

But: we're firmly in the fun-with-wordle camp right here. This is not even senior thesis material; I wouldn't count myself qualified to make good English dept. pronouncements about different authors. But I would say: the algorithm seems to be doing a reasonable job using hundreds of minimally processed books to make meaningful distinctions here, and that's for the good. Perfect OCR isn't necessary to get started on this if we know what we're looking for.

Of course: what are we looking for? Some more database-building for me now, and we'll try to get there soon. If anybody can think of some great corpus-comparisons they'd like to see, let me know.

Comments:

Anonymous - Oct 5, 2011

Great stuff, as usual. I agree that it's possible to do a lot even with pretty uneven OCR. I know because sometimes after getting good results out of a corpus I go back and discover that I made a silly error in processing that should have really noisified and flattened the results. I don't know how to quantify this yet, but my gut feeling is that certain methods (clustering/LSA, for instance) are pretty robust and tolerate a great deal of noise.

I've actually been thinking about Dunning's lately too. I was put in mind of it by a great article a couple of months ago by Ben Zimmerman addressing the character of literary diction in a given period (i.e., Dunning's on a fiction corpus versus the broader corpus of works in the same period).

I'd like to incorporate a diachronic dimension to that analysis. In other words, first take a corpus of 18/19c fiction and compare it to other books published in the same period. Then, among the words that are generally overrepresented in 18/19c fiction, look for those whose degree of overrepresentation *peaks in a given period* of 10 or 20 years. Perhaps this would involve doing a kind of meta-Dunning's on the Dunning's results themselves!

Anonymous - Oct 5, 2011

Or, if the literariness/fictiveness of a word changes as steadily and gradually as I'm hoping, it might be possible simply to do time series graphs of the Dunning's log-likelihood statistic for a given word over the course of a century, by using a sliding temporal window while keeping the generic contrast constant.
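One way the sliding-window idea could be sketched: for a single word, pool the fiction and non-fiction counts inside each window of years, then compute a signed log-likelihood for that window. Everything here is hypothetical scaffolding: the layout of `yearly`, the names `signed_g2` and `sliding_g2`, the default window width, and the use of the simplified two-term form of the statistic.

```python
import math


def signed_g2(a, b, total_a, total_b):
    """Two-term log-likelihood; positive if the word is overrepresented in A."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    g2 *= 2
    return g2 if a / total_a >= b / total_b else -g2


def sliding_g2(yearly, width=20, step=1):
    """Score one word in overlapping windows of `width` years.

    yearly maps year -> (word count in fiction, total fiction tokens,
                         word count in other books, total other tokens).
    Returns a list of (window start year, signed score).
    """
    if not yearly:
        return []
    years = sorted(yearly)
    series = []
    # Only full-width windows that fit inside the observed span are scored.
    for start in range(years[0], years[-1] - width + 2, step):
        window = [yearly[y] for y in range(start, start + width) if y in yearly]
        if not window:
            continue
        # Pool the four counts across every year in the window.
        a, total_a, b, total_b = (sum(column) for column in zip(*window))
        if total_a == 0 or total_b == 0:
            continue  # no basis for a contrast in this window
        series.append((start, signed_g2(a, b, total_a, total_b)))
    return series
```

Plotting the second element of each pair against the first would give the time series described above; the generic contrast (fiction vs. everything else) stays fixed while only the window moves.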