You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Assisted Reading vs. Data Mining

Dec 30 2010

I've started thinking that there's a useful distinction to be made between two different ways of doing historical textual analysis. As a first stab, I'd call them:

  1. Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading: finding texts relevant to a topic, doing low-level things like counting mentions, etc.

  2. Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.

Humanists are far more comfortable with the first than the second. (That's partly why they keep calling the second type of work "text mining," even though I think the field has moved on from that label: it sounds sinister.) Basic search, which everyone uses on JSTOR or Google Books, is far more algorithmically sophisticated than a text-mining star like Ngrams. But since it promises merely to enable reading, it has casually slipped into research practices without much thought.
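To make the distinction concrete, here is a minimal sketch of the two modes in Python. Everything in it is an assumption made for illustration: the toy corpus is invented, and the function names `assisted_reading` and `text_mining` are hypothetical labels, not anyone's actual tool. The first function hands texts back to a reader; the second throws the texts away and keeps only numbers that could be charted.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (year, text) pairs, invented for illustration.
CORPUS = [
    (1850, "The railway changed the town. The railway brought news."),
    (1850, "A quiet harvest and a long winter."),
    (1900, "The telephone and the railway tie the cities together."),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def assisted_reading(corpus, term):
    """Find the texts that mention `term`, most mentions first.

    The output is still a list of texts: the computer only decides
    what the historian reads next.
    """
    hits = []
    for year, text in corpus:
        mentions = tokenize(text).count(term)
        if mentions:
            hits.append((year, mentions, text))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def text_mining(corpus, term):
    """Relative frequency of `term` per year, ngrams-style.

    The texts themselves are discarded; what remains is a series of
    numbers that has to be read in its own right.
    """
    totals, counts = Counter(), Counter()
    for year, text in corpus:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        counts[year] += tokens.count(term)
    return {year: counts[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    for year, mentions, text in assisted_reading(CORPUS, "railway"):
        print(year, mentions, text)
    print(text_mining(CORPUS, "railway"))
```

Note how little code separates the two: both start from the same token counts, and differ only in whether the text survives to the end.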

The distinction is important because the way we use texts is tied to humanists' reactions to new work in digital humanities. Ted Underwood started an interesting blog to look at ngrams results from an English lit perspective, and he makes a good point in his first post:

What puzzles me about humanistic disdain for the ngram viewer is that it often seems to presume that a piece of evidence must be legible in itself, naked and free of all context, in order to have any significance at all. If a graph doesn't have a single determinate meaning, read from its face as easily as the value of a coin, then what is it good for? This critique seems to take hyper-positivism as a premise in order to refute a rather mild and contextual empiricism. [Humanists] fear that the superficial certainty of quantitative evidence will seduce people away from more difficult kinds of interpretation.

I think there's a lot to this view; I've been trying to say some similar things from time to time. Word graphs like those in ngrams are just another kind of historical evidence. Yes, they require nuanced, contextual interpretation. But that's no different than other sorts of evidence. Graphs give us new texts to read, give a new platform for meditations and reflections, give a new sort of source to interpret.

And yet. I think the real fear is not about difficult vs. easy interpretation. It's about the privileged place of a particular form of reading in the humanities. Historians, in the Rankean tradition, pretty much read documents. Sometimes they read stained-glass windows or maps of archeological sites or advertisement illustrations, but those departures don't challenge the primacy of texts in the field. Humanistic disciplines preserve and elaborate traditions of reading different types of artifacts, whether they're poems, paintings, music, or diplomatic cables. That expertise is central not only to the disciplines, but to the self-identity of lots of humanists themselves.

Text mining produces completely different artifacts to read. We get summary tables, charts, line graphs. In the case of ngrams, they're almost severed from traditional books. Progressive professors like Underwood can try to read them, but the practice is quite different from looking at text. I think he gives more evidence to my claim that poststructuralist theory (although he says structuralist, being a little more interested in referents than I am) has to some degree prepared us to read these sorts of artifacts better. But at the extremes, the temptation with the new data is to model rather than to read: to chart out half-lives for fame as in Science, or to plot the prominence of presidents' centennials like I did. This is fun, but it's not clear how useful. Or more precisely, who it's useful for. (Maybe there's a market in parts of the culture industry for macroculturnomic forecasting: studios wanting to know if zombies are on their way out, etc.)

As a result, text mining is something of a challenge to the humanities, because it seems to promise to obviate their ways of reading: suddenly understanding Dirichlet distributions becomes more important than having a sophisticated ear for meter or an understanding of rhetorical conventions. Humanists love to complain about the decline of the humanities and their increasing exclusion from culture: they're well primed for heavily negative responses against pure text-mining approaches. We can scold those reactions away as grouchy or Luddite, but that misses the point. Old-guard humanists are right that their ways of reading are often designed to facilitate interactions between two people, the creator and the reader, and any programming solution that gets between the two, however ingenious, misses the point of what the humanities offer over the social sciences. Unless we want to reproduce the split within anthropology in all the humanities fields, there's no reason to clamor for the fight.

Assisted reading, on the other hand, is a much easier sell. As I said, search has been adopted without much thought, because it reinforces existing patterns of reading. It deprecates some of our expertise, to be sure: but those are mostly older research practices (card catalogs, letters to experts, treks to periodicals reading rooms after every footnote) that humanists are much less invested in. The problem with assisted reading is that most humanists regard it as not far removed from magic, and certainly don't engage in designing tools to do it themselves. This, I think, is one of the most important problems for the digital humanities: humanists use digital resources all the time, but are quite naïve about how they work and thus unaware of the potential to get more out of them. As a result, our resources are arranged in ways that make it far harder for us to use them. Aside from a few longstanding gems like the Perseus Project, we aren't involved in the ways that our resources go digital, and they end up in places like JSTOR with only one, suboptimal, way of getting at them. I've been thinking for a while about what humanists need to know about database design that they might not; hopefully I'll finally post that sometime soon.

I don't think we're stuck here. Some work slightly more sophisticated than artfully constructed search terms could really help to continue to demonstrate to humanists how digitization benefits them. (I'm sure there's a lot of this out there: but let me spin my lack of immediate examples as typical rather than merely embarrassing.) That sort of work makes the path to more sophisticated methods, even with non-textual outputs like charts and graphs, more palatable: something from inside the field, not an imposition from outside. Text mining and assisted reading are extremes on a spectrum, not discrete categories. (I'm sure it's clear by now that I think that about everything from genre to authorship, but it still bears repeating.) Assisted reading *does* rely on computers to dispose of many texts completely, and text mining always retains *some* lexical information, however heavily translated, at the end. The more work we can get in the middle of the spectrum, not just at the extremes, the better off we'll be.

Comments:

Ben - Dec 4, 2010

Interesting. There's some stuff in there (seriously, double PhDs? Aren't Harvard grad students spending enough time in Cambridge, already?) that I'm not sure I agree with, but there's definitely a need to apply the nuanced understandings of science from science studies to the humanities. I've been thinking a lot about the epistemology of error and how difficult it is to get humanists to accept imperfection.

Are historians of science better digital humanists? Dan Cohen does history of mathematics, of course, but in general, it seems like it's English depts. with comparatively less critical engagement with science that have really gotten on board the text-analysis train, rather than historians who have had some exposure. I don't really know, though. Hank? Dan?

Allen Riddell - Dec 5, 2010

Here's my reach for a concrete example of a possible gain. I think understanding a bit about Cantor's discoveries in mathematics really could deepen an understanding of late 19th century intellectual history (if only to get a sense of what all the excitement was about). And anyone who learns a bit of probability theory may well have to wrestle with infinite sets, and maybe even the Cantor set.

Clipping Path - May 5, 2014

I think this post has all the characteristics of the best post. Thank you a lot, man, for that.

Clipping Path Service - Sep 3, 2014

Nice work. Programs De administration De Pincas..