You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

In search of the great white whale

Apr 13 2011

All the cool kids are talking about shortcomings in digitized text databases. I don't have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests: the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it's not just at the margins we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here's an example.

Thanks to a question from Hank in the comments on a previous post, I went looking in my database for books by Herman Melville. I noticed that most of the ones I have are published in the 20th century. That's not surprising, given Melville's obscurity in his lifetime. Still, I'd like to have the books for any study of 19th century culture. The lack of first editions has bothered me before with Mark Twain; I actually changed my publisher list last time I remade the database so it would catch some of his works with obscure presses. But even though Melville usually published with Harper, Open Library only has a few Melville texts published in his lifetime that meet my metadata criteria: two 1847 Typees, one 1849 Mardi, one 1856 Piazza Tales, and that's it. No Moby-Dick, no [Billy Budd](http://openlibrary.org/works/OL102746W/BillyBudd), and only a microform copy of the Confidence Man without a library call number. HathiTrust's earliest copy of Moby Dick, just like OL's, is from 1892. Google Books's interface is less well FRBRized, but a search for the first edition of Moby Dick shows mostly the bad initial reviews, along with a number of empty bibliographic records Google doesn't seem to realize refer to the same edition. The list that search returns for me starts with one that declares Moby-Dick a joint project among Melville, Mark Twain, and Mortimer Adler. (I assume Adler wrote all the pedantic bits about whale biology, and Twain cashed his check as soon as he wrote the bit about knocking people's hats off in the street.)

Why no Moby Dick in the libraries? Here's my guess. The Google book digitization project is the only source for Google Books, and the biggest for Open Library and Hathi. We've got different sources, but they're all using the same library books from the same scanning sessions. Think about how the Google scanning project worked. They set up cameras in various library collections and started scanning books shelf-by-shelf, I believe, which is a sensible way to get a bunch of digital texts in a hurry. But there's a catch: no university library would possibly still have a first edition of Moby-Dick on its shelves in the 2000s. Any good library would have moved it to rare books long ago. If they didn't, some enterprising undergraduate would have snatched it up to pay a year or two's tuition. It's a cultural artifact, a prince among books: it's too important to leave among the plebes in the stacks.

So just because the first edition of Moby-Dick is such a cultural touchstone, just because we want to preserve it so much, it wasn't among the first 10 million or so volumes we put in our most important digital libraries. Perhaps some collection did their own scan of Moby-Dick in the early days of digitization. I'd be surprised if not. But if so, it isn't easy to find: Yale seems to have given up after two pages on the copy in Beinecke, and that's the only thing I'm turning up on Google. Any academic-led scanning project might well have started with this book; but the quantity-over-quality approach that Google Books has used means we still don't have it easily accessible. It's not the only one. The Adventures of Huckleberry Finn exists in the Bodleian Library copy of the 1884 first British edition, but the first American edition (1885) seems to be missing. The Bodleian's indifference about American classics seems also to be responsible for Google Books's copy of the Confidence Man, not present in Open Library, but a lot of books are missing entirely: there's no [Tom Sawyer](http://www.google.com/search?q=Tom+Sawyer&hl=en&sa=X&ei=ZQ2lTdyrOMm10QGhlp3-CA&ved=0CBYQpwUoBA&source=lnt&tbs=cdr%3A1%2Ccdmin%3A1876%2Ccdmax%3A1876&tbm=bks), either, and no Origin of the Species until 1861. I'm sure the list goes on.

That's ironic, but also a neat little parable about how the touchstones of the mid-century academy are approaching the Internet. We're so focused on preserving the book that most contemporary academic research in the humanities is as inaccessible as it's ever been: journal articles available only from within university campuses, and books not available online at all. Since scholars haven't been heavily involved in putting things online, not only the first edition of Moby-Dick but most of the current scholarship about Moby-Dick is still nearly invisible on the Internet. Protecting the culture of the past to be used like it always has been means excluding it from new currents of consumption.

That's a neat story, but is it really a big deal? For some text analysis, to be sure, this is a bit of a pain. I'd really like a complete run of Melville's or Twain's works to compare to the language of their contemporaries, but lacking the first editions, I have to rely on later publications. In extreme cases of late rehabilitation like The Confidence Man, Open Library has only that one oddly cataloged microfilm copy before the copyright cutoff: and where there are public domain copies, it requires really good FRBRization to be able to get the original publication year. And of course, as I said about Tom Sawyer and Google ngrams a while ago, it may be hard to sell humanists on the idea that text analysis measures the entire culture when it's missing its most central documents.
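To make the FRBRization point concrete, here is a minimal sketch of what "getting the original publication year" amounts to when all you have are edition records: group the editions by work and take the earliest year you hold. The record layout and the function name are my own inventions for illustration, not any library's actual schema or API, and some of the years are made up (only the 1847 Typee, 1849 Mardi, and 1892 Moby-Dick echo holdings mentioned above).

```python
from collections import defaultdict

# Hypothetical edition records: (work title, year of the scanned edition).
# The layout is invented for illustration; the 1892 Typee and the 1922
# Moby-Dick are made-up later reprints, not real catalog entries.
editions = [
    ("Typee", 1847),
    ("Typee", 1847),
    ("Typee", 1892),
    ("Mardi", 1849),
    ("Moby-Dick", 1892),
    ("Moby-Dick", 1922),
]


def earliest_held_year(records):
    """Group editions by work and return the earliest year held for each."""
    years = defaultdict(list)
    for title, year in records:
        years[title].append(year)
    return {title: min(held) for title, held in years.items()}


first_seen = earliest_held_year(editions)
for title, year in sorted(first_seen.items()):
    print(title, year)
```

The earliest held copy is only a stand-in for the original publication year. It works for Typee, where a lifetime edition was scanned, but it dates Moby-Dick to 1892 rather than 1851, which is exactly the kind of error a missing first edition introduces.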

But for the real distant reading stuff, I don't think it matters much. For any project that actually takes advantage of what digital reading allows, quantity matters far more than quality. If we wait for nice TEI editions of all books to show up, it will be decades before anyone could leverage the most interesting techniques one can use on large bodies of texts. Either you're doing Melville studies, in which case you can just add a copy of Moby-Dick to your database, or you're not, in which case a few nautical terms here and a few whale skeletons there aren't going to change the language very much. Any study whose results would be changed by a couple books is probably trying too hard to wring evidence out of a small sample size. As long as I'm right that the really famous books are missing just because they're so famous, and not because the database is completely ridden with holes, the general picture of the language should be fine. At some point, I'd hope Google, Hathi, or Open Library would take the lead in scanning books from rare-books libraries; I can't imagine, honestly, that the last library will actually be missing these texts for long. But I would be surprised if there were more than a few dozen books in the 19th century that meet Moby Dick's criteria of incredible modern value and original obscurity.

Nonetheless, it's a helpful reminder that we've got a long way to go before we can talk about comprehensive book digitization. Our current collection of texts is skewed in all sorts of strange ways. Not only by library collection patterns, but by where they keep their physical books, by what books are easier to enter consistent metadata for, by how certain authors' reputations waxed and waned. This is all more evidence that we're just beginning to get a sense of how our big digital libraries differ from our old stone ones. And that without checking each other's work for mistakes of the type that only specialists, or librarians, or archivists can catch, we might find ourselves in some uncomfortable situations.

Comments:

Natalie Houston - Apr 3, 2011

Great post, Ben, that raises important questions not only for DH/text-analysis studies but also other academic researchers using these repositories. The 1st ed of Moby Dick is included in the Wright American Fiction collection, as are other canonical heavy-hitters. It would require a bit of extra effort to download what you want, but you could get it there I think.

Ben - Apr 3, 2011

John Overholt from Harvard e-mails me this comment which blogger wouldn't let him post:

~~~~
You're absolutely right about why the 1st ed. of Moby Dick hasn't been digitized: at Harvard, all the copies are at Houghton, and we didn't participate in the Google scanning. We are digitizing more and more of our collections, but our top priority is unique manuscript items, rather than a (comparatively) widely held printed book. It's almost a sort of tragedy of the commons situation: everybody is waiting for somebody else to digitize it. You've also touched on a major problem with the digitization being done by libraries around the world: it's very difficult to search for which library might have happened to digitize the particular work you want.

You might be interested to know that we have digitized Melville's Billy Budd manuscript and his copy of the Essex narrative, an important source for Moby Dick.

http://nrs.harvard.edu/urn-3:FHCL.Hough:4686413

http://nrs.harvard.edu/urn-3:FHCL.Hough:2641693

Allen Riddell - Apr 3, 2011

Great catch.