Digital history and the copyright black hole

Jan 21 2011

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I've called 1922 "the year digital history ends" before; for the kind of work I want to see, it's nearly an insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

The Sonny Bono Copyright Term Extension Act is a black hole. It has trapped 95% of the books ever written, and 1922 lies just outside its event horizon. Small amounts of energy can leak out past that barrier, but the information they convey (or don't) is minuscule compared to what's locked away inside. We can dive headlong inside the horizon and risk our work never getting out; we can play with the scraps of radiation that seep out and hope they adequately characterize what's been lost inside; or we can figure out how to work with the material that isn't trapped to see just what we want. I'm in favor of the last option: let me give a bit of my reasoning why.

My favorite individual ngram is for the zip code 02138. It is steadily persistent from 1800 to 1922, and then disappears completely until the invention of the zip code in the 1960s. Can you tell what's going on?

The answer: we're looking at Harvard library checkout slips. There are a fair number of Harvard library bookplates in books published before 1922, and then they disappear completely because Harvard didn't want any in-copyright books included in the Google set. That's not true of all libraries; in fact, you can see that University of Michigan book stamps actually spike up to make up for the Widener shortfall from 1922 on. Type in some more library names, and you can see a big realignment right at the barrier year:
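If you want to check this sort of thing outside the web interface, the raw 1-gram exports Google has released are enough. Below is a minimal sketch, assuming the files are tab-separated with the year in the second column and a volume count in the last (the exact columns and the shard file name vary by release, so treat both as assumptions rather than gospel); it just tallies how many volumes per year contain a token like 02138.

```python
# Sketch: tally yearly volume counts for one token from a Google Books
# 1-gram export file. Assumes tab-separated rows where the token is the
# first field, the year the second, and a volume count the last; check
# the documentation for the release you actually download.
import gzip
from collections import Counter

def yearly_counts(path, token):
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields[0] != token:
                continue
            year, volumes = int(fields[1]), int(fields[-1])
            counts[year] += volumes
    return counts

if __name__ == "__main__":
    # Hypothetical shard name; the real exports are split across many files.
    counts = yearly_counts("eng-all-1gram-shard-0.gz", "02138")
    for year in sorted(counts):
        print(year, counts[year])
```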

So what? There are probably some differences in library collections, scanning conditions, book quality, etc., that might induce a little actual error into ngrams. But it's not enough to worry about, I'm willing to bet. My point is more general; the jump highlights the fact that we basically have two separate archives for digital history. You can stitch them together, but the seams remain if you know where to look for them; and we need to think about how to use those two archives differently. Because one of them is a heck of a lot better than the other.

That's not obvious from many of the digital tools built for humanists so far. A lot of our academic databases pretend these disjunctures don't exist, as if the information that seeps out past the collapsed star of the American publishing industry is all scholars need. Ngrams treats all of history equally. Jstor doesn't give you any more access to pre-1922 journals than it does to ones after. With institutions like worldcat, we're letting some of our old data accrue a trail of metadata inside the event horizon that starts to drag it down, away from true open accessibility.

But if we don't think about this line and build our plans around it, we'll miss out on the most exciting possibilities for digital humanities work. I am convinced that the best digital history (not the most exciting, not the most viewed, but the most methodologically sophisticated and the most important for determining what's possible) is going to be done on works outside the copyright event horizon. Everything exciting about large-scale textual analysis (topic modelling, natural language processing, and nearly anything I've been fiddling with over the last couple of months) requires its own special way of breaking down texts. For the next few years, we're going to see some real progress on a variety of fronts. But we need to figure out just what we can get out of complete texts before we start chopping them up into ngrams or page snippets or digital books you can check out a chapter at a time. And since we have the complete texts only outside the black hole, that's where we're going to figure it out. We'll certainly keep trying things out on the scant information that escapes, but our sense of how well that works will be determined by what we can do with the books we can actually read.
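To make concrete what that chopping-up throws away, here is a toy illustration of my own (nothing to do with how the actual dataset was built): reducing a run of public-domain text to bare 2-gram counts. Everything outside the n-word window, which is exactly what full-text work gets to use, is gone.

```python
# Toy sketch: what a full text looks like once it has been reduced to
# n-gram counts. Sentence structure, page context, and anything longer
# than the n-word window disappear in the process.
import re
from collections import Counter

def ngram_counts(text, n=2):
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "Call me Ishmael. Some years ago, never mind how long precisely, having little or no money in my purse."
for gram, count in ngram_counts(sample, 2).most_common(5):
    print(" ".join(gram), count)
```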

That's already happening. The Victorian Books project has picked just about exactly the years you would want to use, which is one of the reasons it's potentially exciting for history. (Although the Google metadata, like worldcat, seems to have a trailing edge inside the event horizon.) The Stanford folks seem to stand a little more surely outside, on the 19th-century side of the digital divide, though I admit I still don't have a good sense of other history (that's my main interest, of course) being done at the largest scales. The MONK datasets seem to have restrictions of some type? (I should probably do my research a little better, but this is a polemic, not a research paper.) In American history, the period with the richest textual lode is the Gilded Age-Progressive Era, which has been searching for a reformation of its historiography for years. If I were to stay in the profession, there's no question that's where I'd be most inclined to plant my flag. In any case, I think of this blog in large part as a place to figure out roughly what I'd like to make sure someone gets to do: what tools and techniques seem promising or interesting for studying changes in language over time, and which of them certain restrictions on texts might preclude.

But as services spring up, like ngrams or the Jstor data for research (I think; I haven't quite figured that one out yet), all this diversity tends to get collapsed into just one type of measure. Convenient web interfaces giveth, convenient web interfaces taketh away. One of the more obscure things that troubles me about ngrams, as I said, is that it papers over the digital divide by enforcing the copyright rules even on the out-of-copyright material. In a way, I think of the kind of tokenized data that the Culturomists managed to cajole out of Google as not just a database but also an anonymizing scramble, like on a real-crime show. The genomics parallel works well here, except that while individual privacy needs to be protected in genetic databases, there is no compelling reason to hide away information about the lexical makeup of books. I heard that at the AHA, the Culturomics folks said they had trouble getting anyone at Harvard to host even the ngrams datasets for fear of copyright infringement. That is a) completely insane, b) sadly believable, and c) a suggestion that the hugely aggregated Google ngrams data is about as far in the direction of openness as we can get along the many-word-token line.

Maybe, though, the approach the Culturomists take to dealing with the copyright period isn't the best one. At the least, it isn't the only possible one. Since not everything is subject to the crazy distortions of reality that apply inside the black hole, we can actually find that out. That's why investigation into the pre-copyright texts is the most important task facing us over the next few years. Before the service providers decide what sort of access we get to books for the next decade, as they did with journal articles around 1995 or whenever, digital humanists themselves need to decide what we want. One of the things I think we can learn by looking at the pre-copyright texts is just what sort of data is most useful. For example: I can derive sentence-level collocations data from my database of 30,000 books, but for the last month or so I haven't found myself needing to use it much. Instead, I just use correlations in word use across the books and multi-word search. Maybe when I actually try to write a paper, I'll find that I need the sentence data again. But if not, maybe just the metadata-linked 1-grams would be enough. Could that slip out of the black hole? I don't know.
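For what it's worth, here is a toy version of the kind of cross-book correlation I have in mind, on the assumption that all you have is a books-by-words matrix of 1-gram counts; the names and numbers are invented, and the point is only that nothing finer-grained than metadata-linked wordcounts is required.

```python
# Toy sketch: rank words by how strongly their per-book frequencies
# correlate with a target word's, using only book-level 1-gram counts.
import numpy as np

def correlated_words(counts, vocab, target, top=10):
    """counts: books-by-words matrix of raw counts; vocab: column labels."""
    counts = np.asarray(counts, dtype=float)
    freqs = counts / counts.sum(axis=1, keepdims=True)   # per-book relative frequencies
    target_col = freqs[:, vocab.index(target)]
    # Pearson correlation of the target word's usage with every column
    corrs = [np.corrcoef(target_col, freqs[:, j])[0, 1] for j in range(len(vocab))]
    ranked = sorted(zip(vocab, corrs), key=lambda pair: pair[1], reverse=True)
    return [pair for pair in ranked if pair[0] != target][:top]

# Tiny fake example; real use would be 30,000 books and a much larger vocabulary.
rng = np.random.default_rng(0)
toy_counts = rng.integers(1, 50, size=(200, 5))
toy_vocab = ["whale", "ship", "harpoon", "tariff", "railroad"]
print(correlated_words(toy_counts, toy_vocab, "whale", top=3))
```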

(In practice, it seems insane to me that the wordcount data for a book would be copyright protected. In my fondest dreams, I'd like to see the AHA push the boundaries on some of these copyright issues a few years down the road. It could post a number of post-1922 tokenized texts on its website to provoke a lawsuit and possibly clear out just a few types of lexical data as fair use. But then again, most things about copyright law seem insane to me, and we're a long way off from any organization caring enough about the digital humanities to go to court to defend its building blocks.)

So long as we fully appreciate what we have in the public domain, the overall situation looks pretty good. The black hole isn't expanding, and we may be able to get quite a bit of information over that event horizon yet. Plus, the possibilities for all kinds of interesting work exist given the amount of data and metadata that's in the true public domain. As our hard drives grow and processors speed up, it gets increasingly feasible to deal with massively large bodies of text on small platforms. There just aren't that many books published before 1922; a laptop hard drive could almost certainly now fit a compressed Library of Congress, which wasn't true even five years ago. I'm trying to adapt my platform to use more of the open library metadata I just linked to, and it's an embarrassment of riches.
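A rough back-of-envelope, with round numbers that are my own guesses rather than anything measured: a few million distinct volumes from before the barrier year, at a couple hundred kilobytes of compressed plain text apiece, comes out to well under a terabyte.

```python
# Back-of-envelope check on the "compressed Library of Congress on a laptop"
# claim. Both figures below are assumptions for illustration, not measurements.
volumes = 3_000_000        # assumed number of distinct pre-barrier books
kb_per_volume = 250        # assumed compressed plain-text size per book
total_gb = volumes * kb_per_volume / 1_000_000
print(f"~{total_gb:.0f} GB of compressed text")   # on the order of 750 GB
```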

Further to the good is that nearly all the active players are on the side of the angels. Most digital humanists themselves, from the Zotero Commons to the Culturomics datasets, are eagerly promoting a Stallman-esque freedom. Even the corporation involved is Google, almost certainly the multinational with the best record on issues of access. We'd all be using a few Project Gutenberg texts if Hathi, the Internet Archive, and everyone else didn't have their scanned PDFs and those of the projects that tried to catch them.

I have two fears, though. The first, which I've already talked about a bit, comes from the technical side. I'm afraid we might let the perfect be the enemy of the good on issues like OCR quality and metadata. The best metadata and the best OCR are probably going to be the ones with the heaviest restrictions on their use. If the requirement for scholarship becomes access to them, we either tie our hands before we get started, restrict work to only those labs that can get access to various walled gardens, or commit ourselves to waiting until teams of mostly engineers have completely designed the infrastructure we'll work with. Things are going to get a lot better than they are now for working with Google or Jstor data. We want to make sure they get better in the ways that best suit humanistic research in particular. And we want to make sure text analysis is a live possibility on messier archives: digitized archival scans from the Zotero Commons, exported OCR from newspaper scanning projects, and so on.

My second worry is that most historians, in particular, just aren't going to get on board the Digital Humanities train until all the resources are more fully formed. As a result, we might not get historian-tailored digital resources until the basic frameworks (technical, legal, etc.) are already fixed. Historians, I do believe, read more of the long tail of publications than anyone. But there's an ingrained suspicion of digital methods that makes historians confess only in hushed whispers that they use even basic tools like Google Books; and at the same time, a lot of historians, particularly non-computer-friendly ones, carry an intrinsic sympathy for the makers of books that leaves them disinclined to regard pushing the envelope of copyright law as a fully noble endeavor. (Coursepacks excepted.) With physical books, the 1922 seam isn't nearly so obvious as with digital texts; as a result, the importance of pushing the copyright envelope isn't always clear.

So part of the solution for making the archives safe for digital history is getting the profession a bit more on board with digital history, particularly old-fashioned humanities computing, in general. That's doable. Even our senior faculty are not trapped forever in the icy ninth circle of the Sonny Bono black hole, where the middle head of Satan eternally gnaws on a cryogenically frozen Walt Disney. They just need some persuading to climb out. I talked to Tony Grafton for a while this week about his plans to bring the AHA into the digital age; I think, after all the reports about the reluctance of historians to use anything more than basic tools, we're actually on the verge of getting somewhere. But just how he and the rest of the vanguard will pull them along is one of the trickiest and most interesting questions in the digital humanities today. That's one of the things I want to start to think about a little more next.

Comments:

Jamie - Jan 5, 2011

I feel honored to be even indirectly included in the same paragraph as a cryopreserved Walt Disney.
