You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Openness and Culturomics

Jan 20 2011

The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I'll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.

That issue is deeply tied to some two-cultures questions about just what openness means. Matthew Jockers and Ted Underwood have been calling for a full release of the list of books behind the ngrams dataset. Culturomics (that's as clear as I can be on authorship, unfortunately) says they have not received permission yet to release the list of 5.2m books behind the set. I assume that's because Google's metadata is subject to proprietary restrictions from catalog aggregators and publishers; it will be interesting to see how they get out of that. Depending on what's in the metadata and its release format, that could range from barely readable to quite interesting.

On the other hand, as Culturomics points out, they have been commendably open with the ngrams data. They do far more than historians would have (I suspect) to make their experiments easily replicable, and their pages seem to indicate a plan to release more, and better-cleaned, data as time goes on. They seem to view the repository of data they're setting up as a field-changing contribution that will drive research in the quantitative study of culture. Files of the magnitude they're putting out are only possible for a very few organizations, and Google is certainly the best positioned of those. If so, they're right to be so proud of their openness, and they're also right to have put that ahead of the bibliography. Cleaning and tokenizing textual data is a dreary task with enormous returns to scale, and if everyone needs basically the same dataset, research can move ahead faster with it even before the exact details are known.

But then again: if everyone needs the same dataset. That's a huge caveat, and it certainly isn't completely true. Some people want part-of-speech tagging on a representative corpus, and they'll want COHA or its descendants. Some people want very precisely edited texts of relatively canonical works, and they'll use MONK or WordHoard with the highly edited and tagged texts that come out of what Martin Mueller is calling "digital lower criticism." Part of what happened when Culturomics came out to somewhat reserved enthusiasm is that the people interested in computer textual analysis (who already have systems in place and a clear idea of their needs) quickly realized that it didn't do what they needed, and that for many tasks (comparing versions of the folios? full part-of-speech tagging?), it might never make it.

At the same time, there are a lot of humanists who are still unclear on what, if anything, they can get out of lexical statistics. Some found ngrams eye-opening; some just found it fun; and some, I think, were put off a bit: first by the scientific packaging, and second by the lack of traditional humanist niceties like a bibliography, some historiography, or a clear statement of what existing problems ngrams will solve. Since all humanists, by union contract, own the exact same 2008 white MacBook I have, most couldn't do anything much with the gigabytes of text files offered for download. You certainly can't open them with Excel to find what you want; even getting a basic wordcount for a span of years requires some sort of program (a sketch of what that might look like follows below). Casual humanists would probably be happier with less openness and more clarity: too much information can seem like a way of stonewalling, particularly when the information you most want isn't necessarily there. In any case, what they didn't necessarily get is a sense of the immediate applicability to live questions in the humanities.
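To make that concrete, here is a minimal sketch (my own, not anything the Culturomics team distributes) of the kind of small program I mean: it streams one of the downloaded 1-gram files and totals a word's matches over a span of years, assuming the tab-separated layout of ngram, year, and match count on each line. The file name in the usage line is just a placeholder.

```python
import csv

def count_in_span(path, word, start_year, end_year):
    """Total the match counts for one word over a span of years.

    Assumes each line of the 1-gram file is tab-separated as:
    ngram, year, match count (any further columns are ignored).
    """
    total = 0
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0] == word:
                year = int(row[1])
                if start_year <= year <= end_year:
                    total += int(row[2])
    return total

# Hypothetical usage; the file name is a placeholder:
# print(count_in_span("eng-all-1gram-slice.tsv", "railroad", 1830, 1860))
```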

That sense of distance comes, in part, from the different types of openness. The openness of the sciences is built around the replicability of experiments with relatively constrained goals; in the humanities, by contrast, no one yet knows just what we're headed for. (Particularly in history, which I promise to write a lot more about later.) The openness of culturomics will let a thousand flowers bloom, but only in one type of research. Of course there will be other types: but the combined cultural capital of the route this project has taken (Harvard, Google, Science, the New York Times) makes it important to be clear that the release of centralized datasets, a la genomics, is not the long-term solution to enabling digital history.

I should be equally clear, though, that it is a long-term solution: some sort of baseline data is incredibly useful for all sorts of textual analysis, and whatever Google provides will probably be the best we get. I've had a little trouble so far figuring out how to use the Culturomics data to clean up my own dataset (largely because I don't want to adopt their ways of using capitalization and apostrophes, for memory-saving reasons), but with a larger dataset those sorts of problems should melt away. As they roll out more sets of genres with better metadata, the data will provide an amazing group of genre baselines for comparison with more localized texts.
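For what it's worth, one hypothetical way around the capitalization mismatch (not how Culturomics stores its data, just a sketch of how I might reconcile the two) is to fold their counts to lowercase before comparing them against a lowercased corpus of one's own; apostrophes would need a rule of their own.

```python
import csv
from collections import defaultdict

def fold_counts_to_lowercase(path):
    """Re-aggregate 1-gram counts with case folded, so the totals can be
    lined up against a corpus that was tokenized in lowercase.

    Assumes the same tab-separated layout as above (ngram, year, match count);
    apostrophe normalization is left out and would need its own rule.
    """
    totals = defaultdict(int)  # (lowercased ngram, year) -> summed match count
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row:
                totals[(row[0].lower(), int(row[1]))] += int(row[2])
    return totals
```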

But given the limitations of ngrams (using the word generically) data, I tend to think that data's usefulness will rest not only in its openness a la genomics, but in its ability to complement other data sources. If Google ngrams is the best solution we can come up with for linking book metadata to textual data, we are going to be largely restricted to the studies of fame and repression in culture that the culturomists have been releasing so far: studies that don't get to the core of most of the historiography, which relies on a lot of different ways of thinking about how the web of language intersects at various levels.

Openness in the digital humanities needs to be about interoperability as well as replicability. Ngrams is stellar on the second, and merely good on the first. Moving forward, I wonder how we can do better.

I think that's all I really have to say for now, aside from a couple of reflections about copyright I'll post in a bit. But I should put up their response to the problem of lack of humanistic involvement in their project that I worried about, though I wasn't the first (nor was Menand, I'm sure):

  1. Why were there no humanists involved in this project?

That's incorrect. Erez studied Philosophy at Princeton as an undergrad and did a master's degree in Jewish History working with Elisheva Carlebach. Two of our other authors, Joseph Pickett (PhD, English Language and Literature, UMichigan) and Dale Hoiberg (PhD, Chinese Literature, UChicago), are the Executive Editor of the American Heritage Dictionary and the Editor-in-Chief of the Encyclopaedia Britannica. In addition, we were in contact with many humanists throughout the life of the project.

But more than just wrong, it's irrelevant. What matters is the quality of the data and the analyses in the paper and what it means for how we think about a great variety of phenomena, not the degrees we happen to hold or not to hold. If what we seek is a serious conversation about this work, we shouldn't exclude anyone who has something significant and thoughtful to say. That would be a shame.

When I was researching a section of the Humanities Indicators about the Humanities Workforce, one of the priorities was to be inclusive about the range of occupations (editors, secondary teachers, journalists, not to mention archivists, librarians, and museum curators) whose members were professional humanists without a research-university chair. There's no bright line, and that's good. I certainly don't want to write anyone out peremptorily.