You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Posts with tag Ngrams


Apr 03 2014

Here's a little irony I've been meaning to post. Large-scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a book is in ways that make the selection criteria for Ngrams (if it made it into print, it must have _some_ significance) completely meaningless.

Jul 15 2011

Starting this month, I'm moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I'm very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:

Apr 13 2011

All the cool kids are talking about shortcomings in digitized text databases. I don't have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests: the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it's not just at the margins we're missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here's an example.

Jan 21 2011

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I've called 1922 the year digital history ends before; for the kind of work I want to see, it's nearly an insuperable barrier, and it's one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

Jan 20 2011

The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I'll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.

Dec 30 2010

I've started thinking that there's a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I'd call them:

Dec 23 2010

Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There's just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there's no reason to use the ngrams data rather than just downloading the original books, because:

Dec 19 2010

I wrote yesterday about how well the filters applied to remove some books from ngrams work for increasing the quality of year information and OCR compared to Google Books.

Dec 18 2010

As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you're pretty much guaranteed an explosion of theories and methods.

Dec 17 2010

(First in a series on yesterday's Google/Harvard paper in Science and its reception.)