You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Catalog data and genre

Dec 01 2010

Mostly a note to myself:

I think genre data would be helpful in all sorts of ways: tracking evolutionary language through different sciences, say, or finding which discourses are the earliest to use certain constructions like "focus attention." The Internet Archive books have no genre information in their metadata, for the most part. The genre data I think I want to use would be Library of Congress call numbers, which divide up books in all sorts of ways at various levels that I could parse. It's tricky to get from one to the other, though. I could try to hit the LOC catalog with a script that searches for title, author and year from the metadata I do have, but that would miss a lot and maybe produce false positives, plus the LOC catalog is sort of tough to machine-query. Or I could try to run a completely statistical clustering, but I don't trust that it would come out with categories that correspond to ones in common use. Some sort of hybrid method might be best; just a quick sketch below.
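To make "various levels that I could parse" concrete, here is a minimal Python sketch of splitting a call number into coarse-to-fine levels. The function name `parse_call_number` and the returned keys are my own inventions for illustration, and the regular expression only handles the leading class letters and class number; real call numbers have cutters, dates and plenty of irregularities that this ignores.

```python
import re

# Leading class letters (one to three capitals), then an optional class number.
_CALL_NUMBER = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?")


def parse_call_number(call_number):
    """Split an LC call number into coarse-to-fine levels, or return None."""
    match = _CALL_NUMBER.match(call_number.upper())
    if match is None:
        return None
    letters, number = match.groups()
    return {
        "main_class": letters[0],  # e.g. "Q" (Science)
        "subclass": letters,       # e.g. "QB" (Astronomy)
        "class_number": float(number) if number else None,
    }


print(parse_call_number("QB501 .P96"))
```

The coarser levels (`main_class`, `subclass`) are probably the ones to use as genre labels at first, since they will have the most books behind each category.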

A substantial fraction (I'm not sure how many, but probably in the 10-40% area) of the books on the Internet Archive site have associated MARC records, like the following: http://ia341018.us.archive.org/2/items/orbsaroundusase01procgoog/orbsaroundusase01procgoog_marc.xml . Those include LOC call numbers. Using those as a base, I could build up some sort of seeded clustering algorithm (though I don't know how to do that yet) that would let me start the machine classification off of a base of a few hundred books in each of the larger LOC catalog areas I'll be dealing with.
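One way the seeded clustering might look, as a rough Python sketch rather than a worked-out method: build a centroid for each LOC class from the books that already have call numbers, assign every other book to its nearest centroid, then recompute the centroids and repeat, always keeping the seed books in their own class. This is a k-means-style loop with fixed seeds; all of the names here (`vectorize`, `seeded_cluster`, and so on) are hypothetical, and raw word counts stand in for whatever weighting would really be needed.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse vectors (cosine when both are unit length)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def centroid(vectors):
    """Sum a list of sparse vectors and rescale the result to unit length."""
    total = defaultdict(float)
    for vec in vectors:
        for w, v in vec.items():
            total[w] += v
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {w: v / norm for w, v in total.items()} if norm else {}


def seeded_cluster(seeds, unlabeled, iterations=10):
    """seeds: {book_id: (label, text)}; unlabeled: {book_id: text}.

    Returns {book_id: label} for the unlabeled books.
    """
    seed_vecs = defaultdict(list)
    for label, text in seeds.values():
        seed_vecs[label].append(vectorize(text))
    vecs = {book_id: vectorize(text) for book_id, text in unlabeled.items()}

    # Start from centroids built only from the books with known call numbers.
    centroids = {label: centroid(vs) for label, vs in seed_vecs.items()}
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            book_id: max(centroids, key=lambda lab: cosine(vec, centroids[lab]))
            for book_id, vec in vecs.items()
        }
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has settled
        assignment = new_assignment
        # Recompute centroids; seed books always stay in their own class.
        members = {label: list(vs) for label, vs in seed_vecs.items()}
        for book_id, label in assignment.items():
            members[label].append(vecs[book_id])
        centroids = {label: centroid(vs) for label, vs in members.items()}
    return assignment
```

Called with a few hundred seed books per class, `seeded_cluster(seeds, unlabeled)` would return a guessed class for every book without a call number. Whether those guesses are any good is exactly the open question.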

There are also OCLC ids and LCCNs associated with some books, probably more than have LOC call numbers. Those could work as access points too, but would involve requests to LOC or Worldcat in between. That step might be necessary anyway to get more catalogue data.
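Pulling these identifiers out of a MARC XML record could look something like the Python sketch below. It relies on the standard MARC field numbers (050 for the LC call number, 010 for the LCCN, and 035 for system control numbers, where OCLC ids usually carry an "(OCoLC)" prefix). The helper names and the `SAMPLE` record are made up for illustration, and I am assuming the Internet Archive files follow the usual MARCXML layout of `datafield` and `subfield` elements.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Drop any XML namespace, so records parse with or without one."""
    return tag.rsplit("}", 1)[-1]


def marc_subfields(xml_text, tag, code):
    """Return every value of subfield `code` in datafields numbered `tag`."""
    values = []
    for elem in ET.fromstring(xml_text).iter():
        if _local(elem.tag) != "datafield" or elem.get("tag") != tag:
            continue
        for sub in elem:
            if _local(sub.tag) == "subfield" and sub.get("code") == code and sub.text:
                values.append(sub.text.strip())
    return values


def catalog_ids(xml_text):
    """Collect the identifiers that could link a book to catalog data."""
    return {
        "lc_call_number": marc_subfields(xml_text, "050", "a"),
        "lccn": marc_subfields(xml_text, "010", "a"),
        "oclc": [
            value
            for value in marc_subfields(xml_text, "035", "a")
            if value.startswith("(OCoLC)")
        ],
    }


# A made-up record for illustration only; the values are not from a real book.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="010" ind1=" " ind2=" "><subfield code="a">  12345678 </subfield></datafield>
  <datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)1234567</subfield></datafield>
  <datafield tag="050" ind1="0" ind2="0"><subfield code="a">QB501</subfield><subfield code="b">.P96</subfield></datafield>
</record>"""

print(catalog_ids(SAMPLE))
```

Running this over every `_marc.xml` file would also give a real count of how many books have each kind of identifier, in place of my 10-40% guess.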

What I'd _really_ like is for the Internet Archive's own cataloging database to move along, but it doesn't seem to have caught up with their scanned books yet in most cases. Maybe I just picked the wrong books to test it on, though.