Posts with tag Resources
I’m changing several things about my data, so I’m going to describe my system again in case anyone is interested, and so I have a page to link to in the future.
Jamie’s been asking for some thoughts on what it takes to do this–statistics backgrounds, etc. I should say that I’m doing this, for the most part, the hard way, because 1) my database is too large to start out using most tools I know of, including, I think, the R text-mining package, and 2) I want to understand how it works better. I don’t think I’m going to do the software review thing here, but there are what look like a _lot_ of promising leads at an American Studies blog.
A collection as large as the Internet Archive’s OCR database means I have to think through what I want well in advance of doing it. I’m only using a small subset of their 900,000 Google-scanned books, but that’s still 16 gigabytes–it takes a couple hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I’ll still have to figure out what sort of relations are worth looking for. So what are some?
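For anyone wondering what a baseline count like the one mentioned above involves, here is a minimal sketch in Python. It is not the code behind this project: the function names, the lowercase-letters-only tokenizer, and the idea of feeding it an iterable of lines are all assumptions made for illustration. The one point it is meant to show is that the text can be streamed line by line, so only the running tallies need to fit in memory, not the 16 gigabytes of text.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: a "word" is a run of ASCII letters after lowercasing.
# Real OCR text would likely need more careful tokenization.
WORD_RE = re.compile(r"[a-z]+")


def count_words(lines: Iterable[str]) -> Counter:
    """Tally word frequencies from any iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def top_words(lines: Iterable[str], n: int = 200_000) -> List[Tuple[str, int]]:
    """Return the n most frequent words with their counts, most common first."""
    return count_words(lines).most_common(n)
```

Since an open file is itself an iterable of lines, something like `top_words(open(path, encoding="utf-8", errors="replace"))` would stream a single file (with `path` being whatever file you have on hand); several files could be chained together with `itertools.chain`.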