
Infrastructure

Nov 14 2010

It's time for another bookkeeping post. Read below if you want to know about the changes I'm making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton's supercomputer time, and why I didn't just create a text scatter chart for "evolution" like the one I made for "scientific method." This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.

Currently, I'm using perl for data preprocessing and wordcounts, and R for exploratory analysis. I'm adding MySQL soon (see below). Mostly, that's because those are the only two scripting languages I know, but they're fairly well suited for this sort of work. And they're free. Perl is great at quickly and efficiently reading through large batches of text files, so I use it first to clean up the punctuation in those files and (attempt to) divide them into sentences, and then to create flat text files with various sorts of wordcounts. It's also not bad for other sorts of preprocessing: I use it, for example, to identify usable volumes in the search results I get from the Internet Archive (they let you download thousands of entries at a time), and then to batch-download the text files from their site.
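To give a flavor of that step, here's a rough sketch of the sort of perl I mean; the directory layout, the year-in-filename convention, and the cleanup regex are stand-ins, not my actual script.

#!/usr/bin/perl
# Rough sketch: read plain-text volumes, strip punctuation, and write
# a flat file of per-year wordcounts. All names here are placeholders.
use strict;
use warnings;

my %counts;    # year -> word -> count
for my $file (glob 'texts/*.txt') {
    my ($year) = $file =~ /(\d{4})/;    # assume a year somewhere in the filename
    next unless defined $year;
    open my $fh, '<', $file or die "can't open $file: $!";
    while (my $line = <$fh>) {
        $line = lc $line;
        $line =~ s/[^a-z\s]/ /g;        # crude punctuation cleanup
        $counts{$year}{$_}++ for split ' ', $line;
    }
    close $fh;
}

# One line per (year, word, count): easy to check in on with a text editor.
open my $out, '>', 'wordcounts.txt' or die "can't write: $!";
for my $year (sort keys %counts) {
    print $out "$year\t$_\t$counts{$year}{$_}\n" for sort keys %{ $counts{$year} };
}
close $out;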

R is a program I actually like using: it implements all sorts of statistical functions very transparently and has lots of nice options for graphical output. It takes a lot more time to get a chart looking right than it does in Excel, but once you do, it's easy to create a script that will produce the same chart, with user-defined variations, for anything else. These wordcount charts wouldn't be worthwhile at all if I couldn't just feed a list of words into R and have it spit out a nicely formatted comparison of all of them.

The problem that R and perl share is that they both like to store all the data they're working on in RAM. That's a problem because I'm working with quite large stores of data. I have a file that lists each of my 200,000 words and how many times each appears in each year; that's 250 MB now. If I wanted to segment it any more (by genre, say, or originating library), the file would be too large to load into R. (It might not be too large on the supercomputer, but I'd rather find an approach that lets me keep this on my laptop.)

Anyone who knows about computers is probably wondering why I haven't just started using a database already. Partly, it's because I'm worried it's going to be slower for exploration than R is. Mostly, it's because I'm scared of storing data in any file format that doesn't give me text files I can check in on. But it's clear this has to happen before I can make any progress on the genre issues I'm worried about. It will have a lot of ancillary benefits too. The current way I calculate how many books a word appears in is hopelessly baroque, involving reading through a number of files twice and some really ugly code. It also doesn't leave any easy way, on disk, to find co-occurrences of two words in a book, which is something that would, to put it mildly, be nice to have: that's what Henry asked for above for "evolution," and it's bad that getting an answer currently means running a perl script that takes an hour. The current system also doesn't let me see whether books that use the phrase "natural selection" use the word "species," say, more often than books that don't; it only lets me see whether they use the word "species" at least once, and then stops counting. There are a lot of other little reasons like this why a database will make more interesting analysis possible.
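Just to make that concrete, here is the kind of question that becomes a single query once the counts live in MySQL. Everything here is hypothetical: it assumes a wordcounts table with (bookid, word, count) columns like the one sketched below, and it uses the single word "darwin" rather than the two-word phrase, since a table of single-word counts can't see phrases.

use strict;
use warnings;
use DBI;

# Hypothetical connection details and table layout, not my actual schema.
# The question: in books that contain "darwin" at all, how often does
# "species" appear on average?
my $dbh = DBI->connect('dbi:mysql:database=texts', 'user', 'password',
                       { RaiseError => 1 });

my ($books, $avg) = $dbh->selectrow_array(q{
    SELECT COUNT(*), AVG(COALESCE(s.count, 0))
    FROM (SELECT DISTINCT bookid FROM wordcounts WHERE word = 'darwin') AS d
    LEFT JOIN wordcounts AS s ON s.bookid = d.bookid AND s.word = 'species'
});
print "$books books contain 'darwin'; mean 'species' count: $avg\n";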

The big reason, though, is that no longer being limited by RAM means, hopefully, that I can switch from storing wordcounts by year (16 million entries) to storing them by book (probably a couple hundred million entries). Each book entry can have lots of data tied to it, and I can extract that in different ways. The next step, which would be great for syntactic analysis in particular, would be having a database entry for every line in a book (yet more entries), and maybe storing the books themselves, sentence by sentence, in there. That would allow neat things such as actually displaying the sentences that, say, use "Lamarck" and "Darwin" together in the course of exploring the data. But that would take a lot of space, and I'm already using up about 35 GB of hard drive space for this. I want to make sure it works first.
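If it helps to picture it, the per-book tables I have in mind might look something like this, sketched as the statements a perl loader would issue; the names and column types are illustrative, not settled.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=texts', 'user', 'password',
                       { RaiseError => 1 });

# One row per (book, word) pair: a few hundred million rows at full scale.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS wordcounts (
        bookid INT UNSIGNED NOT NULL,
        word   VARCHAR(100) NOT NULL,
        count  INT UNSIGNED NOT NULL,
        PRIMARY KEY (bookid, word)
    )
});

# Catalog metadata, so counts can be sliced by year, genre, library, etc.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS catalog (
        bookid  INT UNSIGNED NOT NULL PRIMARY KEY,
        year    SMALLINT,
        genre   VARCHAR(50),
        library VARCHAR(100)
    )
});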

So that's been my Saturday project. Hopefully I'll get it done by the end of the day: I've got the perl script putting data into a MySQL table, but I still have to get the catalog in so I can make selections by year, genre, etc.
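For what it's worth, the loading step itself isn't much code; something along these lines, though the file name and table are again placeholders, and a load this size would probably be faster with MySQL's LOAD DATA INFILE.

use strict;
use warnings;
use DBI;

# Read a flat file of (bookid, word, count) lines and push it into the
# table through a prepared insert. Placeholders throughout.
my $dbh = DBI->connect('dbi:mysql:database=texts', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 0 });
my $sth = $dbh->prepare('INSERT INTO wordcounts (bookid, word, count) VALUES (?, ?, ?)');

open my $fh, '<', 'book_wordcounts.txt' or die "can't open: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($bookid, $word, $count) = split /\t/, $line;
    $sth->execute($bookid, $word, $count);
}
close $fh;
$dbh->commit;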