You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com.

Feb 20 2011

I wanted to see how well the vector space model of documents I've been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you're sick of the charts.] While the separation looks pretty nice at the genre level, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab's Pamphlet One, made me suspect individual books would be sloppier. There are a couple of different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. For example, we can make all the books in LCC subclass BF (psychology) blue and all those in QE (geology) red, overlaying them on a chart of the first two principal components like the ones I've been using for the last two posts.
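If you want to try this kind of overlay yourself, here is a minimal sketch in Python with NumPy. It is not the code behind the charts in this post, just one way the idea could be set up: fit the principal components on genre-level word-frequency vectors, then project individual books into that same space and color them by LCC subclass. The function names (`fit_pca`, `project`), the random toy data, and the matrix sizes are all assumptions for illustration.

```python
import numpy as np


def fit_pca(genre_matrix, n_components=2):
    """Fit PCA on a matrix whose rows are genres and columns are word frequencies.

    Returns the column means and the top principal axes (one per row).
    """
    mean = genre_matrix.mean(axis=0)
    centered = genre_matrix - mean
    # The rows of vt are the principal axes, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(vectors, mean, components):
    """Place new vectors in the space defined by an existing PCA fit."""
    return (vectors - mean) @ components.T


# Toy stand-in data (hypothetical): 10 genres and 6 books over 50 words.
rng = np.random.default_rng(0)
n_words = 50
genres = rng.random((10, n_words))
books = rng.random((6, n_words))
book_classes = ["BF", "BF", "BF", "QE", "QE", "QE"]

# The space is defined by the genres alone; the books are only dropped into it.
mean, components = fit_pca(genres)
genre_xy = project(genres, mean, components)
book_xy = project(books, mean, components)

# Blue for psychology (BF), red for geology (QE), as in the chart described above.
colors = ["blue" if c == "BF" else "red" for c in book_classes]
```

The point of the design is that the books never influence the axes: the genres establish the space, and `book_xy` just records where each book lands in it. From there, the two columns of `book_xy` and the `colors` list can be handed to any scatterplot function to draw the books over the genre points.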