You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Posts with tag pca


May 10 2011

Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn't happen evenly across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like "outside" more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.

Feb 22 2011

Here's an animation of the PCA numbers I've been exploring this last week.

Feb 20 2011

I wanted to see how well the vector space model of documents I've been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you're sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab's Pamphlet One, made me suspect individual books would be sloppier. There are a couple of different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. For example, we can make all the books in LCC subclass BF (psychology) blue and use red for QE (geology), overlaying them on a chart of the first two principal components like the ones I've been using for the last two posts:
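The overlay idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind the original charts: the frequency tables are made up, the three genre rows are hypothetical, and `fit_pca` and `project` are names introduced here. The components are fit on genre-level word frequencies, and individual books are then projected into that same space and colored by LCC subclass.

```python
import numpy as np

def fit_pca(genre_freqs, n_components=2):
    """Fit principal components on genre-level word frequencies."""
    genre_freqs = np.asarray(genre_freqs, dtype=float)
    mean = genre_freqs.mean(axis=0)
    # Rows of vt are the principal directions in word space.
    _, _, vt = np.linalg.svd(genre_freqs - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(freqs, mean, components):
    """Place rows of word frequencies in an already-fitted component space."""
    return (np.asarray(freqs, dtype=float) - mean) @ components.T

# Made-up relative frequencies for four words in three hypothetical genres.
genre_freqs = np.array([
    [0.40, 0.10, 0.30, 0.20],  # hypothetical genre A
    [0.10, 0.45, 0.15, 0.30],  # hypothetical genre B
    [0.25, 0.25, 0.25, 0.25],  # hypothetical genre C
])
mean, components = fit_pca(genre_freqs)
genre_points = project(genre_freqs, mean, components)

# Two made-up books, labeled with the LCC subclasses named in the post.
book_freqs = np.array([
    [0.38, 0.12, 0.28, 0.22],
    [0.12, 0.40, 0.18, 0.30],
])
book_labels = ["BF", "QE"]
book_points = project(book_freqs, mean, components)

# Color scheme from the post: BF (psychology) blue, QE (geology) red.
colors = {"BF": "blue", "QE": "red"}
book_colors = [colors[label] for label in book_labels]
```

The key design choice is that the books never influence the components: they are dropped onto axes defined by the genres alone, so how far they scatter from their genre's point is a measure of how sloppy the individual-book picture is.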

Feb 17 2011

I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here's an improved (using all my data on the 10,000 most common words) version of that plot:

Feb 14 2011

One of the most important services a computer can provide for us is a different way of reading. It's fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.

Dec 23 2010

Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I'm going to try again. This post is largely a test of whether I can explain principal components analysis to people who don't know about it, so: correct me if you already understand PCA, and let me know what's unclear if you don't. (Or, it goes without saying, skip it.)
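To make "dimensions for each book" concrete, here is a minimal sketch, assuming each book is a row of word counts with one column (one dimension) per word. The counts are invented toy data and `pca_scores` is a name introduced here for illustration; it computes PCA through a singular value decomposition, which is one standard way to do it and not necessarily how the original analysis was run.

```python
import numpy as np

def pca_scores(counts, n_components=2):
    """Project rows of a book-by-word count matrix onto principal components."""
    counts = np.asarray(counts, dtype=float)
    # Turn raw counts into per-book relative frequencies so book length drops out.
    freqs = counts / counts.sum(axis=1, keepdims=True)
    # Center each word's column: PCA describes variation around the average book.
    centered = freqs - freqs.mean(axis=0)
    # Rows of vt are the principal directions in word space.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    # Share of total variance captured by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, components, explained[:n_components]

# Toy, made-up counts: 4 "books" (rows) by 5 "words" (columns).
toy_counts = np.array([
    [30, 2, 10, 1, 7],
    [28, 3, 12, 0, 7],
    [5, 25, 3, 15, 2],
    [4, 27, 2, 14, 3],
])
scores, components, explained = pca_scores(toy_counts)
```

Each book starts as a point in five-dimensional word space and ends up as a pair of coordinates in `scores`. With a real vocabulary the starting space has as many dimensions as there are distinct words, which is where the hundreds of thousands come from.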

Dec 08 2010

Let me get ahead of myself a little.