You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

PCA on years

Feb 17 2011

I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, heres an improved (using all my data on the 10,000 most common words) version of that plot:

I have a professional interest in shifts in genres. But this isnt temporalits just a static depiction of genres that presumably waxed and waned over time. What can we do to make it historical?

Well, the first thing to remember is that, just as a genre can be a single text for the purposes of pca (or any other analysis), so can a year. We can drop onto that same chart all of the years from 1822 to 1922, say, shading from yellow for the early years to red for the later ones:

Now, this loading of pca dimensions onto each other is probably not kosher, and Im pretty sure that only the directionality is for real thereie, 1912 probably would be closer to the center of the graph if this were done perfectly. And its a bit of a mess.

But rather than make these two different sets live together, something Ill get into more in a later post, we can run a separate pca analysis on just the years. Remember, pca finds the set of coordinates that discriminate most strongly between the inputs you give it. So here, the first principal component is a set of weightings for different that distinguish the most strongly between years. On one side are those that use words like the following a lot:

> paste(names(sort(comps$rotation[,1])[1:20]),collapse=,)
[1] justly, circumstances, occasion, which, prudent, acknowledged, notwithstanding, disposed, rendered, commencement, circumstance, mode, pursued, excite, attended, whole, former, their, bestowed, effectually

And on the other are years that use words like the following:

> paste(names(sort(comps$rotation[,1],decreasing=T)[1:20]),collapse=,)
[1] outside, helped, background, ignored, needed, appreciation, started, back, anything, across, significance, anywhere, start, recognition, worked, out, confronted, largely, based, help

If you click on the links to ngrams, its clear what these arewords that change steadily in their use over time, and are therefore good at telling years apart. Heres a chart of every years vocabulary scored by the first principal component:


There are a few aberrationspca seems to think 1871 looks like it belongs in the 1890s, which is oddbut generally its quite good for a system that had no idea what order the years should go in.

[Edit: Playing around with the animated version, which should go up soon, it becomes clear why 1871 is such an outlier: it has a one-year explosion of magazines in the sample, which tend to use more modern language for a variety of reasons].

So thats the first principal component. Read no more if you only care about the useful. The later ones are stranger: where stuff either gets useless, or really interesting, or both. The thing to remember about PCA is that it finds perpendicular dimensions in the data so that there is zero correlation between two components. So if the first principal component finds terms that move up or down across the year horizon, subsequent ones will find patterns that have no overall temporal trend. In practice, that means the second component will be words that have either peak or trough towards the middle, and are higher on the ends. Heres a link to them on ngrams. (Remember, my data is different than ngrams, so it the patterns probably stand out better if I plot them myself. But this is easy, and keeps me honest).

I think its possible somehow, somewhere those numbers could be usefulyou could tweak the endpoints to find temporary peaks in any given period, or troughs. Not as sophisticated as trend line fitting like they used for the suppression indices in the Science paper, but possibly much faster, which is one of the great strengths of eigenvectors, I think. Anyway, thats not really my department.

Later components lead nowhere good. On a sufficiently large dataset, I thinkunless Im missing somethingthe principals components from number 2 on would probably come to find words whose patterns mirror the harmonics of a string. Now, Ive been saying for a while that digital tools are good for poststructuralists. But in the spirit of the big tent that seems to be the DH issue of the day, let me point that heres a case where computers would be good for Pythagorean mystics or modern-day Spenglerians trying to eke out a living on the margins of the academy. Someone should run some tests on the ngrams sets to see if cyclical patterns occur more often than they should by chance: if so, Scienza Nuova, here we come!

Of course, I think thats all completely ridiculous. So let me extricate myself from this whole line of thinking by drawing the moral that we have to keep a pretty firm hand on the tiller as we go fishing for hypotheses, or else we can end up elevating tools and techniques over method. That said, Im going to bank that first axis, that shows steady linguistic change over time, for later use, because I think it does show something useful.

Comments:

Ben: Last comment for today, and now I think I

Hank - Feb 6, 2011

Ben: Last comment for today, and now I think Im up to date on what youve been thinking. I like the distinction between method and tools - its an important one for a whole host of reasons. One:

We readily sense that tools get used, but, personally, I think we too readily do the same for methods. These latter seem somehow larger to me - do they to you? - and so coming up with a better way of talking about how theyre developed and why might:

  1. help us scramble out of a vocabulary of man-the-tool-user in the history of ideas and (B) situate our own use of digital tools in a wider context of developments in the discipline or in the humanities as a whole. Make any sense?

Possibly something useful? Topics over Time http:

Allen Riddell - Feb 6, 2011

Possibly something useful?

Topics over Time http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.152.2460

Dynamic Topic Models http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.62.2783

@Hank Its probably not worth concretizing, b

Ben - Feb 6, 2011

@Hank

Its probably not worth concretizing, but I sort of want methods to be approaches, big-picture stuff, which tools can help instantiate. Close to a strategy/tactics distinction, I guess. So yeah: I think that certain digital tools can be, if well controlled, relevant to wider developments in the field as a whole.

@Allen

Yeah, Ive been wondering about topic modelling for a while, but just holding back from the overhead of actually learning/doing it in depth. Partly because there seem to be all these subtly different approaches - CTM, tLDA, that I dont feel qualified to discriminate among, and partly because I know it will be a while before I could get as good an understanding of any probabilistic model as I get intuitively of the vector-space models Im using here, even though I do believe theyre probably better. Its a good point this might be the place to dive in: these genre groupings would be a lot smaller to start, which is probably a good thing. Ive been wondering a lot about the interpretibility of topic model for exploration, not just classification, which is what it seems to be built for theres definitely something there.

Somehow David Blei, who seems to be Mr. Topic Models, has been enticed to come by the history department this week to talk about electronic archives; depending on what he says that might really start me off in a new direction.

Have you used them yourself at all? I havent seen any good examples of digital humanists using topic models for anything but smaller, fun projects.

I feel like PCA is pretty opaque and the results a

Allen Riddell - Feb 6, 2011

I feel like PCA is pretty opaque and the results are hard to interpret for text. Topic modeling seems a step better and you can target it to specific hypotheses easier, IMHO. It does require a bunch more computing power. Theres a nice R package, topicmodels that you might want to check out.

Here are three examples:

  1. Bleis topic modeling of Science magazine

  2. David Halls intellectual history/topic modeling of the linguistics journals Studying the History of Ideas Using Topic Models

  3. Cameron Blevins topic modeling 27 years of 18th century diary entries http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/

Allen, Thanks, that Hall paper is great and I wis

Ben - Feb 0, 2011

Allen,

Thanks, that Hall paper is great and I wish Id seen it earlier. And Id seen Blevins stuff in passing, but he goes more in depth than I thought at first. I just taught the Ulrich book last week, and its neat how some of the charts she spent a while creatingthe decline of Ballards midwifery practice, sayjust pop out.

I know topic modeling is the future, but Ive just got an affection for vector models that I can actually intuittopic modeling seems a bit too much like magic, and Id have to write an implementation to dispel that. Also, as you say, it takes more power: I suspect that some of the established models might choke my computer against 20GB of data. Have you used the MALLET package? Ive heard thats better than the various R packages for some reason.

Ben, Ive never looked at the Ulrich book. I&

Allen Riddell - Feb 1, 2011

Ben,

Ive never looked at the Ulrich book. Im going to go request it now. Thanks for the reminder.

I think I get your point about being comfortable with the vector model. What Im a little confused about is why PCA is any less removed from the word frequencies than the topic model is. Personally, I think its easier to explain Bayes rule and probability distributions than to explain an orthogonal projection matrix. But maybe youve had some experience explaining PCA to folks

Allen, I definitely dont want to dig myself

Ben - Feb 1, 2011

Allen,

I definitely dont want to dig myself in behind PCA, because I certainly havent found it that easy to explain, particularly past the first component. (That one is pretty easy, actually, which is why I like having this year rotation data on hand). And once you do explain it, you have to spend a lot of time issuing caveats, because the intuitive explanations of the axes are always going to be correlated with each other. I do think spatial transformations stay a little closer to the original data in some ways (although thats more true for a technique like cosine similarity than what Im doing, which requires a lot of scaling and normalizing).

I guess my intuition is that topic modeling is, as it says, modeling, while PCA is just rearranging. When I say it like that, I think thats probably an incorrect view; my only question is whether its an idiosyncratically incorrect view, or one that other humanists might share. In which case, PCA might be able to find a niche.

Youve convinced me that I need to try PCA, at

Anonymous - Feb 2, 2011

Youve convinced me that I need to try PCA, at least as a exploratory technique.

@Ted: You almost make it sound like a hallucinogen

Ben - Feb 2, 2011

@Ted: You almost make it sound like a hallucinogenic. Go for it, man: expand your mind, see the higher dimensions.