You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Genre similarities

Dec 16 2011

When data exploration produces Christmas-themed charts, that's a sign it's time to post again. So here's a chart and a problem.

First, the problem. One of the things I like about the posts I did on author age and vocabulary change in the spring is that they have two nice dimensions we can watch changes happening in. This captures the fact that language as a whole doesn't just up and change: things happen among particular groups of people, and the change that results has shape not just in time (it grows, it shrinks) but across those other dimensions as well.

There's nothing fundamental about author age for this; in fact, I think it probably captures what, at least at first, I would have thought were the least interesting types of vocabulary change. But author age has two nice characteristics.

  1. It's straightforwardly linear, and so can be set against publication year cleanly.

  2. Librarians have been keeping track of it, pretty much accidentally, by noting the birth year of every book's author.

Neither of these attributes is that remarkable; but the combination is.

There are plenty of linear variables out there: I'd love to be able to see how vocabulary changes lie in time by linear variables like author income, years of schooling, or annual sales figures for books; but no one has been collecting that data. The stuff that has been collected, on the other hand, is essentially categorical: a book can be fiction, published in Philadelphia, about set theory, in English. Nobody keeps track of any of these as linear variables, though they could (it just barely mentions set theory, it has a lot of French words, etc.).

The trick is to make this categorical data more ordinal. Given something reasonably good at turning publication location into real-life places, for instance, you could turn geographical data into latitude-longitude pairs, or into any number of mildly interesting one-dimensional series. (Maybe the adoption of some vocabulary can be modeled well by miles from Muncie, or by city population at date of publication.)

But book data just isn't strongly geographical enough to make those sorts of comparisons worth coding. (Newspaper data, on the other hand...) And I'm particularly interested in genre. What I'd really like is some way to make genre information univariate. One way to do this is to create new ordinal genre information through principal components analysis or something. But that doesn't use metadata, just the text, which seems somewhat wasteful. The best genre information we have is probably LC classification numbers; and they are frustratingly almost-ordinal. Q-R-S-T is all science-math-technology type stuff; D-E-F is history leading into the social sciences in G-H-K; and so on. But there's not really a continuous scale from A to Z. Right?

I wanted to get a quick-ish handle on this, and just how similar or dissimilar the various LC classes are, and how that maps to the order they're shelved in.

This is where the chart comes in. The easiest way to compare genres seemed to be comparing their word usage using cosine similarity. (To keep the data size manageable, I actually compared only words preceding the word "are". Good enough, hopefully; it shouldn't seriously compromise the data, but does mean that the variations are mostly about noun usage, not word usage in general.)
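As a rough illustration of the comparison described above, here is a minimal sketch of cosine similarity between per-genre word counts. The counts are invented for the example, and the function and variable names (`cosine_similarity`, `genre_counts`, `similarity`) are hypothetical; this is not the code behind the chart.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty genre is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Invented counts of words appearing immediately before "are" in each class.
genre_counts = {
    "QH": Counter({"species": 40, "plants": 30, "animals": 30}),
    "QL": Counter({"species": 35, "animals": 45, "birds": 20}),
    "PZ": Counter({"you": 60, "we": 30, "they": 10}),
}

# Full genre-by-genre grid; symmetric, with 1.0 on the diagonal.
similarity = {
    g: {h: cosine_similarity(genre_counts[g], genre_counts[h]) for h in genre_counts}
    for g in genre_counts
}
```

Because the measure depends only on the direction of each count vector, a class with many books and a class with few can still score as similar if their words occur in the same proportions.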

1 is perfect similarity, and anything below about .85 is not very close; I've lumped those together. Every point is colored to show the similarity score of the genre immediately to the left against the genre below. Green is very similar, white is averagely similar, red is not very similar. You'll see a green line running through the middle; that's because every genre is identical to itself. The chart, in accidentally Christmas colors (click on the chart to enlarge, and to Wikipedia for a refresher on LC classifications):

This is not one of those charts where the meaning just jumps out. But a few notes:

  1. There are roughly three big groupings that are relatively coherent: the social sciences and humanities, let's call them, A to PN; fiction, PQ-PZ; and the sciences, Q to Z. These map on to the LC classification scheme relatively well, so it's not a completely arbitrary mapping.

  2. Some genres are mostly red, meaning they're entirely sui generis. Most notable is fiction, PZ, which is also the largest one in the collection, and the other P-categories; QA, math; and TK, electrical engineering.

  3. Some genres have green bands running all the way up and down. (Or left and right, since the chart's symmetric.) Q, R, T, F, and G are like this; notably, those are all general classes. So the books classed as general science or general technology actually do have some lack of specificity, either individually or when averaged out, that makes them closer to random other books. That's sort of interesting.

Still, it doesn't exactly look like the genres are placed in the best of all possible orders. AE (encyclopedias) looks more like science than like its nearest neighbors in the B category, psychology-philosophy-religion (although that category, which has always felt too much like a grab-bag to me, actually coheres very nicely in a sea of green). The early Ps, which are world literature and literary studies, look more like world history (the Ds) than they do like the bulk of fiction in PR, PS, and PZ. And so on.

So, can we create a single best linear ordering? No, not really. The data has too many dimensions for that. It would be like trying to create a single ordering of the cities in North America from the distance grid in the corner of a AAA map. You could run a spectrum from San Diego to St. John's, Newfoundland, or from Vancouver to Miami; either would make sense, but neither would work, because the data is fundamentally two-dimensional. (I actually just tried this using lat-long coordinates; principal components analysis runs a spectrum from Providence, RI to Eugene, Oregon, with Bangor and Vancouver ending up somewhere in the middle.) Here, the data has many more than 2 dimensions, which makes a single useful ordering all the less likely.
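To make the map analogy concrete, here is a small sketch that projects two-dimensional points onto their first principal component. It uses four made-up "cities" on a wide rectangle instead of real coordinates, and the names (`first_principal_axis`, `cities`, `position`) are hypothetical. The closed-form angle is a shortcut that only works in two dimensions; it is an assumption of this sketch, not the procedure used in the post.

```python
import math


def first_principal_axis(points):
    """Return the first principal axis of 2-D points and each point's score on it."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n
    # For a 2x2 covariance matrix, the major axis has this closed-form angle.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    axis = (math.cos(theta), math.sin(theta))
    scores = [x * axis[0] + y * axis[1] for x, y in centered]
    return axis, scores


# Made-up cities: the region is much wider (east-west) than it is tall.
cities = {"SW": (0.0, 0.0), "SE": (10.0, 0.0), "NW": (0.0, 2.0), "NE": (10.0, 2.0)}
axis, scores = first_principal_axis(list(cities.values()))
position = dict(zip(cities, scores))
```

In this toy case the single axis separates east from west, but "SW" and "NW" land on exactly the same spot: the one-dimensional ordering has nowhere to put the north-south difference.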

What we can do, though, is create any number of somewhat useful orderings; to extend the analogy, the best ranking of the cities in North America for me is going to be their distance from Somerville, MA. So we can rearrange this chart by showing the distance of various genres from QH, natural history:

Reading down from the left, QH is identical to itself; next closest is Q (general science), then QL (zoology), QP (physiology), and so on. On the face of it, this doesn't look much better, or much worse, than the original one. We still get some nice groupings, but outside of a few helpful rearrangements close to QH (anthropology is like natural history!) it's more arbitrary than the original LC ordering, and certainly not as good as the hierarchy I built using textual data a while back.
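The reordering around a single reference class amounts to sorting one row of the similarity grid. A minimal sketch follows; the scores are invented, chosen only so the order matches the one described above, and the names `order_by_similarity` and `similarity_to_qh` are hypothetical.

```python
def order_by_similarity(similarity_row):
    """Sort class labels from most to least similar to the reference class."""
    return sorted(similarity_row, key=similarity_row.get, reverse=True)


# Invented cosine similarities of a few classes to QH (natural history).
similarity_to_qh = {"PZ": 0.80, "QL": 0.96, "QH": 1.0, "QP": 0.95, "Q": 0.97}

ordering = order_by_similarity(similarity_to_qh)
```

Any class could serve as the reference, and each choice gives a different, equally partial ordering of the rest.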

What's potentially interesting, though, about that sort of ordering is that it lets us look at how transmission moves, or doesn't, across those similarity lines. We know that Q is statically similar to S, and not so to PR; when language changes, how do those similarities affect the changes that happen?

So that's what's next.

Comments:

Jamie - Dec 3, 2011

Up to now I have been taking your interest in genre for granted, since it's obviously awesome, but now I want to know: does it interest you more for methodological reasons (which are obvious enough in your blog discussions), or is it also rooted in some historical question you've got (i.e., the popular appropriation of certain strands of psychological research)?

Ben - Dec 3, 2011

Yeah, it's definitely historical; I like, as do (I think) lots of people doing something vaguely intellectual-history-like, to talk about what psychology did and what sociology did as if they are historical agents or subjects in themselves. (Or at least as if their process of cohering to each other is real interesting in itself.) Just as a pure methodological point, I think the geographical stuff is probably more interesting, and I'd love to post more maps here. But yeah, I think I'm basically interested in how words/phrases can show how ideas or concepts or practices spread, and I'm simply a lot more interested in how (say) the term "natural selection" escaped from biology than I am in how Abraham Lincoln escaped from Illinois.

Although as for psychology, I'm inclining in the direction that popular appropriation of psychology is more a conventional story intellectual historians love to tell, because it lets them keep studying James and Dewey and justifies the effort they put into understanding them; when in fact the actual networks of transmission and distortion look quite different, centered as much on advertising or pedagogy, etc. I am hopeful this might be quantifiable on a big scale, though, yeah.