You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Searching for Correlations

Jan 10 2011

More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I'm thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I've been working with can help improve this sort of search. I'll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?

I've always liked this one, since it's one of those historiographical questions that still rattles through politics. The literature, if I remember generals properly (the big work is David Blight, but in the broad outline it comes out of the self-situations of Foner and McPherson, and originally really out of Du Bois), says that the war was viewed as deeply tied to slavery at the time: certainly by emancipation in 1863, and even before. But part of the process of sectional reconciliation after Reconstruction (ending in 1876), and even more into the beginning of Jim Crow (1890s-ish), was a gradual suppression of that truth in favor of a narrative about the war as a great national tragedy in which the North was an aggressor, and in which the South was defending states' rights but not necessarily slavery. The mainstream historiography has since swung back to slavery as the heart of the matter, but there are obviously plenty of people interested in defending the Lost Cause. Anyhow: let's try to get a demonstration of that. Here's a first chart:

 

How should we read this kind of chart? Well, it's not as definitive as I'd like, but there's a big peak the year after the war breaks out in 1861, and a massive plunge downwards right after the disputed Hayes-Tilden election of 1876. But the correlation is perhaps higher than the literature would suggest around 1900. And both the ends are suspicious. In the 1830s, what is a search for "civil war" picking up? And why is that dip in the 1910s so suspiciously aligned with the Great War? Luckily, we can do better than this.

[As a reminder, I'm using search terms here to combine words. It would be better, to be sure, if I just loaded in the bigram (digram?) for civil_war into my database, but I'm going to be lazy about that for now. This is like not putting quotes around "civil war" when you put it into a search field on an old website: not Google, which I'm sure has layers upon layers of optimization, but maybe JSTOR or LOC American Memory or something.]

My answer, naturally, is that we can improve it. Like I keep saying, we need to view the data we get out of computers as part of an iterative search. This is just like searching in any other database: we just have to make our terms better or our search engine better. Specifically, I'm going to do two things. The first is pretty basic: I'm going to stick "Lincoln" in with the search terms on the Civil War side. The second is on the technical side: I'm going to tweak the TF-IDF algorithm so it responds less readily to increases in only one of the two terms (like "war") by using the geometric instead of the arithmetic mean. That gives us the following chart (including the results for the original search with the new algorithm):

 

That's much better. Even on the original search, the algorithm clears out that original spike and some (not all) of the Great War noise. But adding Lincoln in really clears up the picture quite a bit: almost no correlation before 1860, a drop immediately after Reconstruction, and then a definite climb back up around 1890, sustained completely through to 1922.
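To make the arithmetic-versus-geometric-mean tweak concrete, here is a minimal sketch. It is not the actual code behind these charts: the function names and the per-term weights are made up for illustration, and it assumes each book already has a non-negative TF-IDF weight for every search term.

```python
import math

def arithmetic_score(term_weights):
    """Combine per-term TF-IDF weights by averaging them."""
    return sum(term_weights) / len(term_weights)

def geometric_score(term_weights):
    """Combine per-term TF-IDF weights with a geometric mean.

    Assumes non-negative weights. A book scores high only when
    every term is reasonably prominent; a zero on any term gives 0.
    """
    return math.prod(term_weights) ** (1.0 / len(term_weights))

# Hypothetical weights for the terms ("civil", "war") in two books.
both_terms = [4.0, 4.0]   # uses both words fairly heavily
war_only = [0.5, 8.0]     # mostly just talks about "war"

print(arithmetic_score(both_terms), arithmetic_score(war_only))
print(geometric_score(both_terms), geometric_score(war_only))
```

With these invented numbers the arithmetic mean ranks the "war"-heavy book slightly higher, while the geometric mean ranks it well below the book that uses both words. That is the behavior you want if a surge in "war" alone (say, in the 1910s) shouldn't register as Civil War talk.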

If individual words are so important, can we really trust that any of these results mean anything? That's a real problem. I'm missing some things like "War between the States" and so on; maybe they have completely different results. But: if we're smart about our search terms, we can check these things. What else defines the civil war besides its name and Lincoln's? Let's try putting the names of the belligerents:

That's fantastically similar, and goes a long way towards proving this really measures something about the discourse of the civil war, and not just random word fluctuations for "civil" and "war". (For the record, my first try was for "confederacy" and "army", which wasn't quite so close.) The biggest difference is that Lincoln's name isn't so closely tied to the other words during the war years themselves, and that there's a base correlation between the political words "union" and "confederacy" and "slavery". That shouldn't be too surprising.

So what is this showing? That the civil war remained more about slavery in American culture in the Jim Crow era than we might have thought? Certainly not anything so expansive. Let's talk about the limitations.

The most important is that I'm just looking at a few books: 4,000 have all the words "confederacy", "union", "civil", "war", and "lincoln", for example. We're talking about these books, not culture. (This is basically the standard ngrams disclaimer, even though they do have more books than me: 10x in this period, FTR.)

In particular, I've selected for large publishers, which means almost all my books are published in Boston, New York, or Philadelphia. For a topic like the Civil War, we'd definitely want to download a few more books published south or west of the Susquehanna before coming to any final conclusions.

The lull in slavery language in the late 1870s might also come, hypothetically, from a different type of book being written about the war: a rush of generals' memoirs or institutional histories that didn't talk about slavery at all. The best way to answer this would be a combination of searches, looking at books to see if patterns jump out, and some more charting. We can ask, for instance, how strongly books using "union", "confederacy", and "battle" correlate to books just using "union" and "confederacy" to get a sense of how important battles are to Civil War books. That does indeed go up after Reconstruction, but it stays up even as the language of slavery rises again, so I don't think there's a strong conclusion to be made.
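For readers who want to try this kind of check themselves, here is a rough sketch of one way to compute a year-by-year correlation between two searches. This is an assumption about the general shape of the calculation, not a description of the code behind these charts: it supposes each book has already been reduced to a year plus one score per search, and it uses a plain Pearson correlation across the books in each year. All names and numbers are hypothetical.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # no variation in one of the scores that year
    return cov / math.sqrt(var_x * var_y)

def correlation_by_year(books):
    """books: iterable of (year, score_a, score_b), one tuple per book."""
    by_year = defaultdict(list)
    for year, score_a, score_b in books:
        by_year[year].append((score_a, score_b))
    return {
        year: pearson([a for a, _ in pairs], [b for _, b in pairs])
        for year, pairs in sorted(by_year.items())
    }

# Made-up scores: search A might be "union confederacy battle",
# search B might be "union confederacy".
sample_books = [
    (1862, 1.0, 2.0), (1862, 2.0, 4.0), (1862, 3.0, 6.0),
    (1880, 1.0, 3.0), (1880, 2.0, 2.0), (1880, 3.0, 1.0),
]
yearly = correlation_by_year(sample_books)
print(yearly)
```

Plotting the resulting values against the year gives a chart of the same general kind as the ones above, which can then be compared across different pairs of searches.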

Finally, there's a more interesting problem. I can't search for negatives right now; that requires some sort of language processing. If a book from 1895 says "the civil war had nothing to do with slavery", I'm counting it as tying the Civil War to slavery.

That one is maybe unfortunate, but not as much as it appears at first blush. What I'm interested in are discourses around a topic: whether or not authors of books felt the need to address an issue, not necessarily which side they came down on. Opposites are always firmly situated within the same frame, and the need to deny something is an indication that the author thinks the reader might believe it. That's a sort of reading historians are accustomed to practicing, and if we're limited to it in certain spheres of digital history, it's not such a bad thing.

Next up: some charts to separate out different families of discourse.

Comments:

This is so cool.

Jamie - Jan 2, 2011
