You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Clustering from Search

Jan 11 2011

Because of my primitive search engine, I've been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don't get:

1) Numeric scores on results
2) The ability to go from a set of books to a set of high-scoring words, as well as (the normal direction) from a set of words to a set of high-scoring books.

We can start to do some really interesting stuff by feeding this information back in and out of the system. (Given unlimited memory, we could probably do it all even better with pure matrix manipulation, and I'm sure there are creative in-between solutions.) Let me give an example that will lead to ever-elaborating graphics.

An example: we can find the most distinguishing words for the 100 books that use "evolution" most frequently:

> books = names(sort(TFIDF.from.usage(evolution),decreasing=T))
working on item number 1 of 1 (evolution)
> bookTFIDF(books)[1:10]
   evolution    organisms     organism      organic      mammals  vertebrates    phenomena    sociology
0.0039030229 0.0014772785 0.0012406171 0.0010053423 0.0009848822 0.0009823578 0.0009781454 0.0009637798
         fig   intestinal
0.0008745183 0.0008426900

"Evolution" itself is the most common, but we get a bunch of other words in there; most have to do with biological evolution. "Intestinal" probably includes a lot of course-of-diseases descriptions, and sociology, of course, is the discourse that I'm most interested in. So that's interesting, but nothing new.
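To make the two directions of search concrete, here is a minimal sketch on an invented four-book corpus. The functions `books_for_word` and `words_for_books` are hypothetical stand-ins, not the `TFIDF.from.usage` and `bookTFIDF` functions used above (their internals aren't shown in this post), and the TF-IDF formula is just one common variant, assumed for illustration.

```r
# Invented toy counts: rows are books, columns are words.
counts <- matrix(
  c(12, 9, 0, 1, 30,
     8, 0, 7, 0, 25,
     0, 0, 1, 9, 40,
     0, 0, 0, 6, 35),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("bookA", "bookB", "bookC", "bookD"),
                  c("evolution", "organisms", "sociology", "sermon", "the"))
)

# Inverse document frequency: log(number of books / books using the word).
idf <- log(nrow(counts) / colSums(counts > 0))

# TF-IDF per book: each word's share of the book, weighted by idf.
scores <- sweep(counts / rowSums(counts), 2, idf, `*`)

# Direction 1 (the usual one): from a word to its highest-scoring books.
books_for_word <- function(word, n = 2) {
  names(sort(scores[, word], decreasing = TRUE))[seq_len(n)]
}

# Direction 2: from a set of books to their highest-scoring words.
# The books are pooled into one big "document" before scoring.
words_for_books <- function(books, n = 3) {
  pooled <- colSums(counts[books, , drop = FALSE])
  sort((pooled / sum(pooled)) * idf, decreasing = TRUE)[seq_len(n)]
}

top_books <- books_for_word("evolution")
top_words <- words_for_books(top_books)
print(top_books)
print(top_words)
```

A word like "the" scores zero in this sketch because every book uses it, which is the whole point of the IDF half of the weighting.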

What I think is kind of cool, though, is that we can tell the computer to go into those words. We can do a bunch of searches to see what words distinguish books that score highly on a search for "evolution" and "organisms," and what words distinguish books that score highly on a search for "evolution" and "society." (Note: I'm just arbitrarily using the top hundred books using a word as my cutoff to find interesting words here, but if I had more processing power, I'd like to try weighting the results on a scale related to their search score.)

A lot of those words will be the same. "Organism" and "organisms," for instance, have a lot of the same vocabulary when we do a co-search with "evolution":

> bookTFIDF(books[1:100])[1:10]
  evolution    organism   organisms   molecular   psychical     organic
0.002821304 0.002344120 0.001888864 0.001626369 0.001358292 0.001274280
  phenomena       cells integration environment
0.001142712 0.001058429 0.000937243 0.000915636

> bookTFIDF(books[1:100])[1:10]
    evolution     organisms      organism   integration       organic
  0.002881946   0.002109943   0.001685654   0.001314584   0.001276740
        cells     molecular heterogeneity     phenomena           fig
  0.001170736   0.001144163   0.001051732   0.001030783   0.001026932

We can use the information on co-occurrence to cluster words together into patterns of co-use. This is why I was so bold, Hank, as to call what I was doing looking at "discourses": these sorts of scores let us start to cluster words into general conversational groups.
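One way the co-search scores could be turned into something clusterable is sketched below, on invented counts. Everything here is an assumption for illustration: the `co_search_profile` helper is hypothetical, the "joint" score (the product of two term frequencies) is only one possible way to rank books on two words at once, and the cutoff of two books stands in for the top hundred.

```r
# Invented toy counts: rows are books, columns are words.
counts <- matrix(
  c(9, 7, 6, 0, 0, 0,
    8, 6, 7, 0, 1, 0,
    7, 0, 1, 6, 8, 0,
    9, 1, 0, 7, 6, 0,
    0, 0, 0, 1, 0, 9,
    0, 1, 0, 0, 1, 8),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("book", 1:6),
                  c("evolution", "organism", "organisms",
                    "social", "ethics", "sermon"))
)
tf  <- counts / rowSums(counts)
idf <- log(nrow(counts) / colSums(counts > 0))

# Profile for one candidate word: pool the books that score best on
# seed + word together, then score every word in that pooled set.
co_search_profile <- function(word, seed = "evolution", n_books = 2) {
  joint  <- tf[, seed] * tf[, word]   # high only if a book uses both words
  top    <- names(sort(joint, decreasing = TRUE))[seq_len(n_books)]
  pooled <- colSums(counts[top, , drop = FALSE])
  (pooled / sum(pooled)) * idf
}

candidates <- c("organism", "organisms", "social", "ethics")

# One row per candidate word, one column per vocabulary word.
profiles <- t(sapply(candidates, co_search_profile))

# Words that keep the same company get highly correlated profiles.
similarity <- cor(t(profiles))
print(round(similarity, 2))
```

A matrix shaped like `profiles` (one row per word) is the kind of input a clustering function expects. The `joint` scores could also serve as weights instead of a hard cutoff, which is roughly the weighting idea mentioned earlier, though that variant is not shown here.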

[In the language of text analysis, of course, I'm drifting towards not discourses, but a simple form of topic modelling. But I'm trying to only submerge myself slowly into that pool, because I don't know how well fully machine-categorized topics will help researchers who already know their fields. Generally, we're interested in heavily supervised models on locally chosen groups of texts. Plus, like Pierre Menard, I find it more fun to reinvent the wheel than to be the first one to invent the (what's something crappy and wheel-related?) steering-wheel desk.]

There are a lot of ways to approach the data, and I don't know the best one. Here are three. Don't think of these as static visualizations, which aren't exactly my thing, but as a process that we can redo for any word or words. When we don't know about a word or phrase, it can be helpful in rapidly immersing us in information about what discourses it occurs in and helping us do things like craft smarter searches. When we know a lot, there will always be some data that sticks out as odd; figuring out why it seems odd is a route to better understanding.

  1. We can ask for 7 clusters using k-means clustering (PowerPoint PDF, but it's the one I found useful), on about a hundred words highly tied to evolution. I've invented the labels, but the results themselves and the choice of the words to cluster all came out of TF-IDF scores.

> clusters = 7; kmeans(graph,clusters,20,nstart=10)$cluster -> dat; lapply(1:clusters, function (i) {names(dat[dat==i])})

Zoology and microbiology?

 [1] amphioxus     animals       apes          cell
 [5] cells         development   embryo        evolution
 [9] evolutionary  fishes        heredity      layer
[13] mammals       membrane      nutrition     ontogeny
[17] organisms     organs        physiological primitive
[21] protoplasm    reptiles      sexual        tertiary
[25] theory        tissue        vertebrate    vertebrates

Anatomy

[1] anterior  cavity    dorsal    fig       gastrula
[6] posterior ventral

Chemistry, math, etc.?

 [1] actions          aggregate        aggregates
 [4] chemical         differentiations equilibration
 [7] equilibrium      genesis          heterogeneity
[10] heterogeneous    hypothesis       integrated
[13] integration      molecular        molecules
[16] motion           motions          similarly
[19] units            vegetal

Philosophy, etc.

 [1] chromosomes comte       cosmic      fiske
 [5] fiskes      germlayers  intestinal  kant
 [9] obvgootlc   philosophy  polygyny    science
[13] scientific  spencer     tho         universe
[17] youmans

"obvgootlc," I'm sure, is an OCR misreading of the "Digitized by Google" watermarks on a bunch of scans. I've given up trying to root those out for the time being, since they don't corrupt the rest of the data that much.

Darwinism proper

[1] darwin     darwinism  darwins    geological huxley
[6] plants     selection  species    variation

Biological science?

 [1] axis            biology         complex
 [4] differentiation environment     homogeneous
 [7] inorganic       morphological   nervous
[10] organic         organism        phenomena
[13] processes       structures      tissues
[16] variations      vascular        yelk

Social Darwinism!

 [1] activities    altruism      altruistic    biological
 [5] consciousness economic      egoistic      ethical
 [9] ethics        factors       objective     psychical
[13] psychology    social        sociological  sociology
[17] subjective
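For anyone who wants to rerun the k-means step without my data, here is a minimal self-contained sketch. The small `graph` matrix is invented (three made-up profile dimensions for seven words), so the clusters it produces only illustrate the mechanics and say nothing about the real corpus.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Invented stand-in for `graph`: one row per word, one column per
# profile dimension.
graph <- rbind(
  darwin    = c(0.90, 0.10, 0.00),
  selection = c(0.80, 0.20, 0.10),
  species   = c(0.85, 0.15, 0.05),
  dorsal    = c(0.10, 0.90, 0.00),
  ventral   = c(0.05, 0.95, 0.10),
  altruism  = c(0.00, 0.10, 0.90),
  ethics    = c(0.10, 0.00, 0.85)
)

clusters <- 3
# After the data come the number of clusters, a cap of 20 iterations,
# and 10 random restarts (the best of which is kept).
dat <- kmeans(graph, clusters, iter.max = 20, nstart = 10)$cluster

# split() groups the word names by cluster id, much like the lapply()
# call shown earlier.
groups <- split(names(dat), dat)
print(groups)
```

The cluster ids themselves are arbitrary and can change between runs; only the groupings are meaningful, which is why the labels above had to be invented by hand.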

That's k-means clustering. We can also create hierarchical trees using Ward's clustering algorithm, also in R.

There are a lot more details to get out of a hierarchical chart like this. I think my favorite is that "Huxley" is closer to "Darwinism" than any other word, followed by "evolution." Remember, all these correlations are limited to books that use "evolution" AND the listed word, so a word like "theory" doesn't necessarily get to have all its other meanings. It's also interesting to see just how different the anatomical language is from everything else (the last five words on the chart). I also like seeing how Fiske, Kant, Spencer and Comte each appear amidst a slightly different cluster of words in the philosophical grouping near the top. If only the sample ran a little later, this is surely where Dewey would land.
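A hierarchical tree of this kind can be sketched in a few lines of base R. The `graph` matrix below is invented, and `"ward.D2"` is an assumption about which Ward variant to use: it is the name current versions of R (3.1.0 and later) give to Ward's criterion on ordinary distances, and may not match exactly what produced the chart discussed above.

```r
# Invented stand-in for the word profiles: one row per word.
graph <- rbind(
  darwin    = c(0.90, 0.10, 0.00),
  selection = c(0.80, 0.20, 0.10),
  species   = c(0.85, 0.15, 0.05),
  dorsal    = c(0.10, 0.90, 0.00),
  ventral   = c(0.05, 0.95, 0.10),
  altruism  = c(0.00, 0.10, 0.90),
  ethics    = c(0.10, 0.00, 0.85)
)

d    <- dist(graph)                    # Euclidean distance between profiles
tree <- hclust(d, method = "ward.D2")  # Ward's minimum-variance merging

# plot(tree) would draw the dendrogram; cutree() slices it into groups.
branches <- cutree(tree, k = 3)

# Nearest neighbour of each word, the kind of "which word is closest
# to which" reading taken from the chart.
dm <- as.matrix(d)
diag(dm) <- Inf                        # a word is not its own neighbour
nearest <- setNames(rownames(dm)[apply(dm, 1, which.min)], rownames(dm))
print(branches)
print(nearest)
```

Unlike k-means, nothing here depends on a random start, so the same input always gives the same tree.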

Finally, I wondered if there was some way to capture the strength of ties among words using some sort of force-directed placement algorithm like the ones usually used for social networks. Each word has different strengths of ties to other words. This doesn't totally work, but it's a little interesting nonetheless, so I might as well post it. It's also a slightly different set of words, as I've kept tweaking the selection mechanism. With enough work I think one could create some interesting visualizations. But right now, I see the good of this sort of work in helping perfect searches, and I'm not sure how deep the insights we get from this sort of view are.

But we shouldn't underestimate shallow insights: that's what computers are best at, and the best of us need them. Reminders of common words, the ability to skim through areas for words in a family, and so on can be very valuable in the places we might fall flat. (It took one of these, for example, to remind me that a lot of 19th century books that use the word "society" do so in the context of missionaries.)

Historians do a tremendous volume of searches, and we don't always remember just what it is we know. To stretch for an analogy: intermediate pianists should often play a scale in the key before striking out on a piece to reacquaint their fingers with the environment of the new key. Reviewing visualizations can serve something of the same purpose: reminding us what we know before we try to apply that knowledge, so we're less likely to make common mistakes.

Also, of course, it's just fun to see the words arrange themselves into networks that seem to instantiate some sort of semiotic web. There's social Darwinism in the upper left, geology on the right, and a cluster of words (species, sexual, cell) at the heart. Sometimes it's just good to confirm what we already know through a different path.
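The force-directed placement mentioned above can be sketched in base R as a bare-bones spring layout. This is not the code behind the network picture: the tie strengths are invented, `force_layout` is a hypothetical helper, and the force formulas loosely follow the Fruchterman-Reingold style (every pair of words repels, tied words attract in proportion to the tie), with a cooling schedule picked arbitrarily.

```r
# Invented tie strengths between words (symmetric, zero diagonal).
words <- c("darwin", "species", "selection", "ethics", "altruism")
ties <- matrix(0, 5, 5, dimnames = list(words, words))
ties["darwin", "species"]    <- 0.9
ties["darwin", "selection"]  <- 0.8
ties["species", "selection"] <- 0.7
ties["ethics", "altruism"]   <- 0.9
ties["selection", "ethics"]  <- 0.1
ties <- ties + t(ties)

force_layout <- function(w, iterations = 300, seed = 1) {
  set.seed(seed)
  n   <- nrow(w)
  pos <- matrix(runif(2 * n, -1, 1), ncol = 2,
                dimnames = list(rownames(w), c("x", "y")))
  k    <- 1 / sqrt(n)  # preferred spacing between words
  temp <- 0.1          # largest move allowed per step, cooled over time
  for (i in seq_len(iterations)) {
    dx <- outer(pos[, "x"], pos[, "x"], "-")  # dx[a, b] = x_a - x_b
    dy <- outer(pos[, "y"], pos[, "y"], "-")
    d  <- pmax(sqrt(dx^2 + dy^2), 1e-6)       # guard against dividing by 0
    # Net force per unit of offset: repulsion (k^2 / d^2) pushes apart,
    # tie-weighted attraction (w * d / k) pulls together.
    pull <- k^2 / d^2 - w * d / k
    fx   <- rowSums(pull * dx)
    fy   <- rowSums(pull * dy)
    len  <- pmax(sqrt(fx^2 + fy^2), 1e-9)
    step <- pmin(len, temp)                   # cap each move at `temp`
    pos[, "x"] <- pos[, "x"] + fx / len * step
    pos[, "y"] <- pos[, "y"] + fy / len * step
    temp <- temp * 0.99
  }
  pos
}

pos <- force_layout(ties)
# To draw it: plot(pos, type = "n"); text(pos, labels = rownames(pos))
print(round(pos, 2))
```

How readable the result is depends heavily on the tie strengths and the cooling schedule, which is part of why this sort of picture takes a lot of tweaking.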



Comments:

Hank - Jan 2, 2011

Good stuff, Ben. Are you purposefully riling up defenders of (more traditional definitions of) discourse by adopting the term instead of a different one?

I wonder because you're still going to get push-back from those who think the sorts of thinking you're doing here (which they'll *have* to admit is thinking) is at a level of remove from the sorts out of which historians make their livings.

Of course, this is something you agree with, more-or-less explicitly, in this post - by suggesting we use some of these tools as refreshers, or primers, or a means of getting up over before we dig down deep, you've again defined a terrain for these search tools that separates them off from those with which other (older) scholars are increasingly, haltingly, and then, all of a sudden, stubbornly familiar.

Which, come to think of it, might make the entree of all of this a bit easier - a dash of novelty, a gesture at familiarity. We just don't want it to become another Wordle.

Anonymous - Jan 2, 2011

I've enjoyed this post and several earlier ones.

The word "discourse" gets used differently in different subfields, but the way you're clustering texts here does feel to me rather similar to the way I use "discourse," in practice, as a literary historian.

I agree that it's useful to be reminded of what we already know, in part because we're not always really *sure* that we know it. But it's not hard to see how this approach could also lead to the discovery of patterns that we weren't in fact expecting.

You could apply a similar method to different kinds of initial seeds. For instance, I would be very interested in starting simply with a list of terms whose frequencies seem to peak (say) in a particular twenty-year period, as compared to the twenty years before and after. You'd get terms peaking for a whole range of different reasons. Then you might break that list up into discourses using something like the method you're outlining here. In doing that, I suspect we would discover some very predictable patterns connected to known events but also perhaps some surprising clusters of concepts that define a discourse specific to the period.

Anonymous - Jan 3, 2011

Your technique of clustering based on tf-idf weighting seems really promising to me.

I don't have tools yet that can do this properly, but I couldn't resist writing a post that speculates about other ways one could use the same technique.

http://tedunderwood.wordpress.com/2011/01/12/identifying-topics-with-a-specific-kind-of-historical-timeliness/

Anonymous - Jan 3, 2011

Btw, if I characterized the mechanics of your technique inaccurately, please let me know so I can correct it. I think I grasp the way you do this, but I might be getting something wrong.

Ben - Jan 4, 2011

@Hank - With "discourse," I'm definitely trying to go the comfort route more than the provocation one. Historians toss around "discourse" and "language" to mean all sorts of things, and my point (I know it's tendentious, too) is that the structuralist language of discourse and languages and communities is actually really amenable to computerized analysis. You're right that it's thinking a level removed from a lot of historical work, but the point I'm fixated on right now (i.e., the last month or so) is that that level has always existed in the form of finding and classifying relevant documents. But now there's a whole new infrastructure we need to understand that supplements relationships to librarians and archivists, whom historians have always known are our most important partners and whose trades we've always tried to understand.

@Ted - Thanks, I've enjoyed reading your blog, too. I'll try to pop by later and be a little clearer about a few of the corners that I've cut, but you're definitely helping me to think about a few things. I've been interested in timeline curve similarity since the start of this blog, but it's hard to compare all words in terms of peak period just for memory reasons, although I'm sure I could do better. That's mostly to say: thanks for pushing me back in a direction I've been meaning to go, and I hope I can figure out how to do it.

Hank - Jan 6, 2011

Ben -

Good stuff. Very quickly: I think you're right on about these tools being amenable to the structuralist language that revolves around discourse/languages/communities, and vice versa (as you state yourself).

This suggests to me that these tools can do more than pitch themselves as a supplement to librarians/archivists, or to a lot of historical work, or to the sorts of interpretation and fine-grained Renaissance-style humanism people fear is on the way out if we let computers run the show.

Instead, these sorts of tools *can* be pitched as part of a return to that structuralism that, while floating around (<see?) when we invoke discourse or language, gets left out of analysis in favor of individual agency and choice.

Does that sound right?