You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Wordcounts in starting research--what do we have now?

Nov 12 2010

All right, let's put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term "scientific method." I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays that we can use. This post is a sort of catalog of some of the low-hanging fruit in text analysis.

The basic theory I'm working on here is that textual analysis isn't necessarily about answering research questions. (It's not always so good at doing that.) It can also help us channel our thinking into different directions. That's why I like to use charts and random samples rather than lists: they can help us come up with unexpected ideas, and help us make associations that wouldn't come naturally. Essentially, it's a different form of reading: just as we can get different sorts of ideas from looking at visual evidence vs. textual evidence, so can we get yet other ideas by reading quantitative evidence. The last chart in the post is good for that, I think. But first things first: the total occurrences of "scientific method" per thousand words.

This is what we've already had. But now I've finally got those bookcounts running too. Here is the number of books per thousand* that contain the phrase "scientific method":

Well, it increases, but the numbers are quite small: .001 per thousand is an order of magnitude less than a lot of the patterns we've been looking at. The most citations in any year is 218, in 1902 (the spike in 1885 has 188 occurrences, in a year that I have fewer books for). So we can't look for the invention of the scientific method in the 1880s, but we can look at its popularization and changing meanings. If this were a flat line, the study would probably do better to focus on changing meanings than on popularization, because the evidence would be that it didn't get all that much more popular.

The bookcounts graph gives something of the same story, but it tells more of a story of gradual increase from around 1850, rather than from around 1870 as the wordcounts might lead us to think. I need to think about why this is. I think there's an easy mathematical explanation, but I'm tired.

So that's something. We might want to compare it to some other movements, look up some of the spikes (it looks like there was a book or two in 1885 that used the phrase lots of times), and see if other words have that decline from 1900 to 1910.
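To make the difference between the two measures concrete, here is a minimal Python sketch, not the code behind the charts. The function name `yearly_rates`, the `(year, text)` input shape, and the toy texts are all assumptions made for illustration. It computes occurrences per thousand words and books per thousand containing the phrase for each year, and shows how a single book that repeats the phrase moves the first number more than the second.

```python
from collections import defaultdict

def yearly_rates(books, phrase="scientific method"):
    """books: iterable of (year, text) pairs (a hypothetical input shape).

    Returns {year: (occurrences per 1000 words, books per 1000 containing phrase)}.
    """
    words = defaultdict(int)       # total words seen per year
    hits = defaultdict(int)        # total phrase occurrences per year
    n_books = defaultdict(int)     # number of books per year
    books_with = defaultdict(int)  # books containing the phrase at least once

    for year, text in books:
        lowered = text.lower()
        occurrences = lowered.count(phrase)  # substring match, so plurals count too
        words[year] += len(lowered.split())
        hits[year] += occurrences
        n_books[year] += 1
        if occurrences > 0:
            books_with[year] += 1

    return {
        year: (
            1000 * hits[year] / words[year] if words[year] else 0.0,
            1000 * books_with[year] / n_books[year],
        )
        for year in n_books
    }

# Toy data, invented for illustration: one book repeats the phrase, one never uses it.
toy_books = [
    (1885, "scientific method " * 3 + "filler " * 94),
    (1885, "filler " * 100),
]
rates = yearly_rates(toy_books)
```

In the toy year, three uses in one book are enough to set the wordcount rate, while the bookcount rate only registers that one book in two uses the phrase at all.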

What else can we do? Well, we can see what words are used in connection with it. The words that appear in a sentence with "scientific method" don't tell us very much:

the         of        and         to scientific 
     17359      13708       7545       6346       5538 
        in          a     method         is       that 
      5105       3708       3687       3128       2606

A bunch of common words, and the two constituent words. The only remotely surprising fact is that "scientific" appears much more often than "method," but that's just because my code counts the phrase "scientific methods" as well.
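The counting step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code that produced the table: the function name `sentence_cooccurrences`, the naive sentence splitter, and the sample text are all invented here. Because it matches the phrase as a substring, it also picks up "scientific methods," which reproduces the quirk just mentioned.

```python
import re
from collections import Counter

def sentence_cooccurrences(text, phrase="scientific method"):
    """Count every word in each sentence that contains `phrase`."""
    counts = Counter()
    # Assumption: sentences end at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        if phrase in sentence:  # substring match: "scientific methods" also hits
            counts.update(re.findall(r"[a-z]+", sentence))
    return counts

# Invented sample text, for illustration only.
sample = (
    "The scientific method is a method of inquiry. "
    "She walked home. "
    "Scientific methods of the day relied on the laboratory."
)
counts = sentence_cooccurrences(sample)
top_words = counts.most_common(5)  # the analogue of the table above
```

Words from the middle sentence never enter the tally, since that sentence does not contain the phrase.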

If we divide those by the overall totals and multiply by 100, we start to get something interesting:

scientific     method         of         is         in 
     3.012      0.982      0.010      0.009      0.007

That tells us that 3 percent of all occurrences of the word "scientific" are in the same sentence as the phrase "scientific method," and 1% of all occurrences of "method." Maybe we want a new count to see what other kinds of methods there are. Clearly the rest of the words have no particular tie. But we can apply the same method to all the words that appear in a sentence with "scientific method":

phenomenism  inseparability    irrefragably 
           6.19            4.12            3.59 
  philosophized      scientific         eealism 
           3.04            3.01            2.81 
     positivism presuppositions     ideological 
           2.52            2.34            2.29 
      josselyns          attika     reorganizes 
           2.23            2.22            2.21

Those are some obscure words. Some of them are probably just chance or the results of very small groups talking with each other: "irrefragably," which means "indisputably," appears just 7 times in a sentence with "scientific method," and some of those are probably multiple editions of a text. Our number one hit, though, "phenomenism," happens 21 times with "scientific method," and 339 overall. Sounds worth checking out. (I can't remember whether I ever came across that term in reading about the history of phenomenalism.) And positivism is obviously an important movement in the late 19C to check out for the origins of anything scientific, though I don't know that my mind would take me there right away. If we require a minimum number of hits, we get Comte and the names of a couple of social sciences. Not bad.
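The ratio and the minimum-hits filter can be sketched as follows. This is a hedged Python illustration, not the original code: `share_in_phrase_sentences` and its parameters are hypothetical names, and only the "phenomenism" figures (21 of 339) come from the post; the other totals are made up to show the filter working.

```python
def share_in_phrase_sentences(cooccur, totals, min_hits=1):
    """Percent of each word's total occurrences that fall in phrase sentences.

    cooccur: {word: count in sentences containing the phrase}
    totals:  {word: count in the whole corpus}
    Words with fewer than `min_hits` co-occurrences are dropped as too noisy.
    """
    shares = {}
    for word, hits in cooccur.items():
        total = totals.get(word, 0)
        if hits >= min_hits and total > 0:
            shares[word] = 100.0 * hits / total
    # Highest share first, as in the table above.
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)

# "phenomenism" uses the post's numbers; the other corpus totals are invented.
cooccur = {"phenomenism": 21, "irrefragably": 7, "the": 17359}
totals = {"phenomenism": 339, "irrefragably": 200, "the": 100_000_000}

ranked = share_in_phrase_sentences(cooccur, totals, min_hits=20)
```

Raising `min_hits` trades coverage for stability: rare words with a handful of hits drop out, and what remains is less likely to be an artifact of one text in several editions.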

But while before we were just getting a list of common words, now we're just getting words that themselves use "scientific method" all the time. Our phrase is important to them, but are they important to our phrase? Time for a scatterplot. We'll put the words that are important for "scientific method" ("the," "of," "and," and so on) on one axis and the words that "scientific method" is important to ("phenomenism," "inseparability," etc.) on the other. The first pass will just illustrate the numbers I showed above:

We're interested in words that aren't close to either of the axes, and the only words discernibly so are "scientific," "method," and "methods." I'm not going to lead you through all the transformations, but I'm changing the axes so they're on the same scale, and using logarithmic axes so we can pull more detail out. That makes the general scatter plot look like this, with the words we've been tracking written over it to help get your bearings:

Those striped bands on the left are words that only appear once along with our keyphrase; the next band is twice, and so on until we get to a reasonable sample somewhere around 20, where "phenomenism" is. I also highlight a couple of the outliers on the bottom: "she," "you," and "her" are strikingly unlikely to appear in a sentence with "scientific method." Those sorts of outliers can be interesting, though only when used with care. It might be interesting, say, to see if the language used by peddlers of scientific method is yet more male-centric than other forms of science writing. "Scientific," "method," and "methods" still stick out as the words that are both important to the phrase and that the phrase is important to. But now we can see a bunch of other suggestively positioned words out there to think about as well. Let's zoom in on that portion of the graph, and write all words in.

So how useful is that? There's certainly some interesting stuff to think about: the social sciences stick out far more than the hard sciences, Comte makes an appearance in the flesh, and so on. The working hypothesis I'd draw, which isn't completely trivial, is that Comtean social science plays a role in the conceptualization of the scientific method during the period of its emergence in America. Drawing that isn't completely mechanical: it requires me to know that Comte, the social sciences, and positivism all have strong links to each other. (I suppose we could program that in somehow, though. Comte, at least, would be proud.) But I might be completely wrong to seize on those. I'm dismissing the Francis Bacon words that stick out because I think that they (Bacon, Organon, etc.) are probably just throat-clearing in books or history, which is what the Comte stuff may be too. Someone with more knowledge may be able to see better patterns. Also, we might want to do a year-by-year exploration; I'll leave that for another time. This post is already long, even by my standards.

We could also make these easier to read. I love word scatterplots, but they are ugly in their way. It would be pretty easy to code a metric of distinctiveness that gives the distance of any word from the axes; that way we could just get an ordered list in which "scientific" would be the first word, "method" the second, "methods" the third, and then some interesting stuff. I like having the dimensional data, though, easily accessible, and doing that right now is work.
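One way such a distinctiveness metric might look, sketched in Python under several assumptions: both coordinates are logged, each axis is rescaled to run from 0 to 1 so the two are comparable, and a word's score is the smaller of its two rescaled coordinates, that is, its distance from the nearer axis. The function name `rank_by_distinctiveness` and the min-max rescaling are choices made for this illustration, not the method used for the charts. The example values are the sentence counts and percentages quoted earlier in the post.

```python
import math

def rank_by_distinctiveness(points):
    """points: {word: (x, y)} with x, y > 0.

    x: how often the word appears in sentences with the phrase.
    y: percent of the word's own occurrences that fall in those sentences.
    Returns (ordered word list, scores); score = distance from the nearer axis
    after logging and rescaling each axis to [0, 1].
    """
    logs = {w: (math.log10(x), math.log10(y)) for w, (x, y) in points.items()}
    xs = [lx for lx, _ in logs.values()]
    ys = [ly for _, ly in logs.values()]

    def rescale(value, low, high):
        # Guard against a degenerate axis where every word has the same value.
        return (value - low) / (high - low) if high > low else 0.0

    scores = {
        w: min(rescale(lx, min(xs), max(xs)), rescale(ly, min(ys), max(ys)))
        for w, (lx, ly) in logs.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores

# (count in phrase sentences, percent of own occurrences), as quoted in the post.
points = {
    "scientific": (5538, 3.012),
    "method": (3687, 0.982),
    "of": (13708, 0.010),
    "is": (3128, 0.009),
    "in": (5105, 0.007),
    "phenomenism": (21, 6.19),
}
ordered, scores = rank_by_distinctiveness(points)
```

Taking the minimum means a word scores well only if it is far from both axes, so very common words and very rare ones both sink toward the bottom of the list. Other choices (a product of the two coordinates, say) would order the middle of the list differently.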

What's easy, though, is to reproduce that last chart for bookcounts. Instead of just limiting ourselves to sentences, we can find out what words appear in the same _book_ as any discussion of scientific method. For the most part, that's going to be more interesting because it works from a larger sample. Here are the 1500 of the 192,000 words that appear in books using the phrase "scientific method" that show the strongest correlation. Again, farther to the right are words that "scientific method" frequently uses, and farther up are words that usually appear alongside "scientific method." I think Blogger lets you click on the thumbnail to get a large version of the full image, which you'll need:

There's more here than I could describe in a minute, so just a few impressionistic thoughts:

  • "Biology," "sociology," and "psychology" are the words the farthest out. "Physics" and "chemistry" are actually used more often in the books mentioning the scientific method than the first two of these, but they get pushed back into the cloud because books using "physics" don't actually talk about the scientific method nearly as much as books using "sociology." Presumably this has to do with insecurities.

  • Lots of famous scientists in the cloud. Most distinctively situated: Kant's prominence shows how differently he was taught then than now. Darwin is as prominent as always. Hegel is somewhat surprising; Huxley is not.

  • In the upper left, the words that rely a lot on "scientific method" but are less important in "scientific method"'s prominence, are several education words: "pedagogy," "kindergarten," "pedagogical," etc.

Now, if you're not interested in the history of the scientific method, this isn't interesting to you. But we can have the same plot for any other phrase (it takes computing time, but not much work), and I feel like for brainstorming, at least, about the connections of a given topic, these could be quite useful.

What they're missing, tragically, is the temporal dimension. Any ideas on how to bring that back in? I've got a couple inchoate ones now.

*(footnote) Two points on this. First, there are not actually a thousand books for any given year, so the number is a little misleading. Second, it's not totally clear what I should count as a book, since there are still a few foreign-language books, some books with completely indecipherable OCR, and so on. For these charts I count a text file in my library as a book if it has the word "the" at least once (there are several books that

Comments:

Anonymous - Nov 6, 2010

Ben this is great - lots of thoughts that will have to wait until tomorrow, but just wanted to toss out that portions of my writing on the matter have just been strongly buttressed! As it were! Hank