You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Where were 19C US books published?

Jan 31 2011

Open Library has pretty good metadata. I'm using it to assemble a couple of new corpuses that I hope will allow better analysis than I can do now, but even the raw data is interesting. (Although, since a single 25 GB text file is the best way to interact with it, it's not always convenient.) While I'm waiting for some indexes to build, this is a good chance to figure out just what's in these digital sources.

Most interestingly, it has state-level information on books you can download from the Internet Archive. There are about 500,000 books with library call numbers or other good metadata, 225,000 of which were published in the US. How much geographical diversity is there within that? Not much. About 70% of the books were published in three states: New York, Massachusetts, and Pennsylvania. That's because the US publishing industry was heavily concentrated in Boston, NYC, and Philadelphia. Here's a map, using the Google graph API through the great new googleVis R package, of how many books there are from each state. (Hover over for the numbers, and let me know if it doesn't load; there still seem to be some kinks.) Not included is Washington DC, which has about 13,000 books, slightly fewer than Illinois.

google.load("visualization", "1", { packages: ["geomap"] });
google.setOnLoadCallback(drawChart);

function drawChart() {
  var data = new google.visualization.DataTable();
  // Each row: [region code, number of books, hover label]
  var datajson = [
    ["US-AK", 17, "AK 17"], ["US-AL", 328, "AL 328"], ["US-AR", 126, "AR 126"], ["US-AZ", 57, "AZ 57"],
    ["US-CA", 4857, "CA 4857"], ["US-CN", 5, "CN 5"], ["US-CO", 408, "CO 408"], ["US-CT", 2964, "CT 2964"],
    ["US-DC", 13203, "DC 13203"], ["US-DE", 106, "DE 106"], ["US-EK", 1, "EK 1"], ["US-FL", 165, "FL 165"],
    ["US-GA", 669, "GA 669"], ["US-HI", 125, "HI 125"], ["US-IA", 877, "IA 877"], ["US-ID", 32, "ID 32"],
    ["US-IL", 14398, "IL 14398"], ["US-IN", 2185, "IN 2185"], ["US-IO", 5, "IO 5"], ["US-IR", 1, "IR 1"],
    ["US-IW", 1, "IW 1"], ["US-KA", 1, "KA 1"], ["US-KN", 1, "KN 1"], ["US-KS", 654, "KS 654"],
    ["US-KY", 657, "KY 657"], ["US-LA", 438, "LA 438"], ["US-LI", 5, "LI 5"], ["US-MA", 38566, "MA 38566"],
    ["US-MD", 2820, "MD 2820"], ["US-ME", 986, "ME 986"], ["US-MG", 1, "MG 1"], ["US-MI", 1498, "MI 1498"],
    ["US-MM", 1, "MM 1"], ["US-MN", 1200, "MN 1200"], ["US-MO", 1956, "MO 1956"], ["US-MS", 103, "MS 103"],
    ["US-MT", 91, "MT 91"], ["US-NA", 2, "NA 2"], ["US-NB", 348, "NB 348"], ["US-NC", 685, "NC 685"],
    ["US-ND", 103, "ND 103"], ["US-NE", 2, "NE 2"], ["US-NH", 820, "NH 820"], ["US-NJ", 1579, "NJ 1579"],
    ["US-NM", 95, "NM 95"], ["US-NV", 32, "NV 32"], ["US-NY", 99118, "NY 99118"], ["US-NZ", 1, "NZ 1"],
    ["US-OH", 5072, "OH 5072"], ["US-OK", 106, "OK 106"], ["US-OR", 368, "OR 368"], ["US-PA", 19680, "PA 19680"],
    ["US-PE", 2, "PE 2"], ["US-PO", 1, "PO 1"], ["US-PR", 1, "PR 1"], ["US-QA", 1, "QA 1"],
    ["US-RI", 861, "RI 861"], ["US-RY", 1, "RY 1"], ["US-SA", 2, "SA 2"], ["US-SC", 574, "SC 574"],
    ["US-SD", 79, "SD 79"], ["US-ST", 2, "ST 2"], ["US-TN", 791, "TN 791"], ["US-TX", 674, "TX 674"],
    ["US-US", 1, "US 1"], ["US-UT", 269, "UT 269"], ["US-VA", 1574, "VA 1574"], ["US-VI", 1, "VI 1"],
    ["US-VP", 1, "VP 1"], ["US-VT", 495, "VT 495"], ["US-WA", 364, "WA 364"], ["US-WI", 1480, "WI 1480"],
    ["US-WS", 1, "WS 1"], ["US-WV", 222, "WV 222"], ["US-WY", 43, "WY 43"], ["US-XX", 437, "XX 437"],
    ["US-YT", 1, "YT 1"]
  ];
  data.addColumn("string", "state");
  data.addColumn("number", "books");
  data.addColumn("string", "hovervar");
  data.addRows(datajson);
  var chart = new google.visualization.GeoMap(document.getElementById("GeoMap_2011-01-29-13-18-56"));
  var options = {};
  options["dataMode"] = "regions";
  options["width"] = 600;
  options["region"] = "US";
  options["colors"] = [0xE8FFDC, 0x48A200, 0x3E892C, 0x368124, 0x2E791C, 0x267114];
  chart.draw(data, options);
}
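
As a rough check on the "about 70%" figure, here is a minimal sketch in plain JavaScript that adds up the New York, Massachusetts, and Pennsylvania counts from the map data and divides by the total. The names `topThree`, `US_TOTAL`, and `shareOfTotal` are mine, and using the rounded 225,000 from the text as the denominator is an assumption rather than an exact count.

```javascript
// Counts copied from the map data for the three largest states.
const topThree = { NY: 99118, MA: 38566, PA: 19680 };

// Assumption: the rounded figure of 225,000 US books quoted in the post
// is used as the denominator.
const US_TOTAL = 225000;

// Sum the counts in an object and return them as a fraction of `total`.
function shareOfTotal(counts, total) {
  const sum = Object.values(counts).reduce((a, b) => a + b, 0);
  return sum / total;
}

const share = shareOfTotal(topThree, US_TOTAL);
console.log((share * 100).toFixed(1) + "% of US books come from NY, MA, and PA");
```

The same function would work on any subset of states, for example to see how little the share changes when Illinois and DC are added.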

I'm going to try to pick publishers that aren't just in the big three cities, but any study of culture, as opposed to the publishing industry, is going to be heavily influenced by the pull of the Northeastern cities.

This raises some interesting questions about how well book data works for generalizations about American culture as a whole. For a lot of purposes, including the one that Culturomics says it's interested in, a well-curated collection of scanned newspapers, with text files and metadata released into the public domain, would be much better. Newspapers are generally published in the places some of their editorial content comes from (although a lot of it was republished/stolen from other newspapers in the 19C, right?), so with enough data you could see some really interesting things. You could see, over several days, the spread of news about Mexican War battles. Or you could match newspaper coverage of campaigns to whistlestop tours, or compare the relative newspaper coverage of a dull campaign like 1888 to an exciting one like 1896. I'd want to see, maybe, how discussion of the League of Nations tracked Wilson's tour. It's easy to think of a lot of things like this. We currently have scanned newspaper databases, but as far as I know, they don't release their text and metadata, but rather shoehorn you into a web interface. Maybe we just need individual dissertators to strike content agreements to do research, but of course I'd rather see everyone have free access like IA gives to books. Is anything like that out there or coming down the pike?

Since I'm doing more traditional intellectual history, I'm not worried about using books instead. I'm mostly interested not in regional variation but in genre variation; I want to know what psychology books say, and I don't really care if few were published in the South. But it does affect other types of questions we might ask. Just thought I'd throw that out there.

Comments:

Ben - Feb 3, 2011

Note: one mistake with this chart is the Nebraska data, which I thought I fixed but realize I didn't. As far as I can tell, whoever set up the codes used NB for Nebraska instead of NE, so it shows up as two here instead of the few dozen/hundred it should.
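
One way the Nebraska fix could look, as a minimal sketch: fold the NB rows into NE before handing the data to the chart. This assumes, as the note above does, that NB really is Nebraska; the `RECODE` table and the `mergeCodes` function are hypothetical names for illustration, not part of the original chart code, and only a few sample rows are shown.

```javascript
// A few rows in [code, count] form, with counts taken from the chart data.
const rows = [["US-NB", 348], ["US-NE", 2], ["US-NY", 99118]];

// Hypothetical recode table: code found in the metadata -> code the map expects.
// Assumption: NB is an older abbreviation for Nebraska.
const RECODE = { "US-NB": "US-NE" };

// Apply the recode table, then sum counts that now share a code.
// Returns rows in the chart's [code, count, hover label] shape.
function mergeCodes(rows, recode) {
  const totals = new Map();
  for (const [code, count] of rows) {
    const fixed = recode[code] || code;
    totals.set(fixed, (totals.get(fixed) || 0) + count);
  }
  return Array.from(totals, ([code, count]) => [code, count, code.slice(3) + " " + count]);
}

const merged = mergeCodes(rows, RECODE);
console.log(merged);
```

Whether any of the other stray two-letter codes in the data should be merged the same way is a separate question that this sketch does not try to answer.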