You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Back to Basics

Nov 08 2010

Ive rushed straight into applications here without taking much time to look at the data Im working with. So let me take a minute to describe the set and how Im trimming it.

The Internet Archive has copies of some unspecified percentage of the public domain books for which google books has released pdfs. They have done OCR (their own, I think, not googles) for most of them. The metadata isnt great, but its usablethe same for the OCR. In total, they have 900,000 books from the Google collectionDan Cohen claims to have 1.2 million from English publishers alone, so were looking at some sort of a sample. The physical books are from major research librariesHarvard, Michigan, etc.

900,000 books is too many for me to process at home. Since Im interested in American history, I made a list of the largest American publishers by doing some preliminary analysis of their catalog, and reading about the publishers online. Publisher names are irregular, so Ive searched for a single keyword to represent a publishing house; eg, Scribner to catch Charles Scribner, Chas. Scribners sons, Scribners, etc.).I keep adding files (which creates more work for me as I retool the database), but the current list is, from largest to smallest: Harper, Appleton, Houghton, Scribner, Putnam, Little, Carey, Ticknor, Riverside. If anyone has any suggestions for what to add, Im open to it. The only big publisher Ive consciously excluded is the Government Printing Officeits one of the largest, and probably could be subjected to a fascinating analysis in its own right, but clearly has a different intended audience than the literary publishers.

A lot of the files are duplicatesI remove books that have the same author, title, and year from being counted twice, but Im sure many books are still creeping in through the cracks. Just to give an idea of what sort of books were catching, heres a randomly selected set of twenty books titles.

> sample(full.catalog[,5],20)

[1] The divorce of Catherine of Aragon; the story as told by the imperial ambassadors resident at the court of Henry VIII..

[2] Levana; or, The doctrine of education

[3] Cold dishes for hot weather

[4] Wilbur Fisk

[5] London City Churches

[6] The writings of Oliver Wendell Holmes

[7] Germany and the Germans

[8] The Novels of Charles Lever

[9] Quality Street: A Comedy

[10] The Crescent and the Cross; Or, Romance and Realities of Eastern Travel.

[11] A passionate pilgrim

[12] The works of Charles Lamb

[13] Advanced Arithmetic

[14] The Spirit of Old West Point, 1858-1862

[15] The moral, social, and professional duties of attorneys and solicitors

[16] The favor of kings

[17] At the mikados court; the adventures of three American boys in modern Japan

[18] George Washington: An Historical Biography

[19] Josiah Wedgwood, F.R.S.: His Personal History

[20] Alexander the Great

Those sound reasonable as a sample of American books from the period. Can you guess the years the books are from? Of course, the sample is overweighted towards the 80s and 90s, when we have the most booksif we weight the sample so were equally likely to end up with books from any period, we end up with something slightly different:

> full.catalog[sample(1:nrow(full.catalog),20,prob=prob^2),c(5,7)]

The Life of John Randolph of Roanoke 1851

Mind in nature; or The origin of life, and the mode of development of animals 1865

Lex Parliamentaria Americana: Elements of the Law and Practice of 1874

On amputation of the cervix uteri: In Certain Forms of Procidentia, and on Complete Eversion of 1869

White Aprons: A Romance of Bacons Rebellion, Virginia, 1676 1896

The works of the Right Honorable Edmund Burke 1877

A Centennial Discourse Delivered to the First Congregational Church and Society in Leominster 1843

Out-doors at Idlewild; or, The shaping of a home on the banks of the Hudson 1855

The Women of the American Revolution 1850

Progressive religious thought in America; a survey of the enlarging Pilgrim faith 1919

Women in industry; a study in American economic history 1909

Recollections of a varied career 1908

The Balkans: Roumania, Bulgaria, Servia, and Montenegro 1896

New England aviators 1914-1918; their portraits and their records 1919

Ernest Maltravers 1838

Genetic Psychology for Teachers 1903

Work for Women 1883

The Atlantic Islands as Resorts of Health and Pleasure 1878

The Grasshopper in Lombard Street 1892

Chronicle of the Coach: Charing Cross to Ilfracombe 1886

Novels, monographs, republicationstheres a mix, that probably varies over timebut that variation itself is interesting.

So how many books does that leave? About 33,000. Heres a quick plot of the years they come from: over 100 books a year from the 1830s, over 200 from the mid 40s. The sample is best for the GAPE, so analyses will be more reliable from then.
Now, 100 books a year is not very much, when you think about itbut 1000 books a decade from the 1830s on seems sufficient. We could do decade counts, but thats just one wayand not the bestof smoothing data. For now, Im using loess normalization, which is a least squares based methodif the data was perfect a moving average would be more appropriate, but given the possibility of outliers (multiple copies of a single book under different authors, say, skewing the numbers for one year) I think loess is an acceptable compromise.

A disclaimer that may not be obvious is that publication date is not the same as date written. For example, about half a percent of the books are by Shakespeare:

For some uses, this is a feature, not a bug. Actually finding the year a book was written is nearly impossible. (Even nowadayslook at an academic monograph, and try to figure how man of the sentences were first written for seminar papers a decade earlier). And in most cases we want to pick up on the currency of whats being printedall those Victorian editions of Shakespeare show us, in part, what people were actually reading.

Theres one category Im worried about, though: collected works editions. Browsing the titles, its clear there are a lot of these1645, to be precise, or about 4% of the total. Heres a random sample to give an idea:

[1] The writings of Charles Dickens, with critical and bibliographical introductions and notes by Edwin Percy Whipple and others;

[2] The works of John Greenleaf Whittier;

[3] The Poetical Works of Edmund Spenser

[4] The Poetical Works of Henry Wadsworth Longfellow: With Bibliographical and

[5] The works of Robert Louis Stevenson

[6] The Complete Poetical Works of Elizabeth Barrett Browning

[7] The Works of Washington Irving

[8] The poetical works of Bayard Taylor

[9] Darwinism stated by Darwin himself. Characteristic passages from the writings of Charles Darwin

[10] The writings of Thomas Jefferson;

[11] The Works of John C. Calhoun

[12] The works of William Makepeace Thackeray

[13] The works of Daniel Webster ..

[14] The Poetical Works of William Cullen Bryant

[15] The Writings of Bret Harte

[16] The writings of George Eliot : together with the life by J. W. Cross

[17] The Poetical Works of Thomas Gray ; with a Memoir

[18] The Writings of George Washington

[19] The poetical works of miss Landon: Complete

[20] The Writings of Mark Twain

There are a lot of problems with these editionsthe metadata makes it harder to eliminate duplicates (because the creator field can be any combination of the author and editor), they tend to be long and generally not selected for contemporary relevance, and we know that they dilute the currency of the vast majority of terms in them. On the other hand, Id really hate to miss out on the data contained in no. 9 above in a study of evolutionary language, and the road of truly cleaning up a sample is one that goes a long way. Im inclined to keep them in for now.

(I should note now that the numbers above are all a little off, because I was double-counting entries from Appleton by accidentit doesnt change the larger point).

Comments:

One note to myselfthere seem to be about 180 boo

Ben - Nov 1, 2010

One note to myselfthere seem to be about 180 books being counted twice at the moment, 170 of which are published by Houghton. I could just remove them, but better to figure out why thats happening to see if it can help delete other duplicates.

FTR, its because a lot books list both Hought

Ben - Nov 0, 2010

FTR, its because a lot books list both Houghton and Riverside in the publisher field, particular later in the period.