Author Genders: methodology

May 07 2012

We just rolled out a new version of Bookworm (now going under the name Bookworm Open Library) that works on the same codebase as the ArXiv Bookworm released last month. The most noticeable changes are a cleaner and more flexible UI (mostly put together for the ArXiv by Neva Cherniavsky and Martin Camacho, and revamped by Neva to work on the OL version), coupled with some behind-the-scenes tweaks that should make it easy to add new Bookworms on other sets of texts in the future. But as a little bonus, there's an additional metadata category in the Open Library Bookworm we're calling author gender.

I don't suppose I need to tell anyone that gender has been an important category to the humanities over the last few decades. But it's been important in a way that makes lump categories like this highly fraught, so I want to be slightly careful about this. I'll do that in two posts: this one, explaining the possibilities and limits of the methodology; and a follow-up that actually crunches the data to look at how library holdings, and some word usages, break down by gender.

To go back to basics: the basic idea of Bookworm has been that metadata is a powerful way of statistically analyzing texts. (I know this probably sounds like a vanilla statement: but topic modeling and other methods developed for unstructured libraries occupy so much space in DH projects that I think it isn't.) Some metadata categories are recorded directly by libraries, like publication place or Library of Congress subject heading: others are derived, like author age.

Gender is now one of the latter categories. It is not something that librarians (usually) write down: but it's easy to get some statistics that are quite obviously meaningful (although what exactly they mean is a little opaque). We can do this by looking at the names of authors.

I think these sorts of derived metadata categories offer some of the most interesting information we can get, but they also require a greater tolerance for error. A lot of the time, they will be wrong.

Humanists tend toward violent perfectionism when it comes to mistakes in numeric information, but I think this is a mistake when applied to the sort of data we're using here. The nice thing about author gender metrics is that they will err, for the most part, in exactly the same direction as humans and humanists do. Both humans and computers will jump to the conclusion that someone named Elmer must be male, even if she isn't.

If you go to Bookworm and play around with some gender searches, you can see this in action. One of the nice things about the interface is that you can see the constituent books of a line by clicking on the graph. That means that there's no need to blindly trust the gender algorithm; by clicking on the charts and just looking at what pops up, any user will see that he or she, too, would assume from the title page that a given author is female.

Of course, there are a few reasons humans and computers _both_ might be wrong. But before I get into that, let me explain the general principle used. It's really simple.

  1. Parse all the first names in 1% samples of the 1910 and 1920 US Censuses (from IPUMS) and get their genders. (1910 and 1920 are the only ones with name fields, I believe).

  2. Compare those to the first name of the first author of each book in the Open Library.

  3. If the author's first name is associated with a single gender more than 97% of the time in the census, assume the book was written by someone of that gender. Otherwise the name is ambiguous (e.g., Ashley, A.J., or Leslie); then, don't assign a gender. (A minimal sketch of this rule follows below.)
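
To make the rule concrete, here is a minimal sketch in Python. It assumes the IPUMS records have already been boiled down to an iterable of (first name, sex) pairs; the function names and the exact string handling are illustrative, not the actual Bookworm code.

```python
from collections import defaultdict

THRESHOLD = 0.97  # a name must be >97% one gender in the census sample


def build_name_table(census_rows):
    """Tally the genders recorded for each first name.

    `census_rows` is assumed to be an iterable of (first_name, sex)
    pairs, with sex coded as "male" or "female", already extracted
    from the 1910/1920 IPUMS samples.
    """
    counts = defaultdict(lambda: {"male": 0, "female": 0})
    for name, sex in census_rows:
        counts[name.strip().upper()][sex] += 1
    return counts


def classify_author(author_name, name_table):
    """Guess a gender from an author's first name, or return None.

    Names under the 97% threshold (Leslie, say) and names the census
    never saw (bare initials like "A.J.") stay unclassified.
    """
    first = author_name.split()[0].strip(".,").upper()
    tally = name_table.get(first)
    if tally is None:
        return None
    total = tally["male"] + tally["female"]
    if tally["male"] / total > THRESHOLD:
        return "Probably male"
    if tally["female"] / total > THRESHOLD:
        return "Probably female"
    return None
```

So `classify_author("Willa Cather", table)` would come back "Probably female", while an ambiguous or initials-only name would come back None.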

Here, to give the general idea, is a list of the names being used: on the outside of the black lines, they're classifying gender; on the inside (Lois, Christian, Francis and Frances) the names are rejected as too unclear.

American names in 1910 and 1920 are a reasonable proxy for gender, but things may be different in Britain, in Germany, and so forth. It would be easy to get the accuracy up a smidge by incorporating author birth date and some location information; for the last 100 years, Social Security data would work for Americans as well. That seems eventually worth it, but not necessary for a first pass.

In the selection interface, this is labeled Probably male and Probably female, because it's not 100% accurate. How accurate is it? I'm reluctant to hazard a guess: right now, quantification would only give false certainty. It's not 97% accurate, because pseudonyms and initials will end up masking or reclassifying a lot of writers. (Although as I wrote last month, author use of initials seems not to skew gender conclusions as systematically as I would have thought--they're probably hiding as many men as women.) Although a few authors have some biographical information we could check for gender, there's no reason to think that they're representative of the full population of library authors.

*More technically, it's also not 97% accurate because it assumes that a given book is equally likely to be male or female, which we know not to be true. A better prediction of author gender for an individual book would seem to come from a Bayes classifier that took into account that most books are by men. (Just how many is something we can only learn recursively from the names.) That would, effectively, push both lines above to the left (for example, calling Claude unambiguously male but bumping Doris from definitely female to ambiguous). Since the goal is the overall trendlines, not the individual books, I'm not sure that's desirable. Still, currently it's likely that there are more stealth men among the women than vice versa. A side-effect of that would be, though, that women would be even more underrepresented, and there seems something vaguely undesirable (or is it just unfair?) to me about that. Moreover, the census data is itself suspect in a way that suggests to me that there's a bit of mean-reversion preventing any name from being as uniformly male or female as it is in the wild: I bet a lot of the missing distinction between Francis and Frances, for example, is because the 1910s enumerator or the 1990s data enterer misspelled the name, or wrote it down before proceeding on to fill out the gender column. Any thoughts?
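For the curious, the adjustment described above is easy to sketch. If the census sample is close to evenly split by sex, the census share for a name can be treated as a likelihood ratio and re-weighted by a prior over authors; the 30% prior here is a number made up purely for illustration, since (as noted above) the real figure would have to be learned recursively from the names themselves.

```python
def posterior_female(census_share_female, prior_female=0.30):
    """Re-weight a name's census gender share by a prior over authors.

    Assumes the census population itself is roughly 50/50, so its even
    prior odds divide out to 1. `prior_female` (30% here) is an
    illustrative guess at the share of books written by women.
    """
    if census_share_female in (0.0, 1.0):
        return census_share_female  # nothing to re-weight
    census_odds = census_share_female / (1 - census_share_female)
    prior_odds = prior_female / (1 - prior_female)
    posterior_odds = census_odds * prior_odds
    return posterior_odds / (1 + posterior_odds)
```

With that 30% prior, a name that is 97% female in the census drops to about 93% female among authors (so a Doris slips under the threshold), while a name that is only 94% male in the census now clears 97% male--which is the leftward push of both lines described above.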

Given all that, this can only take us so far. What it _does_ reflect pretty well is what the reader would think when she picked up a book. Exhuming the authors and counting X-chromosomes would be interesting and all, but not obviously a better way of looking at the way gender acts in society; the publishers knew as well as we what names sounded female, and perhaps the most interesting questions to be answered by these methods are about what language readers thought they were seeing used by women, and what language used by men; or what language publishers and librarians allowed to reach the public by each gender.

Which is to say: I suspect that if there are sound applications for this sort of data, they will build off of everything we've learned in the last decades about the social constructedness of gender. This is less good at restoring lost female voices, or showing the applications of men, than it is at showing the restraints that gender may have placed on individual actors, in ways they may not fully have been aware of.

The constraints on gender are so strong in the late 19th century US that a lot of questions one might want to ask--do men and women address readers differently? are they interested in different things?--are only possible with carefully constructed subsets, if at all. Women only publish with certain publishers, in certain genres, under certain expectations; these comparisons can show the constraints on what a female author--whether in real life or in pseudonym--could say.* (Any comparison across the full corpus, for example, will be heavily weighted towards fiction and education, the areas with the highest percentage of female authors.)

*Contradicting that stance, there's actually a fourth step to the algorithm right now. If the author was born in 1819 and is named George Eliot, we classify them as female. I added that because I figure 80% of the time that someone decides to try to trick the algorithm, that's the example they'll go with; and I feel like denying them the pleasure. In principle, I want to remind people that machine categorization and human knowledge aren't always at odds. Maybe we'll retire it at some point: it actually changes some charts people might see, since the default search is often for natural selection, and Eliot is one of the first female novelists to use the phrase. We'll see.

As always, this raises some funny questions about organizational imperatives. As I said to some people at a big established digital repository a couple weeks ago, this sort of fast-and-loose summary classification might be administratively difficult for a bigger organization. Releasing a classification with known mistakes is just not appropriate for big curators of data. Still, I think it's worth trying.

On the other hand, there are real benefits. In certain fields like the sciences, it can be extremely difficult to find books by women; this search engine is one of the only ways I know to find and arrange them from a big corpus (about 1,000,000 books, which are broadly representative of university library holdings). And although the data won't be perfect, I can't emphasize enough the degree to which this doesn't matter for the sorts of questions that can usefully be asked of it. If there are substantial differences between writing by the two genders, they will show up with a 90% accurate classifier, just more weakly.
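
A toy simulation makes the point; every number here (the usage rates, the 30% female share, the 90% accuracy) is invented for illustration.

```python
import random

random.seed(0)

# Suppose women use some word 12 times per 10,000 words and men 8
# times: a true gap of 4. Tag each book with a 90%-accurate
# classifier and see how much of the gap survives.
TRUE_RATE = {"F": 12.0, "M": 8.0}
ACCURACY = 0.90


def observed_rates(n_books=100_000, share_female=0.30):
    tagged = {"F": [], "M": []}
    for _ in range(n_books):
        sex = "F" if random.random() < share_female else "M"
        # 10% of the time, the classifier flips the label
        label = sex if random.random() < ACCURACY else ("M" if sex == "F" else "F")
        tagged[label].append(TRUE_RATE[sex])
    return {g: sum(rates) / len(rates) for g, rates in tagged.items()}


print(observed_rates())
# roughly {'F': 11.2, 'M': 8.2}: the gap shrinks from 4 to about 3,
# but its direction and rough size remain plainly visible.
```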

Those disclaimers out of the way: let me know if you find anything interesting. I'll release some basic statistics on the genre distributions, usage of personal pronouns, and so forth once I get a chance.

Comments:

Evan Roberts - May 1, 2012

First and last names are available in all the historical samples (1930 and earlier), and will be available in 1940 soon.

Ben - May 1, 2012

You're right: it's as clear as day on the IPUMS web site. I have no idea why I thought they weren't.

Unknown - May 2, 2012

Instead of guessing at author gender, why not look them up in Freebase for those which are available and only fall back to guessing if you don't get a hit?

You can look up by LC NAF, OpenLibrary and a bunch of other IDs. There's also http://genderednames.freebaseapps.com/index?name=Sue

Ben - May 2, 2012

(this is a follow-up to my post below, which I put in the wrong place);

I just logged onto Freebase and classified a few genders myself to see what it was like--I see I may be wrong that machine tagging is predominant, and that a lot of it is likely scanning the Wikipedia article for gender clues and then entering it in. Which is pretty valuable. That should be better in a lot of cases than just automatically classifying. So the question is just:

  1. Whether enough OL authors are in freebase to make it worth the time entering them.

  2. Whether a good gender classifier based on census data including location, etc., would be better than a good one based on freebase.

  3. Whether I should just keep freebase independent as a check on the methodology.

Anonymous - May 2, 2012

I'm a very non-technical person who is trying to find out (even just roughly) the % of female vs male authors (globally or nationally)--can you give any indication of this?