Posts with tag Bookworm
It’s not very hard to get individual texts in digital form. But working with grad students in the humanities looking for large sets of texts to do analysis across, I find that larger corpora are so hodgepodge as to be almost completely unusable. For humanists and ordinary people to work with large textual collections, they need to be distributed in ways that are actually accessible, not just open access.
I mentioned earlier that I’ve been doing some work on the old Bookworm project as I see that there’s nothing else that occupies quite the same spot in the world of public-facing, nonconsumptive text tools.
I used to blog everything that I did about a project like Bookworm, but have got out of the habit. There are some useful changes coming through the pipeline, so I thought I’d try to keep track of them, partly to update on some of the more widely used installations and partly
As I often do, I’m going to pull away from various forms of Internet reading/engagement through Lent. This year, this brings to mind one of my favorite stray observations about digital libraries that I’ve never posted anywhere.
As part of the 2016 Republican Primary, Jeb! Bush released a website enabling exploration of e-mails related to his official accounts as governor of Florida in the early 2000s. This whole sentence has an antiquity to it; the idea of pre-emptive disclosure (in large part to contrast with his presumed general election opponent, Hillary Clinton) seems hopelessly antique. And at the time, it was criticized for accidentally disclosing all sorts of personal information, both stories and Social Security Numbers. It did not make Jeb! president. Anyhow, back then I downloaded Jeb!’s e-mails–and Hillary’s–to think about what sort of stuff historians will do with these records in the future.
Just some quick FAQs on my professor evaluations visualization: adding new ones to the front, so start with 1 if you want the important ones.
-3 (addition): The largest and in many ways most interesting confound on this data is the gender of the reviewer. This is not available in the set, and there is strong reason to think that men tend to have more men in their classes and women more women. A lot of this effect is solved by breaking down by discipline, where faculty and student gender breakdowns are probably similar; but even within disciplines, I think the effect exists. (Because more women teach at women’s colleges, because men teach subjects like military history that male students tend to take disproportionately, etc.) Some results may be entirely due to this phenomenon (for instance, the overuse of “the” in reviews of male professors). But even if it were possible to adjust for this, it would only be partially justified. If women are reviewed differently because a different sort of student takes their courses, the fact of the difference in their evaluations remains.
I promised Matt Jockers I’d put together a slightly longer explanation of the weird constraints I’ve imposed on myself for topic models in the Bookworm system, like those I used to look at the breakdown of typical TV show episode structures. So here they are.
Lately I’ve been seeing how deeply we could integrate topic models into the underlying Bookworm architecture.
My own chief interest in this, because I tend to be a little wary of topic models in general, is in the possibility for Bookworm to act as a diagnostic tool internally for topic models. I don’t think simply plotting description absent any analysis of the underlying token composition of topics is all that responsible; Bookworm offers a platform for actually accessing those counts and testing them against metadata.
I thought it would be worth documenting the difficulty (or lack thereof) in building a Bookworm on a small corpus: I’ve been reading too much lately about the Simpsons thanks to the FX marathon, so figured I’d spend a couple hours making it possible to check for changing language in the longest-running TV show of all time.