You are looking at archived content from my "Bookworm" blog, an experiment that ran from 2014-2016. Not all content may work. For current posts, see here.

Andrew Piper announced yesterday that the McGill text lab is releasing their corpus of modern novels in three languages. One of my first thoughts with any corpus is: what existing Bookworm methods might add some value here? It only took about ten minutes to write the code to import it into a bookworm; the challenge is figuring out how methods developed for millions of books can be useful on a set of just 450.

Hansard Dec 14 2015

A first pass at understanding the potential of the Hansard corpus through a Bookworm browser.

I’ve used the intrinsic speaker tag to divide the native XML into individual speeches.

A “speech” can be very short; on average, each one in the Hansard corpus is 225 words.
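As a rough sketch of what splitting on a speaker tag and measuring speech length could look like: the element name `speech`, the attribute `speaker`, and the sample text below are all invented for illustration, since the actual Hansard schema isn't shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag and attribute names are assumptions,
# not the real Hansard XML schema.
SAMPLE = """<debate>
  <speech speaker="Member A">I rise to move the second reading of the bill.</speech>
  <speech speaker="Member B">I cannot agree with the honourable member opposite.</speech>
  <speech speaker="Member A">Hear, hear.</speech>
</debate>"""


def split_speeches(xml_text, tag="speech", speaker_attr="speaker"):
    """Return one record per speech element, with its speaker and word count."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.iter(tag):
        # itertext() also collects text nested inside child elements.
        text = "".join(node.itertext()).strip()
        records.append({
            "speaker": node.get(speaker_attr),
            "text": text,
            "words": len(text.split()),
        })
    return records


speeches = split_speeches(SAMPLE)
mean_words = sum(s["words"] for s in speeches) / len(speeches)
print(len(speeches), "speeches; mean length", round(mean_words, 1), "words")
```

Each record could then become one "document" in a Bookworm, with the speaker carried along as metadata.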

2015-11-19

My post on ‘rejecting the gender binary’ showed a way to use word2vec models (or, in theory, any word embedding model) to find paired gendered words–that is, words that mean the same thing except that one is used in the context of women, and the other in the context of men.
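To make the idea of paired words concrete, here is a minimal sketch using the simple vector-offset approach: shift a word along the he-to-she direction and take the nearest remaining word. The three-dimensional vectors are toy values made up for the example, not output from any trained model, and the offset method is an assumption that may differ from what the original post does.

```python
import math

# Toy 3-d vectors, invented for illustration only.
# The first dimension plays the role of a "gender" axis.
VECTORS = {
    "he":      [1.0, 0.0, 0.0],
    "she":     [-1.0, 0.0, 0.0],
    "king":    [1.0, 1.0, 0.0],
    "queen":   [-1.0, 1.0, 0.0],
    "actor":   [1.0, 0.0, 1.0],
    "actress": [-1.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def counterpart(word, source="he", target="she", vectors=VECTORS):
    """Shift `word` by (target - source) and return the nearest other word."""
    offset = [t - s for s, t in zip(vectors[source], vectors[target])]
    shifted = [w + o for w, o in zip(vectors[word], offset)]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[w], shifted))


print(counterpart("king"))                              # nearest word after shifting toward "she"
print(counterpart("queen", source="she", target="he"))  # and back the other way
```

With a real model the same function would run over tens of thousands of words, and the interesting cases are the ones where the nearest neighbor is not an obvious dictionary pair.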

My last post provided a general introduction to the new word embedding models of language (WEMs), and introduced an R package for easily performing basic operations on them. It was geared mostly towards people in the Digital Humanities community. This post looks more closely at a single word2vec model I’ve trained, on about 14 million reviews of faculty members from ratemyprofessors.com.

To be precise: it is a 500-dimensional skip-gram model with a window of about 12, trained on lowercased, punctuation-free text using the original word2vec C code. I’ve then heavily culled the vocabulary to remove words that usually appear uppercased, on the assumption that they are proper nouns.

The point of this one is to provide a more concrete exploration of how these models can help us think about gendered language. I hope it will be interesting even to people who aren’t interested in training a machine learning model themselves; there’s code in here, but it’s freely skippable.

Recent advances in vector-space representations of vocabularies have created an extremely interesting set of opportunities for digital humanists. These models, known collectively as word embedding models, may hold nearly as many possibilities for digital humanists modeling texts as do topic models. Yet although they’re gaining some headway, they remain far less used than other methods (such as modeling a text as a network of words based on co-occurrence) that have considerably less flexibility. “As useful as topic modeling” is a large claim, given that topic models are used so widely. DHers use topic models because it seems at least possible that each individual topic can offer a useful operationalization of some basic and real element of humanities vocabulary: topics (Blei), themes (Jockers), or discourses (Underwood/Rhody).

Or, more tongue in cheek, trade routes (Schmidt)

The word embedding models offer something slightly more abstract, but equally compelling: a spatial analogy to relationships between words. WEMs (to make up for this post a blanket abbreviation for the two major methods) take an entire corpus, and try to encode the various relations between words into a spatial analogue.

The convoluted language is because there are two major methods, and no single algorithm that unites them. Word2vec uses neural networks, while the GloVe method optimizes a function across a word-word matrix. The differences in methods between them aren’t worth going into in an introduction. Suffice it to say that Word2vec was first, GloVe is more clearly theorized, but they have various tradeoffs in performance and efficacy in building a model. My general take on the literature so far is that whatever differences there are in quality of the final models tend to be swamped by the differences set by choices of hyperparameters.

Bookworm D3 layouts Oct 19 2015

There’s no full description of the D3 bookworm package yet, because it’s still something of a moving target.

But Abby Mullen wanted to know what the different possibilities were for charts through the API, so I thought it was time to give a quick tour.

Core chart types

Bookworm 0.4 is now released on github. It contains a number of improvements to the code from over the summer. Based on the experience of the many people using it so far, it makes the existing code much, much more sensible for anyone wanting to build a bookworm on their own collections of texts. All the stages (installation, configuration, and testing) are now a lot easier. So if you have a collection of texts you wish to explore, I welcome you to test it out. (I’ll explain at more length later, but for the absolute lowest investment of time you can just run a prebuilt bookworm virtual machine using vagrant.)

This post is just kind of playing around in code, rather than any particular argument. It shows the outlines of using the features stored in a Bookworm for all sorts of machine learning, by testing how well a logistic regression classifier can predict IMDB genre based on the subtitles of television episodes.
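For readers who want a feel for the technique, here is a minimal, self-contained sketch of a logistic regression classifier over bag-of-words counts. It does not query a Bookworm: the four "episodes", the two genre labels, and the training settings are all invented for illustration, and a real run would use an established library and far more data.

```python
import math
from collections import Counter

# Invented toy data: (subtitle text, genre index). Not real subtitles.
GENRES = ["comedy", "crime"]
EPISODES = [
    ("detective murder suspect police", 1),
    ("police murder evidence detective", 1),
    ("wedding laugh party friends", 0),
    ("friends party joke laugh", 0),
]

VOCAB = sorted({word for text, _ in EPISODES for word in text.split()})


def featurize(text):
    """Bag-of-words counts over the training vocabulary; unknown words are ignored."""
    counts = Counter(text.split())
    return [counts[word] for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=200, lr=0.5):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


WEIGHTS, BIAS = train([featurize(t) for t, _ in EPISODES],
                      [label for _, label in EPISODES])


def predict(text):
    """Return the more probable genre name for a piece of text."""
    x = featurize(text)
    p = sigmoid(BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x)))
    return GENRES[int(p >= 0.5)]


print(predict("murder police"), predict("party joke"))
```

The fitted weights are themselves worth reading: words with large positive or negative weights are the ones the classifier leans on to tell the genres apart.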

Movie Geographies Jul 01 2015

I just saw Matt Wilkens’ talk at the Digital Humanities conference on places mentioned in books; I wanted to put up, mostly for him, a quick stab at some of the raw data from running the equivalent queries on my movie bookworm.

This is a quick post to share some ideas for interacting with the data underlying the recent article by Ted Underwood and Jordan Sellers on the pace of change in literary standards for poetry.

Story Time. May 10 2015

Here are some interactives I’ve made in preparation for my talk at the Literary Lab at Stanford on Tuesday on plot arcs in television shows based on underlying language.

This is sort of in lieu of a handout for the talk, so some elements may not make much sense if you aren’t in the room.

The Usenet Archive May 07 2015

Even if you think you don’t know Usenet, you probably do. It’s the Cambrian explosion of the modern Internet, among the first places that an online culture emerged, but modern enough that it can seamlessly blend into the contemporary web. (I was recently trying to work out through Google where I might buy a clavichord in Boston; my hopes were briefly raised about one particular seller until I realized that the modern-looking Google Groups page I was reading was actually a presentation of a discussion from the Usenet archives in 1992.)

Just a day after launching this blog (RSS feed, by the way, is now up here) I came across a perfect little example question to look at. The Guardian ran an article about appearance on teaching evaluations that touches on some issues that my Rate My Professor Bookworm can answer, with a few new interactive charts.

Though more and more outside groups are starting to adopt Bookworm for their own projects, I haven’t yet written quite as much as I’d like about how it should work. This blog is an attempt to rectify that, and begin to explain how a combination of blogging software, interactive textual visualizations, and an exploratory data analysis API for bag-of-words models can make it possible to quickly and usefully share texts through a Bookworm installation.