You are looking at archived content from my "Bookworm" blog, an experiment that ran from 2014-2016. Not all content may work. For current posts, see here.

Posts with tag R

Back to all posts


My post on ‘rejecting the gender binary’ showed a way to use word2vec models (or, in theory, any word embedding model) to find paired gendered words–that is, words that mean the same thing except that one is used in the context of women, and the other in the context of men.

My last post provided a general introduction to the new word embedding of language (WEMs), and introduced an R package for easily performing basic operations on them. It was geared mostly towards people in the Digital Humanities community. This post looks more closely at a single word2vec model I’ve trained, on about 14 million reviews of faculty members from,

To be precise: it is a 500-dimensional skip-gram model with window of about 12 on lowercased, punctuation-free text using the original word2vec C code. I’ve then heavily culled the vocabulary to remove words that usually appear uppercased, on the assumption that they are proper nouns.

The point of this one is to provide a more concrete exploration of how these models can help us think about gendered language. I hope it will be interesting even to people who aren’t interesting in training a machine learning model themselves; there’s code in here, but it’s freely skippable.

Recent advances in vector-space representations of vocabularies have created an extremely interesting set of opportunities for digital humanists. These models, known collectively as word embedding models, may hold nearly as many possibilities for digital humanitists modeling texts as do topic models. Yet although they’re gaining some headway, they remain far less used than other methods (such as modeling a text as a network of words based on co-occurrence) that have considerably less flexibility. “As useful as topic modeling” is a large claim, given that topic models are used so widely. DHers use topic models because it seems at least possible that each individual topic can offer a useful operationalization of some basic and real element of humanities vocabulary: topics (Blei), themes (Jockers), or discourses (Underwood/Rhody).

Or, more tongue in cheek, trade routes (Schmidt)

The word embedding models offer something slightly more abstract, but equally compelling: a spatial analogy to relationships between words. WEMs (to make up for this post a blanket abbreviation for the two major methods)

The convoluted language is because there are two major methods, and no a single algorithm that unites the two most important methods. Word2vec uses neural networks, while the GloVe method works maximizes a function across a word-word matrix. The differences in methods between them aren’t worth going to into in an introduction. Suffice it to say that Word2vec was first, GloVe is more clearly theorized, but they have various tradeoffs in performance and efficacy in building a model. My general take on the literature so far is that whatever differences there are in quality of the final models tend to be swamped by the differences set by choices of hyperparameters.

take an entire corpus, and try to encode the various relations between word into a spatial analogue.