


2015-11-19

My post on ‘rejecting the gender binary’ showed a way to use word2vec models (or, in theory, any word embedding model) to find paired gendered words: that is, words that mean the same thing except that one is used in the context of women and the other in the context of men.
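
To make that pairing mechanism concrete, here is a minimal base-R sketch of the vector-offset arithmetic behind it. The numbers are invented toy vectors, not values from any trained model; in a real model the same arithmetic runs over a few hundred dimensions and a full vocabulary.

```r
# Toy word vectors (invented numbers standing in for a trained model).
vectors <- rbind(
  he      = c( 1.0, 0.1, 0.2, 0.0),
  she     = c(-1.0, 0.1, 0.2, 0.0),
  actor   = c( 0.9, 0.8, 0.1, 0.3),
  actress = c(-0.9, 0.8, 0.1, 0.3),
  lecture = c( 0.0, 0.2, 0.9, 0.5)
)

# Cosine similarity: the standard closeness measure in these spaces.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Shift "actor" along the she-minus-he direction, then ask which word
# sits closest to the shifted point.
target <- vectors["actor", ] + (vectors["she", ] - vectors["he", ])
sims <- apply(vectors, 1, cosine, b = target)
sort(sims, decreasing = TRUE)  # "actress" should come out on top
```

In a real model you would exclude the query word itself and scan the whole vocabulary, but the arithmetic is exactly this.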

My last post provided a general introduction to the new word embedding models of language (WEMs), and introduced an R package for easily performing basic operations on them. It was geared mostly towards people in the Digital Humanities community. This post looks more closely at a single word2vec model I’ve trained on about 14 million reviews of faculty members from ratemyprofessors.com.

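For readers who want to train something similar themselves: the R package from the last post (wordVectors) wraps word2vec training. From memory, the call looks roughly like the sketch below; exact argument names may differ across versions, and "reviews.txt" is a placeholder for whatever tokenized, one-document-per-line text file you have.

```r
# A rough sketch of training with the wordVectors package. Argument names
# are from memory and may vary by version; file names are placeholders.
library(wordVectors)

model <- train_word2vec(
  "reviews.txt",                        # plain-text training corpus
  output_file = "reviews_vectors.bin",  # where the binary model lands
  vectors = 200,                        # embedding dimensions
  window  = 12,                         # context window size
  threads = 4
)

# Or load a previously trained binary model:
# model <- read.vectors("reviews_vectors.bin")
```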

The point of this post is to provide a more concrete exploration of how these models can help us think about gendered language. I hope it will be interesting even to people who aren’t interested in training a machine learning model themselves; there’s code in here, but it’s freely skippable.

Recent advances in vector-space representations of vocabularies have created an extremely interesting set of opportunities for digital humanists. These models, known collectively as word embedding models, may hold nearly as many possibilities for digital humanists modeling texts as topic models do. Yet although they’re making some headway, they remain far less used than other methods (such as modeling a text as a network of words based on co-occurrence) that have considerably less flexibility. “As useful as topic modeling” is a large claim, given that topic models are used so widely. DHers use topic models because it seems at least possible that each individual topic can offer a useful operationalization of some basic and real element of humanities vocabulary: topics (Blei), themes (Jockers), or discourses (Underwood/Rhody).


The word embedding models offer something slightly more abstract, but equally compelling: a spatial analogy to relationships between words. WEMs (a blanket abbreviation I’ll coin for this post to cover the two major methods) take an entire corpus and try to encode the various relations between words into a spatial analogue.
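
A tiny illustration of what “spatial analogue” means in practice: relations between words show up as consistent directions in the space. The three-dimensional vectors below are invented toys, but in a trained model the well-known capital-of offsets behave much the same way across hundreds of dimensions.

```r
# Toy three-dimensional vectors (invented numbers) showing a relation
# encoded as a direction in the space.
toy <- rbind(
  france = c(0.2, 0.9, 0.1),
  paris  = c(0.2, 0.9, 0.8),
  italy  = c(0.7, 0.4, 0.1),
  rome   = c(0.7, 0.4, 0.8)
)

# The "capital-of" relation appears as a near-constant offset:
toy["paris", ] - toy["france", ]
toy["rome", ]  - toy["italy", ]

# So an analogy can be completed by arithmetic:
# rome is roughly paris - france + italy
toy["paris", ] - toy["france", ] + toy["italy", ]
```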