
Compare and Contrast

Nov 14 2011

I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I've been making over the last month. On the surface, I think, this comes across as less interesting than some other possible topics, so I want to explain why I think it matters now. This is not quite my long-promised topic-modeling post, but it's getting closer.

Off the top of my head, I think there are roughly three things that computers may let us do with text so much faster than was previously possible as to qualitatively change research.

  1. Find texts that use words, phrases, or names we're interested in.

  2. Compare individual texts or groups of texts against each other.

  3. Classify and cluster texts or words. (Where classifying is assigning texts to predefined groups like "US History," and clustering is letting the affinities emerge only from the works themselves; see the sketch after this list.)
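To make the distinction concrete, here is a minimal sketch of the two operations, assuming scikit-learn; the texts, labels, and category names are invented for illustration:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["the tariff and the national bank", "carolingian royal courts",
             "free soil and free labor", "merovingian statecraft and estates"]
    labels = ["US History", "Medieval", "US History", "Medieval"]

    vec = CountVectorizer()
    X = vec.fit_transform(texts)

    # Classifying: fit a model to the predefined groups, then assign new texts.
    clf = MultinomialNB().fit(X, labels)
    print(clf.predict(vec.transform(["the whig bank bill"])))

    # Clustering: no labels at all; affinities among the texts define the groups.
    print(KMeans(n_clusters=2, n_init=10).fit_predict(X))

The point of the contrast is just that classification requires us to supply the categories up front, while clustering asks the corpus to propose them.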

These aren't, to be sure, completely different. I've argued before that in some cases, full-text search is best thought of as a way to create a new classification scheme and populate it with books. (Anytime I get fewer than 15 results for a historical subject in a ProQuest newspapers search, I read all of them; the ranking inside them isn't very important.) Clustering algorithms are built around models of cross-group comparisons; full-text searches often offer faceted group comparisons. And so on.

But as ideal types, these are different, and they are in very different places in the digital humanities right now. Everybody knows about number 1; I think there's little doubt that it continues to be the most important tool for most researchers, and rightly so. (It wasn't, so far as I know, helped along the way by digital humanists at all.) More recently, there's a lot of attention to 3. Scott Weingart has a good summary/literature review on topic modeling and network analysis this week. I think his synopsis that they're "powerful, widely applicable, easy to use, and difficult to understand: a dangerous combination" gets it just right, although I wish he'd bring the hammer down harder on the danger part. I've read a fair amount about topic models, implemented a few on text collections I've built, and I certainly see the appeal: but not necessarily the embrace. I've also done some work with classification.

In any case: I'm worried that in the excitement about clustering, we're not giving enough thought to the element in between: comparisons. It's not as exciting a field as topic modeling or clustering: it doesn't produce much by way of interesting visualizations, and there isn't the same density of research in computer science for humanists to piggyback on. At the same time, it's not nearly so mature a technology as search. There are a few production-quality applications that include some forms of comparison (WordHoard uses Dunning log-likelihood; I can only find relative ratios on the TAPoR page). But there isn't the widespread adoption or the generally accepted methodology that search enjoys, or anything else like that.

This is a problem, because cross-textual comparison is one of the basic competencies of the humanities, and it's one that computers ought to be able to help with. While we do talk historically about clusters and networks and spheres of discourse, I think comparisons are also closer to most traditional work; there's nothing quite so classically historiographical as tracing out the similarities and differences between Democratic and Whig campaign literature, Merovingian and Carolingian statecraft, or 1960s and 1980s defenses of American capitalism. These are just what we teach in history; in fact, I felt like I was coming up with exam or essay questions while writing that last sentence.

So why isn't this a more vibrant area? (Admittedly, one reason might be: it is, and I just haven't done my research. In that case, I'd love to hear what I'm missing.)

I think the biggest reason for this is probably legal-technical, and it is getting solved. A site like JSTOR (or Bookworm, for that matter) can set up full-text search much more easily than it can cross-corpus comparisons; one takes tenths of a second, the other can take minutes. Minutes isn't very much, of course, and if it worked, plenty of humanists would be happy to let their laptops plug away at the problem; but restrictions on downloading texts make that impossible. Add into the mix all the completely undigitized texts we'd want to include in many comparisons, and there are only a few cases where it's possible at all. Topic modeling and search both work much better with a research model in which one centralized server provides on-demand service to lots of people who don't necessarily understand what's going on behind the scenes.

Another reason is algorithmic. To put it bluntly, Dunning log-likelihood doesn't work very well: not only does it over-represent common words, it also finds spurious differences driven by just one or two texts. Ted Underwood has been exploring some aspects of the Mann-Whitney test; but it, too, has its share of flaws, and in some cases it can be much more difficult or inappropriate to implement. TF-IDF, built for comparing a part against a whole, translates awkwardly to comparing two parts against each other. I've started a few posts on these questions, and hopefully they'll see the light of day. But in general, I get the impression that there isn't a really good all-around corpus-comparison method that any scholar could apply to their own questions.
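For concreteness, here is roughly what a Dunning comparison involves, using the common two-cell corpus-linguistics shortcut for the G2 statistic; this is a minimal sketch in Python, not WordHoard's implementation, and the corpora here are invented:

    import math
    from collections import Counter

    def dunning_g2(count_a, count_b, total_a, total_b):
        """Dunning's log-likelihood (G2) for one word across corpora A and B."""
        # Expected counts if the word occurred at the same rate in both corpora.
        expected_a = total_a * (count_a + count_b) / (total_a + total_b)
        expected_b = total_b * (count_a + count_b) / (total_a + total_b)
        g2 = 0.0
        if count_a:
            g2 += count_a * math.log(count_a / expected_a)
        if count_b:
            g2 += count_b * math.log(count_b / expected_b)
        return 2 * g2

    # Invented corpora: flat lists of tokens standing in for two groups of texts.
    whig = Counter("tariff bank internal improvement bank union".split())
    democrat = Counter("liberty states rights liberty tariff".split())
    total_w, total_d = sum(whig.values()), sum(democrat.values())

    scores = {w: dunning_g2(whig[w], democrat[w], total_w, total_d)
              for w in set(whig) | set(democrat)}
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(word, round(score, 2))

Both complaints fall straight out of the formula: raw counts enter directly, so very common words pile up large scores, and a word concentrated in one or two texts looks the same as a word spread evenly across a corpus, since the texts are pooled before counting.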

I also suspect that there are some cultural-psychological reasons. One of the things that's so appealing about the topic models and the networks is that they alleviate the feeling of being overwhelmed by unstructured information. (Or by the print age, or whatever.) Topic models and network graphs put the world in order, which is very reassuring; they also create things that are very cool-looking, which is very (too?) important in the web ecosystem where DH lives. (This site gets a lot of Google image-search traffic pointed at some cluster charts I made in the past, which are certainly among the least clear charts I've ever posted; there's something about untangling that sort of puzzle that people find very rewarding.)

Comparisons just create word lists, and those aren't as rewarding as topic models: you just get a list of differences to sift through. And they don't deal with the whole corpus in the same way; you are much more restricted in what you work with. I think there's a bit of a tendency to think that as long as we're using computers to read texts, we might as well read all of the ones in good enough shape to work with, and that moving down to smaller sets for comparisons doesn't tell us much.

I don't see any of these reasons as basically good ones. The difficulty of getting digital texts to work with, or text curators who allow multifaceted access, is the biggest problem facing digital textual analysis. The lack of good algorithms is just more evidence that this is our problem: that humanists need to develop expertise and a feel for the data here themselves. And there's certainly no one who defends eye candy for its own sake; they (and I) would only point to its usefulness for asking more interesting questions. But comparisons should let us do that too.

Comments:

Scott Weingart - Nov 1, 2011

Ben - great post, and you're right to call me out on not focusing on the dangers of tools that are easy to use and difficult to understand. I think that was one of those threads I planned on picking back up at the end but, instead, entirely forgot.

My knee-jerk reaction to this post was to say: come now, this doesn't yet exist because the first and third are *easy*, whereas genuine comparisons are things people can still do way better than computers. You've got to take into account pragmatics and semantics, while the others work just fine on syntax.

But of course, both more-explored problems just use syntax because the other avenues aren't yet available; they both would certainly benefit from the addition of semantic and pragmatic awareness. Topic modeling certainly tries to. In either case, it's not a good excuse - if people don't build the tools for us, we ought to be able to build 'em ourselves.

One of the issues is precisely that comparisons *are* closer to our traditional work - and as such, it's a lot easier to tell when the algorithms are insufficient.

When Google popped up, librarians didn't (and still don't) consider it an IR solution, because this is something they're deeply familiar with, and they realize there are still difficulties to be tackled. Similarly, we humanists will probably take a good deal more convincing with regard to a comparison tool than any other group might.

Ben - Nov 1, 2011

Scott,

That's a great point: another way to put it is that comparison is something scholars hardly ever relegate to others, while there's a long history of farming search out to undergrads and classification to librarians and archivists.

I'm not sure that classification is actually easier than comparison; a lot of my reactions to topic modeling have been similar to your hypothetical humanist's reaction to a comparison. ("Come on, I don't want a category of British vs. American spellings," and so forth.) Certainly the algorithms are much more complicated: LDA stretches my comprehension, but it's easy for me to implement Dunning or Mann-Whitney.
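As a sense of scale for that last claim, a Mann-Whitney comparison really is only a few lines; this is a minimal sketch, assuming scipy, with invented per-text relative frequencies for a single word in two corpora:

    from scipy.stats import mannwhitneyu

    # Invented data: the relative frequency of one word ("liberty") in each
    # text of two corpora.
    whig_freqs = [0.0012, 0.0009, 0.0000, 0.0015, 0.0011]
    democratic_freqs = [0.0031, 0.0042, 0.0038, 0.0007, 0.0029]

    # The U test works on ranks, so one outlier text cannot dominate the
    # result the way it can in a pooled-count measure like Dunning's.
    stat, p = mannwhitneyu(whig_freqs, democratic_freqs, alternative="two-sided")
    print(stat, p)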

But certainly good comparison may be a lot harder than good classification or search. That's an important point I hadn't thought about.

The thing I find interesting is that we have algorithms that do a good job on unigram searches and on classification, and we don't for comparison. I was talking to a non-digital humanist recently who was fascinated by Amazon's "statistically improbable phrases."

Ted Underwood - Feb 2, 2012

Interesting discussion, which I overlooked in November when it happened. I think all of what you and Scott say is right. Comparison is hard to do well, and there are also some thorny questions about what exactly the goal should be: questions that you exposed well in unpacking Dunning's algorithm. There are a range of different ways we could weight a comparison, and there may be no single right way to do it; that's going to make things very sticky.

The same thing may turn out to be true about topic modeling, incidentally. But right now only a few humanists (mainly Travis Brown, I think!) have looked much beyond generic LDA, so people assume that topic modeling = LDA, which makes matters seem more straightforward than they are.