You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

I periodically write about Google Books here, so I thought I’d point out something that I’ve noticed recently that should be concerning to anyone accustomed to treating it as the largest collection of books: it appears that when you use a year constraint on book search, the search index has dramatically constricted to the point of being, essentially, broken.

I did a slightly deeper dive into data about salaries by college major while working on my new Atlantic article on the humanities crisis. As I say there, the quality of data about salaries by college major has improved dramatically in the last 8 years. I linked to others’ analysis of the ACS data rather than run my own, but I did some preliminary exploration of the salary data that may be useful to see.

NOTE 8/23: I’ve written a more thoughtful version of this argument for the Atlantic. They’re not the same, but if you only read one piece, you should read that one.

Back in 2013, I wrote a few blog posts arguing that the media was hyperventilating about a “crisis” in the humanities, when, in fact, the long-term trends were not especially alarming. I made two claims then: 1) The biggest drop in humanities degrees relative to other degrees in the last 50 years happened between 1970 and 1985, and the share was steady from 1985 to 2011; as a proportion of the population, humanities majors exploded. 2) The entirety of the long-term decline from 1950 to 2010 had to do with the changing majors of women, while men’s humanities interest did not change.

Historians generally acknowledge that both undergraduate and graduate methods training need to teach students how to navigate and understand online searches. See, for example, this recent article in Perspectives.  Google Books is the most important online resource for full-text search; we should have some idea what’s in it.

Matthew Lincoln recently put up a Twitter bot that walks through chains of historical artwork by vector space similarity. https://twitter.com/matthewdlincoln/status/1003690836150792192.
The idea comes from a Google project looking at paths that traverse similar paintings.

This is a blog post I’ve had sitting around in some form for a few years; I wanted to post it today because:

1) It’s about peer review, and it’s peer review week! I just read this nice piece by Ken Wissoker in its defense.
2) There’s a conference on argumentation in Digital History this weekend at George Mason which I couldn’t attend for family reasons but wanted to resonate with at a distance. 

Digging through old census data, I realized that Wikipedia has some really amazing town-level historical population data, particularly for the Northeast, thanks to one editor in particular typing up old census reports by hand. (And also for French communes, but that’s neither here nor there.) I’m working on pulling it into shape for the whole country, but this is the most interesting part.

I’ve been doing a lot of reading about population density cartography recently. With election-map cartography remaining a major issue, there’s been lots of discussion of such maps: and the “Joy Plot” is currently getting lots of attention.

Robert Leonard has an op-ed in the Times today that includes the following anecdote:

The Library of Congress has released MARC records that I’ll be doing more with over the next several months to understand the books and their classifications. As a first stab, though, I wanted to simply look at the history of how the Library digitized card catalogs to begin with.

One of the interesting things about contemporary data visualization is that the field has a deep sense of its own history, but that “professional” historians haven’t paid a great deal of attention to it yet. That’s changing. I attended a conference at Columbia last weekend about the history of data visualization and data visualization as history. One of the most important strands that emerged was about the cultural conditions necessary to read data visualization. Dancing around many mentions of the canonical figures in the history of datavis (Playfair, Tukey, Tufte) were questions about the underlying cognitive apparatus with which humans absorb data visualization. What makes the designers of visualizations think that some forms of data visualization are better than others? Does that change?

I want to post a quick methodological note on diachronic (and other forms of comparative) word2vec models.

This is a really interesting field right now. Hamilton et al have a nice paper that shows how to track changes using procrustean transformations: as the grad students in my DH class will tell you with some dismay, the web site is all humanists really need to get the gist.
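
If you want the gist of the procrustean step itself in code rather than prose, here’s a minimal R sketch, assuming two hypothetical matrices `early` and `late` whose rows are the vectors for a shared vocabulary. This is the general idea, not Hamilton et al.’s exact code.

```r
# Orthogonal Procrustes: find the rotation Q that best maps the early space
# onto the late one, so the two models become directly comparable.
align_embeddings <- function(early, late) {
  s <- svd(t(early) %*% late)
  Q <- s$u %*% t(s$v)        # the optimal rotation
  early %*% Q                # early vectors, rotated into the late space
}

# Semantic change for a word is then just the cosine distance between its
# rotated early vector and its late vector.
cosine_distance <- function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))
```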

This is a quick digital-humanities public service post with a few sketchy questions about OCR as performed by Google.

When I started working intentionally with computational texts in 2010 or so, I spent a while worrying about the various ways that OCR–optical character recognition–could fail.

Like everyone else, I’ve been churning over the election results all month. Setting aside the important stuff, understanding election results temporally presents an interesting challenge for visualization.

Geographical realignments are common in American history, but they’re difficult to get an aggregate handle on. You can animate a map, but that makes comparison through time difficult. (One with snappy music is here). You can make a bunch of small multiple maps for every given election, but that makes it quite hard to compare a state to itself across periods. You can make a heatmap, but there’s no ability to look regionally if states are in alphabetical order.

I’m pulling this discussion out of the comments thread on Scott Enderle’s blog, because it’s fun. This is the formal statement of what will forever be known as the efficient plot hypothesis for plot arceology. Nobel Prize in culturomics, here I come.

Word embedding models are kicking up some interesting debates at the confluence of ethics, semantics, computer science, and structuralism. Here I want to lay out some of the elements of one recent place where that debate has been unfolding: inside computer science.

I’ve been chewing on this paper out of Princeton and Bath on bias and word embedding algorithms. (Link is to a blog post description that includes the draft). It stands in an interesting relation to this paper out of BU and Microsoft Research, which presents many similar findings but also a debiasing algorithm similar to (but better than) the one I’d used to find “gendered synonyms” in a gender-neutralized model. (I’ve since gotten a chance to talk in person to the second team, so I’m reflecting primarily on the first paper here).
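
The core operation both papers build on is easy to sketch. Here’s a minimal, hypothetical R version of the simplest move: estimate a “gender direction” from a single seed pair and project it out of every word vector. Neither paper’s actual algorithm is this crude; this is just the shape of the thing.

```r
# `vecs` is a hypothetical matrix of word vectors, one row per word.
debias <- function(vecs) {
  g <- vecs["he", ] - vecs["she", ]        # a crude one-pair estimate of the direction
  g <- g / sqrt(sum(g^2))                  # normalize to unit length
  vecs - (vecs %*% g) %*% t(g)             # remove each word's component along g
}
```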

Debates in the Digital Humanities 2016 is now online, and includes my contribution, “Do Digital Humanists Need to Understand Algorithms?” (As well as a pretty snazzy cover image…) In it I lay out a distinction between transformations, which are about states of texts, and algorithms, which are about processes. Put briefly:

Some scientists came up with a list of the 6 core story types. On the surface, this is extremely similar to Matt Jockers’s work from last year. Like Jockers, they use a method for disentangling plots that is based on sentiment analysis, justify it mostly with reference to Kurt Vonnegut, and choose a method for extracting ur-shapes that naturally but opaquely produces harmonic-shaped curves. (Jockers uses the Fourier transform; the authors here use SVD.) I started writing up some thoughts on this two weeks ago, stopped, and then got a media inquiry about the paper, so thought I’d post my concerns here. These ramp up from the basic but important (only about 40% of the texts they are using are actually fictional stories) to the big one that ties back into Jockers’s original work: why use sentiment analysis at all? This leads back into a sort of defense of my method of topic trajectories for describing plots and some bigger requests for others working in the field.
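
For concreteness, here’s roughly what the SVD move amounts to, in a minimal R sketch: stack each story’s sentiment trajectory (interpolated to a common length) into a row of a hypothetical matrix, and let the decomposition hand back a few basis shapes. Smooth inputs plus an orthogonal decomposition is exactly the recipe that tends to produce harmonic-looking “fundamental” curves.

```r
extract_arcs <- function(sentiment_matrix, k = 6) {
  centered <- scale(sentiment_matrix, center = TRUE, scale = FALSE)
  s <- svd(centered)
  s$v[, 1:k]    # the candidate "core story shapes"
}

# matplot(extract_arcs(sentiment_matrix), type = "l")  # eyeball the six shapes
```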

I usually keep my mouth shut in the face of the many hilarious errors that crop up in the burgeoning world of datasets for cultural analytics, but this one is too good to pass up. Nature has just published a dataset description paper that appears to devote several paragraphs to describing “center of population” calculations made on the basis of a flat earth.
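
For anyone wondering what the flat-earth version gets wrong: a weighted average of raw latitudes and longitudes treats the coordinates as if they lay on a plane. Here’s a sketch of both calculations, on hypothetical vectors of coordinates (in degrees) and population weights; the spherical one goes through 3-D space and back.

```r
flat_earth_center <- function(lat, lon, pop) {
  c(lat = weighted.mean(lat, pop), lon = weighted.mean(lon, pop))
}

spherical_center <- function(lat, lon, pop) {
  rad <- pi / 180
  # Convert to 3-D Cartesian coordinates, take the weighted mean, project back.
  x <- weighted.mean(cos(lat * rad) * cos(lon * rad), pop)
  y <- weighted.mean(cos(lat * rad) * sin(lon * rad), pop)
  z <- weighted.mean(sin(lat * rad), pop)
  c(lat = atan2(z, sqrt(x^2 + y^2)) / rad, lon = atan2(y, x) / rad)
}
```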

I started this post with a few digital-humanities posturing paragraphs: if you want to read them, you’ll encounter them eventually. But instead let me just get to the point: here’s a trite new category of analysis that wouldn’t be possible without distant reading techniques, and that produces sometimes charmingly serendipitous results.

A heads-up for those with this blog on their RSS feeds: I’ve just posted a couple things of potential interest on one of the two other blogs (errm) I’m running on my own site.

One, “Vector Space Models for the digital humanities,” describes how a newly improved class of algorithms known as word embedding models work and showcases some of their potential applications for digital humanities researchers.

Mitch Fraas and I have put together a two-part interactive for the Atlantic using Bookworm as a backend to look at the changing language in the State of the Union. Yoni Appelbaum, who just took over this week, spearheaded a great team over there including Chris Barna, Libby Bawcombe, Noah Gordon, Betsy Ebersole, and Jennie Rothenberg Gritz, who took some of the Bookworm prototypes and built them into a navigable, attractive overall package. Thanks to everyone.

Far and away the most interesting idea of the new government college ratings emerges toward the end of the report. It doesn’t quite square the circle of competing constituencies for the rankings I worried about in my last post, but it gets close. Lots of weight is placed on a single magic model that will predict outcomes regardless of all the confounding factors they raise (differing pay by gender, sex, possibly even degree composition). As an inveterate modeler and data hound, I can see the appeal here. The federal government has far better data than US News and World Report, in the guise of the student loan repayment forms; this data will enable all sorts of useful studies on the effects of everything from home-schooling to early-marriage. I don’t know that anyone is using it yet for the sort of studies it makes possible (do you?), but it sounds like they’re opening the vault just for these college ranking purposes.

Before the holiday, the Department of Education circulated a draft prospectus of the new college rankings they hope to release next year. That afternoon, I wrote a somewhat dyspeptic post on the way that these rankings, like all rankings, will inevitably be gamed. But it’s probably better to set that aside and instead point out a couple of looming problems with the system we may be working under soon. The first is that the audience for these rankings is unresolved in a very problematic way; the second is that altogether too much weight is placed on a regression model solving every objection that has been raised. Finally, I’ll lay out my “constructive” solution for salvaging something out of this: rather than use a three-tiered “excellent”-“adequate”-“needs improvement” scale, everyone would be better served if we switched to a two-tiered “Good”/“Needs Improvement” system. Since this is sort of long, I’ll break it up into three posts: the first is below.

Sometimes it takes time to make a data visualization, and sometimes they just fall out of the data practically by accident. Probably the most viewed thing I’ve ever made, of shipping lines as spaghetti strings, is one of the latter. I’m working to build one of the former for my talk at the American Historical Association out of the Newberry Library’s remarkable Atlas of Historical County Boundaries. But my second ggplot with the set, which I originally did just to make sure the shapefiles were working, was actually interesting. So I thought I’d post it. Here’s the graphic: then the explanation. Click to enlarge.

Note: a somewhat more complete and slightly less colloquial, but eminently more citeable, version of this work is in the Proceedings of the 2015 IEEE International Conference on Big Data. Plus, it was only there that I came around to calling the whole endeavor “plot arceology.”
It’s interesting to look, as I did in my last post, at the plot structure of typical episodes of a TV show as derived through topic models. But while it may help in understanding individual TV shows, the method also shows some promise on a more ambitious goal: understanding the general structural elements that most TV shows and movies draw from. TV and movie scripts are carefully crafted structures: I wrote earlier about how the Simpsons moves away from the school after its first few minutes, for example, and with this larger corpus even individual words frequently show a strong bias towards the front or end of scripts. This crafting shows up in the ways language is distributed through them in time.

The most interesting element of the Bookworm browser for movies I wrote about in my last post here is the possibility of delving into the episodic structure of different TV shows by dividing them up by minutes. On my website, I previously wrote about story structures in the Simpsons and a topic model of movies I made using the general-purpose bookworm topic modeling extension. For a description of the corpus or of topic modeling, see those links.

Screen time! Sep 15 2014

Here’s a very fun, and for some purposes, perhaps, a very useful thing: a Bookworm browser that lets you investigate onscreen language in about 87,000 movies and TV shows, encompassing together over 600 million words. (Go follow that link if you want to investigate yourself).

An FYI, mostly for people following this feed on RSS: I just put up on my home web site a post about applications for the Simpsons Bookworm browser I made. It touches on a bunch of stuff that would usually lead me to post it here. (Really, it hits the Sapping Attention trifecta: a discussion of the best ways of visualizing Dunning Log-Likelihood, cryptic allusions to critical theory, and overly serious discussions of popular TV shows.) But it’s even less proofread and edited than what I usually put here, and I’ve lately been more and more reluctant to post things on a Google site like this, particularly as Blogger gets folded more and more into Google Plus. That’s one of the big reasons I don’t post here as much as I used to, honestly. (Another is that I don’t want to worry about embedded javascript). So, head over there if you want to read it.

Right now people in data visualization tend to be interested in their field’s history, and people in digital humanities tend to be fascinated by data visualization. Doing some research in the National Archives in Washington this summer, I came across an early set of rules for graphic presentation by the Bureau of the Census from February 1915. Given those interests, I thought I’d put that list online.

People love to talk about how “practical” different college majors are: and practicality is usually measured in dollars. But those measurements can be very problematic, in ways that might have bad implications for higher education. That’s what this post is about.

Here’s a little irony I’ve been meaning to post. Large scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a “book” is in ways that make the selection criteria for Ngrams—if it made it into print, it must have _some_ significance—completely meaningless.

A map I put up a year and a half ago went viral this winter; it shows the paths taken by ships in the US Maury collection of the ICOADS database. I’ve had several requests for higher-quality versions: I had some up already, but I just put up on Flickr a basically comparable high resolution version. US Maury is “Deck 701” in the ICOADS collection: I also put up charts for all of the other decks with fewer than 3,000,000 points. You can page through them below, or download the high quality versions from Flickr directly. (At the time of posting, you have to click on the three dots to get through to the summaries).

OK: one last post about enrollments, since the statistic that humanities degrees have dropped by half since 1970 has been all over the news for the last two weeks. This is going to be a bit of a data dump: but there’s a shortage of data on the topic out there, so forgive me.

In my last two posts, I made two claims about that aspect of the humanities “crisis:”

A quick addendum to my post on long-term enrollment trends in the humanities. (This topic seems to have legs, and I have lots more numbers sitting around I find useful, but they’ve got to wait for now).

There was an article in the Wall Street Journal about low enrollments in the humanities yesterday. The heart of the story is that the humanities resemble the late Roman Empire, teetering on a collapse precipitated by their inability to get jobs like those computer scientists can provide. (Never mind that the news hook is a Harvard report about declining enrollments in the humanities, which makes pretty clear that the real problem is students who are drawn to the social sciences, not competition from computer scientists.)

What are the major turning points in history? One way to think about that is to simply look at the most frequent dates used to start or end dissertation periods.* That gives a good sense of the general shape of time.

*For a bit more about how that works, see my post on the years covered by history dissertations: I should note I’m using a better metric now that correctly gets the end year out of text strings like “1848-61.”
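
The “better metric” is mostly just careful string handling. A minimal, hypothetical sketch of the kind of fix involved:

```r
end_year <- function(title) {
  m <- regmatches(title, regexpr("[0-9]{4}-[0-9]{2,4}", title))
  if (length(m) == 0) return(NA)
  parts <- as.numeric(strsplit(m, "-")[[1]])
  start <- parts[1]; end <- parts[2]
  if (end < 100) end <- (start %/% 100) * 100 + end   # borrow the century from the start year
  end
}

end_year("Reform politics, 1848-61")   # 1861, not 61
```
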
Here’s what that list looks like: the most common years used in dissertation titles. It’s extremely spiky–some years are a lot more common than others.

Here’s some inside baseball: the trends in periodization in history dissertations since the beginning of the American historical profession. A few months ago, Rob Townsend, who until recently kept everyone extremely well informed about professional trends at the American Historical Association,* sent me the list of all dissertation titles in history the American Historical Association knows about from the last 120 years. (It’s incomplete in some interesting ways, but that’s a topic for another day). It’s textual data. But sometimes the most interesting textual data to analyze quantitatively are the numbers that show up. Using a Bookworm database, I just pulled out from the titles any years mentioned: that lets us see what periods of the past historians have been most interested in, and what sorts of periods they’ve described.

The new issue of the Journal of Digital Humanities is up as of yesterday: it includes an article of mine, “Words Alone,” on the limits of topic modeling. In true JDH practice, it draws on my two previous posts on topic modeling, here and here. If you haven’t read those, the JDH article is now the place to go. (Unless you love reading prose chock full’ve contractions and typos. Then you can stay in the originals.) If you have read them, you might want to know what’s new or why I asked the JDH editors to let me push those two articles together. In the end, the changes ended up being pretty substantial.

Patchwork Libraries Mar 29 2013

The hardest thing about studying massive quantities of digital texts is knowing just what texts you have. This is knowledge that we haven’t been particularly good at collecting, or at valuing.


The post I wrote two years ago about the Google Ngram chart for 02138 (the zip code for Cambridge, MA) remains a touchstone for me because it shows the ways that materiality, copyright, and institutional decisions can produce data artifacts that are at once inscrutable and completely understandable. (Here’s the chart–go read the original post for the explanation.)



Since then, I’ve talked a lot about the need to understand both the classification schemes for individual libraries and the complicated historical provenance of the digital sources we use.

My last post had the aggregate statistics about which parts of the library have more female characters. (Relatively). But in some ways, it’s more interesting to think about the ratio of male and female pronouns in terms of authors whom we already know. So I thought I’d look for the ratios of gendered pronouns in the most-collected authors of the late 19th and early twentieth centuries, to see what comes out.
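
The measurement itself is simple enough to sketch. Assuming a hypothetical data frame `tokens` with one row per word and columns `author` and `word`, something like this gives the share of gendered pronouns that are female, by author:

```r
pronoun_share_female <- function(tokens) {
  male   <- c("he", "him", "his", "himself")
  female <- c("she", "her", "hers", "herself")
  tapply(tokens$word, tokens$author, function(w) {
    sum(w %in% female) / (sum(w %in% female) + sum(w %in% male))
  })
}

# sort(pronoun_share_female(tokens))  # authors ordered from most male to most female pronoun use
```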

Now back to some texts for a bit. Last spring, I posted a few times about the possibilities for reading genders in large collections of books. I didn’t follow up because I have some concerns about just what to do with this sort of pronoun data. But after talking about it to Ryan Cordell’s class at Northeastern last week, I wanted to think a little bit more about the representation of male and female subjects in late-19th century texts. Further spurs: Matt Jockers recently posted the pronoun usage in his corpus of novels, and Jeana Jorgensen pointed to recent research by Kathleen Ragan that suggests that editorial and teller effects have a massive effect on the gender of protagonists in folk tales. Bookworm gives a great platform for looking at this sort of question.

I’m cross-posting here a piece from my language anachronisms blog, Prochronisms.

It won’t appear on the language blog for a week or two, to keep the posting schedule there more regular. But I wanted to put it here now, because it ties directly into the conversation in my last post about whether words are the atomic units of languages. The presumption of some physics-inflected linguistics research is that they are. I was putting forward the claim that the atomic units are actually Ngrams of any length. This question is closely tied to the definition of what a ‘word’ is (although as I said in the comments, I think statistical regularities tend to happen at a level that no one would ever call a ‘word,’ however broad a definition they take).

My last post was about how the frustrating imprecisions of language drive humanists towards using statistical aggregates instead of words: this one is about how they drive scientists to treat words as fundamental units even when their own models suggest they should be using something more abstract.

Crossroads Jan 10 2013

Just a quick post to point readers of this blog to my new Atlantic article on anachronisms in Kushner/Spielberg’s Lincoln; and to direct Atlantic readers interested in more anachronisms over to my other blog, Prochronisms, which is currently churning on through the new season of Downton Abbey. (And to stick around here; my advanced market research shows you might like some of the posts about mapping historical shipping routes.)

Following up on my previous topic modeling post, I want to talk about one thing humanists actually do with topic models once they build them, most of the time: chart the topics over time. Since I think that, although Topic Modeling can be very useful, there’s too little skepticism about the technique, I’m venturing to provide it (even with, I’m sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat ‘topics’ in topic modeling as stable abstractions, and argue for a much greater attention to the granular words that make up a topic model.

A stray idea left over from my whaling series: just how much should digital humanists be flocking to military history? Obviously the field is there a bit already: the Digital Scholarship lab at Richmond in particular has a number of interesting Civil War projects, and the Valley of the Shadow is one of the archetypal digital history projects. But it’s possible someone could get a lot of mileage out of doing a lot more.

[Temporary note, March 2015: those arriving from reddit may also be interested in this post, which has a bit more about the specific image and a few more like it.]

Note: this post is part 5 of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part 4, section 2 of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part 4 of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part I of my series on whaling logs and digital history. For the full overview, click here.

Here’s a special post from the archives of my ‘too-boring for prime time’ files. I wrote this a few months ago but didn’t know if anyone needed it: but now I’ll pull it out just for Scott Weingart, since I saw him estimating word counts using ‘the,’ which is exactly what this post is about. If that sounds boring to you: for heaven’s sake, don’t read any further.
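
The whole trick fits in a few lines: ‘the’ makes up a roughly stable share of running English text, so its count lets you back out a total length. The 5–6% figure below is an assumption for illustration, not a measured constant.

```r
estimate_word_count <- function(the_count, the_rate = 0.055) {
  the_count / the_rate
}

estimate_word_count(5500)   # roughly 100,000 words
```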

Melville Plots Oct 18 2012

Note: this post is part III of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part II of my series on whaling logs and digital history. For the full overview, click here.

I’ve been thinking more than usual lately about spatially representing the data in the various Bookworm browsers.

So in this post, I want to do two things:

First, give a quick overview of the geography of the ArXiv. This is interesting in itself–the ArXiv is the most comprehensive source of scientific papers for physics and mathematics, and plays a substantial role in some other fields. And it’s good for me going forward, as a way to build up some code that can be used on other collections.

A follow up on my post from yesterday about whether there’s more history published in times of revolution. I was saying that I thought the dataset Google uses must be counting documents of historical importance as history: because libraries tend to shelve in a way that conflates things that are about history and things that are history.

A quick post about other people’s data, when I should be getting mine in order:

[Edit–I have a new post here with some concrete examples from the US Civil War of the pattern described in this post]

It’s pretty obvious that one of the many problems in studying history by relying on the print record is that writers of books are disproportionately male.

Data can give some structure to this view. Not in the complicated, archival-silences filling way–that’s important, but hard–but just in the most basic sense. How many women were writing books? Do projects on big digital archives only answer, as Katherine Harris asks, “how do men write?” Where were gender barriers strongest, and where weakest? Once we know these sorts of things, it’s easy to do what historians do: read against the grain of archives. It doesn’t matter if they’re digital or not.

We just rolled out a new version of Bookworm (now going under the name “Bookworm Open Library”) that works on the same codebase as the ArXiv Bookworm released last month. The most noticeable changes are a cleaner and more flexible UI (mostly put together for the ArXiv by Neva Cherniavksy and Martin Camacho, and revamped by Neva to work on the OL version), coupled with some behind-the-scenes tweaks that should make it easy to add new Bookworms on other sets of texts in the future. But as a little bonus, there’s an additional metadata category in the Open Library Bookworm we’re calling “author gender.”

[The American Antiquarian Society conference in Worcester last weekend had an interesting rider on the conference invitation–they wanted 500 words from each participant on the prospects for independent research libraries. I’m posting that response here.]

Here’s the basic idea:

I saw some historians talking on Twitter about a very nice data visualization of shipping routes in the 18th and 19th centuries on Spatial Analysis. (Which is a great blog–looking through their archives, I think I’ve seen every previous post linked from somewhere else before).

Turning off the TV Apr 04 2012

I’m starting up a new blog, Prochronism (an obscure near-synonym for ‘anachronism’), for anything I want to post about TV- and movie-related anachronisms and historical language. There are two new posts up there right now: on the season premiere of Mad Men and Sunday night’s episode.

[The following is a revised version of my talk on the ‘collaboration’ panel at a conference about “Needs and Opportunities for the Research Library in the Digital Age” at the American Antiquarian Society in Worcester last week. Thanks to Paul Erickson for the invitation to attend, and everyone there for a fascinating weekend.]

[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

I’ve got an article up today on the Atlantic’s web site about how Mad Men stacks up against historical language usage. So if you’re reading this blog, go read that.

A quick follow-up on this issue of author gender.

In my last post, I looked at first names as a rough gauge of author gender to see who is missing from libraries. This method has two obvious failings as a way of finding gender:

I just saw that various Digital Humanists on Twitter were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only echo my general agreement about most of the comments.

I wanted to try to replicate and slightly expand Ted Underwood’s recent discussion of genre formation over time using the Bookworm dataset of Open Library books. I couldn’t, quite, but I want to just post the results and code up here for those who have been following that discussion. Warning: this is a rather dry post.

[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

It’s Monday, so let’s run last night’s episode of Downton Abbey through the anachronism machine. I looked for Downton Abbey anachronisms for the first time last week: using the Google Ngram dataset, I can check every two-word phrase in an episode to see if it’s more common today than then. This 1) lets us find completely anachronistic phrases, which is fun; and 2) lets us see how the language has evolved, and what shows do the best job at it. [Since some people care about this–don’t worry, no plot spoilers below].
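
Mechanically, the check is nothing fancy. A minimal sketch, assuming a hypothetical data frame `ngrams` with per-million frequencies for each two-word phrase around 1917 and today (columns `phrase`, `freq_1917`, `freq_now`):

```r
bigrams <- function(words) paste(head(words, -1), tail(words, -1))

flag_anachronisms <- function(script_words, ngrams, threshold = 10) {
  m <- merge(data.frame(phrase = bigrams(script_words)), ngrams, by = "phrase")
  m$ratio <- (m$freq_now + 1e-9) / (m$freq_1917 + 1e-9)   # "more common today than then"
  suspicious <- m[m$ratio > threshold, c("phrase", "ratio")]
  suspicious[order(-suspicious$ratio), ]
}
```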

I. The new USIH blogger LD Burnett has a post up expressing ambivalence about the digital humanities because it is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from New York Times comment threads. It’s a rhetorically appealing position–to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there’s some mystification involved–conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties up nicely into the big event in DH in the last week–the announcement of the first issue of the ambitiously all-digital Journal of Digital Humanities. So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books.

[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

Digital humanists like to talk about what insights about the past big data can bring. So in that spirit, let me talk about Downton Abbey for a minute. The show’s popularity has led many nitpickers to draft up lists of mistakes. Language Loggers Mark Liberman and Ben Zimmer have looked at some idioms that don’t belong, for Language Log, NPR, and the Boston Globe. In the best British tradition, the Daily Mail even managed to cast the errors as a sort of scandal. But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off, and checking the online sources for earliest use. This resembles what historians do nowadays: go fishing in the online resources to confirm hypotheses, but never ever start from the digital sources. That would be, as the dowager countess might say, untoward.

Though I usually work with the Bookworm database of Open Library texts, I’ve been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there’s also a lot more that could be coming out of the Ngrams set than what I’ve seen in the last year.

Another January, another set of hand-wringing about the humanities job market. So, allow me a brief departure from the digital humanities. First, in four paragraphs, the problem with our current understanding of the history job market; and then, in several more, the solution.

Tony Grafton and Jim Grossman launched the latest exchange with what they call a “modest proposal” for expanding professional opportunities for historians. Jesse Lemisch counters that we need to think bigger and mobilize political action. There’s a big and productive disagreement there, but also a deep similarity: both agree there isn’t funding inside the academy for history PhDs to find work, but think we ought to be able to get our hands on money controlled by someone else. Political pressure and encouraging words will unlock vast employment opportunities in the world of museums, archives, and other public history (Grafton) or government funded jobs programs (Lemisch). These are funny places to look for growth in a 21st-century OECD country (perhaps Bill Cronon could take the more obvious route, and make his signature initiative as AHA president creating new tenure-track jobs in the BRICs?) but the higher levels of the profession don’t see much choice but to change the world.

[This is not what I’ll be saying at the AHA on Sunday morning, since I’m participating in a panel discussion with Stefan Sinclair, Tim Sherrat, and Fred Gibbs, chaired by Bill Turkel. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I’d start with to show how much data we have, and how little things can have different meanings at big scales…]

Genre similarities Dec 16 2011

When data exploration produces Christmas-themed charts, that’s a sign it’s time to post again. So here’s a chart and a problem.

First, the problem. One of the things I like about the posts I did on author age and vocabulary change in the spring is that they have two nice dimensions we can watch changes happening in. This captures the fact that language as a whole doesn’t just up and change–things happen among particular groups of people, and the change that results has shape not just in time (it grows, it shrinks) but across those other dimensions as well.

Ted Underwood has been talking up the advantages of the Mann-Whitney test over Dunning’s Log-likelihood, which is currently more widely used. I’m having trouble getting M-W running on large numbers of texts as quickly as I’d like, but I’d say that his basic contention–that Dunning log-likelihood is frequently not the best method–is definitely true, and there’s a lot to like about rank-ordering tests.
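
For anyone who wants to try the comparison, a minimal sketch: for each word, compare its per-text rates in two corpora with a Wilcoxon rank-sum (Mann-Whitney) test. `rates_a` and `rates_b` are hypothetical matrices here, one row per text and one column per word, with matching columns.

```r
mann_whitney_words <- function(rates_a, rates_b) {
  sapply(colnames(rates_a), function(w) {
    wilcox.test(rates_a[, w], rates_b[, w])$p.value
  })
}

# sort() the result to surface the words whose per-text distributions differ most
# reliably between the corpora; running one rank test per vocabulary item is
# exactly what makes this slow on large collections.
```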

I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I’ve been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.

Dunning Amok Nov 10 2011

A few points following up my two posts on corpus comparison using Dunning Log-Likelihood last month. Just a little bit of technique.

Theory First Nov 03 2011

Natalie Cecire recently started an important debate about the role of theory in the digital humanities. She’s rightly concerned that the THATcamp motto–“more hack, less yack”–promotes precisely the wrong understanding of what digital methods offer:

As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses–two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well-curated TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote–interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles).

Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning’s Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.
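
Since the formula is short, here’s a minimal sketch of the statistic itself for a single word: `a` and `b` are its counts in the two corpora, `total_a` and `total_b` the corpus sizes in words. Big scores mean the word is far more attached to one corpus than chance would predict.

```r
dunning_g2 <- function(a, b, total_a, total_b) {
  e1 <- total_a * (a + b) / (total_a + total_b)   # expected count in corpus A
  e2 <- total_b * (a + b) / (total_a + total_b)   # expected count in corpus B
  ll <- function(obs, exp) ifelse(obs == 0, 0, obs * log(obs / exp))
  2 * (ll(a, e1) + ll(b, e2))
}

dunning_g2(a = 400, b = 100, total_a = 1e6, total_b = 1e6)   # a word overused in corpus A
```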

We just launched a new website, Bookworm, from the Cultural Observatory. I might have a lot to say about it from different perspectives; but since it was submitted to the DPLA beta sprint, let’s start with the way it helps you find library books.

We’ve been working on making a different type of browser using the Open Library books I’ve been working with to date, and it’s raised an interesting question I want to think through here.

I think many people looking at word counts on a large scale right now (myself included) have tended to make a distinction between wordcount data on the one hand, and catalog metadata on the other. (I know I have the phrase “catalog metadata” burned into my reflex vocabulary at this point–I’ve had to edit it out of this very post several times.) The idea is that we’re looking at the history of words or phrases, and the information from library catalogs can help to split or supplement that. So for example, my big concern about the ngrams viewer when it came out was that it included only one form of metadata (publication year) to supplement the word-count data, when it should really have titles, subjects, and so on. But that still assumes that word data vs. catalog metadata is a useful binary.

Hank wants me to post more, so here’s a little problem I’m working on. I think it’s a good example of how quantitative analysis can help to remind us of old problems, and possibly reveal new ones, with library collections.

My interest in texts as a historian is particularly focused on books in libraries. Used carefully, an academic library is sufficient to answer many important historical questions. (That statement might seem too obvious to utter, but it’s not–the three most important legs of historical research are books, newspapers, and archives, and the archival leg has been lengthening for several decades in a way that tips historians farther into irrelevance.) A fair concern about studies of word frequency is that they can ignore the particular histories of library acquisition patterns–although I think Anita Guerrini takes that point a bit too far in her recent article on culturomics in Miller-McCune. (By the way, the Miller-McCune article on science PhDs is my favorite magazine article of the last couple of years). A corollary benefit, though, is that they help us to start understanding better just what is included in our libraries, both digital and brick.

I mentioned earlier I’ve been rebuilding my database; I’ve also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which includes a lot of pattern searching and public outreach, we have some interesting choices about presentation.

Moving Jul 15 2011

Starting this month, I’m moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I’m very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:

What's new? Jun 16 2011

Let me get back into the blogging swing with a (too long—this is why I can’t handle Twitter, folks) reflection on an offhand comment. Don’t worry, there’s some data stuff in the pipe, maybe including some long-delayed playing with topic models.

Even at the NEH’s Digging into Data conference last weekend, one commenter brought out one of the standard criticisms of digital work—that it doesn’t tell us anything we didn’t know before. The context was some of Gregory Crane’s work in describing shifting word use patterns in Latin over very long time spans (2000 years) at the Perseus Project: Cynthia Damon, from Penn, worried that “being able to represent this as a graph instead by traditional reading is not necessarily a major gain.” That is to say, we already know this; having a chart restate the things any classicist could tell you is less than useful. I might have written down the quote wrong; it doesn’t really matter, because this is a pretty standard response from humanists to computational work, and Damon didn’t press the point as forcefully as others do. Outside the friendly confines of the digital humanities community, we have to deal with it all the time.

Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn’t happen evenly across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like “outside” more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.

The 1940 election Apr 18 2011

A couple weeks ago, I wrote about how ancestry.com structured census data for genealogy, not history, and how that limits what historians can do with it. Last week, I got an interesting e-mail from IPUMS, at the Minnesota population center on just that topic:

All the cool kids are talking about shortcomings in digitized text databases. I don’t have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it’s not just at the margins we’re missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here’s an example.

Let’s start with two self-evident facts about how print culture changes over time:

  1. The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.
  2. The writers using words change. Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.

Shane Landrum says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don’t mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.

When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was Project Gutenberg. I quickly found out, though, that they didn’t have publication years for their works, somewhat to my surprise. (It’s remarkable how much metadata holds this sort of work back, rather than data itself). They did, though, have one kind of year information: author birth dates. You can use those to create the same type of charts of word use over time that people like me, the Victorian Books project, or the Culturomists have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language.

Cronon's politics Mar 28 2011

Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and that Cronon’s upcoming AHA presidency will probably have the same pro-digital history agenda as Grafton’s, I don’t think this has much to do with DH). The whole “we are all Bill Cronon” sentiment misses what’s actually interesting. Cronon’s playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.

Author Ages Mar 24 2011

Back from Venice (which is plastered with posters for “Mapping the Republic of Letters,” making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me start off by looking, for a little while, at a new pool of book data that I think is really interesting.

I’ve been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They’re occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.

Genres in Motion Feb 22 2011

Here’s an animation of the PCA numbers I’ve been exploring this last week.

I wanted to see how well the vector space model of documents I’ve been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you’re sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab’s Pamphlet One, made me suspect individual books would be sloppier. There are a couple of different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. By the first two principal components, for example, we can make all the books in LCC subclass “BF” (psychology) blue, and use red for “QE” (geology), overlaying them on a chart of the first two principal components like I’ve been using for the last two posts:
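
In code, a minimal sketch of that overlay looks something like this, assuming hypothetical word-rate matrices `genre_rates` (one row per LCC subclass) and `book_rates` (one row per book, with each book’s subclass in a vector `book_class`), with matching columns:

```r
pca <- prcomp(genre_rates, scale. = TRUE)
books_projected <- predict(pca, book_rates)   # drop individual books into the genre space

plot(pca$x[, 1:2], pch = 3)
points(books_projected[, 1:2],
       col = ifelse(book_class == "BF", "blue", "red"))
```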

PCA on years Feb 17 2011

I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here’s an improved (using all my data on the 10,000 most common words) version of that plot:

Fresh set of eyes Feb 14 2011

One of the most important services a computer can provide for us is a different way of reading. It’s fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.

And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress classifications are probably the best hierarchical classification of books we’ll ever get. Certainly they’re the best human-done hierarchical classification. It’s literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they’re also a little foreign, at times, and it’s not clear how well they’ll correspond to machine-centric ways of categorizing books. I’ve been playing around with some of the data on LCC classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.

Going it alone Feb 11 2011

I’ve spent a lot of the last week trying to convince Princeton undergrads it’s OK to occasionally disagree with each other, even if they’re not sure they’re right. So let me make one of my notes on one of the places I’ve felt a little bit of skepticism as I try to figure out what’s going on with the digital humanities.

Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I’m interested in for my dissertation by using the Library of Congress classifications for the books. I’m going to start with the difference between psychology and philosophy. I’ve already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat more clear.

Technical notes Feb 01 2011

I’m changing several things about my data, so I’m going to describe my system again in case anyone is interested, and so I have a page to link to in the future.

Platform
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That’s mostly because the number and size of files I’m dealing with are so large that I don’t trust an existing program to handle them, and because the existing packages don’t necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn’t necessarily think to implement on my own. Maybe I’ll find some way to get on board with all that later. First, a quick note on the programs:

Open Library has pretty good metadata. I’m using it to assemble a couple new corpuses that I hope should allow some better analysis than I can do now, but just the raw data is interesting. (Although, with a single 25 GB text file as the best way to interact with it, it’s not always convenient). While I’m waiting for some indexes to build, that will give a good chance to figure out just what’s in these digital sources.

I’m trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I’ve been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I’ve avoided blogging the really boring stuff, but I’m going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I’ve called 1922 the year digital history ends before; for the kind of work I want to see, it’s nearly an insuperable barrier, and it’s one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I’ll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.

Cluster Charts Jan 18 2011

I’ll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.

Because of my primitive search engine, I’ve been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don’t get:

More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I’m thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I’ve been working with can help improve this sort of search. I’ll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?

Basic Search Jan 06 2011

To my surprise, I built a search engine as a consequence of trying to quantify information about word usage in the books I downloaded from the Internet Archive. Before I move on with the correlations I talked about in my last post, I need to explain a little about that.

Correlations Jan 05 2011

How are words linked in their usage? In a way, that’s the core question of a lot of history. I think we can get a bit of a picture of this, albeit a hazy one, using some numbers. This is the first of two posts about how we can look at connections between discourses.

Any word has a given prominence for any book. Basically, that’s the number of times it appears. (The numbers I give here are my TF-IDF scores, but for practical purposes, they’re basically equivalent to the rate of incidence per book when we look at single words. Things only get tricky when looking at multiple word correlations, which I’m not going to use in this post.) To explain graphically: here’s a chart. Each dot is a book, the x axis is the book’s score for “evolution”, and the y axis is the book’s score for “society.”
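
For the curious, the scores behind a chart like that can be sketched quickly, assuming a hypothetical document-term matrix `counts` with one row per book and one column per word:

```r
tf_idf <- function(counts) {
  tf  <- counts / rowSums(counts)                  # term frequency within each book
  idf <- log(nrow(counts) / colSums(counts > 0))   # downweight words found everywhere
  sweep(tf, 2, idf, `*`)
}

scores <- tf_idf(counts)
plot(scores[, "evolution"], scores[, "society"])
cor(scores[, "evolution"], scores[, "society"])    # one rough measure of how linked the two words are
```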

I’ve started thinking that there’s a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I’d call them:

  1. Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low level things like counting mentions, etc.
  2. Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.

Call numbers Dec 27 2010

I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.

The HathiTrust Bibliographic API is great. What a resource. There are a few odd tricks I had to put in to account for the way they integrate various catalogs (Michigan call numbers are filed under MARC 050, the Library of Congress call number field, while California ones are filed under MARC 090, a local call number field, although both seem to be basically an LCC scheme). But the openness is fantastic–you just plug an OCLC or LCCN identifier into a URL string to get an XML record. It’s possible to get a lot of OCLC numbers, in particular, by scraping Internet Archive pages. I haven’t yet found a good way to go in the opposite direction, though: from a large number of specially chosen Hathi catalogue items to IA books.
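
For anyone who wants to try the same lookup, here is a rough sketch in Python. The URL pattern and response fields follow the current HathiTrust Bibliographic API documentation as I understand it (which returns JSON wrappers rather than bare XML), so the details are assumptions that may not match the version I describe above; the OCLC number is just an example.

```python
# A rough sketch of the lookup described above. The endpoint form and the
# JSON field names are assumptions based on HathiTrust's current Bib API
# documentation; the OCLC number is only an example.
import requests

def hathi_record(id_type, id_value):
    """Fetch a full bibliographic record by identifier type ('oclc', 'lccn', ...)."""
    url = f"https://catalog.hathitrust.org/api/volumes/full/{id_type}/{id_value}.json"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

data = hathi_record("oclc", "424023")          # example identifier
for record in data.get("records", {}).values():
    print(record.get("titles"), record.get("publishDates"))
```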

Finding keywords Dec 26 2010

Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and ran a couple of examples using months and some abstract nouns. Two of the problems I’ve had with getting useful data out of this approach are:

  1. What words to use? I have 200,000, and processing those would take at least 10 times more RAM than I have (2GB, for the record). (A rough workaround is sketched just below this list.)
  2. What books to use? I can—and will—apply them across the whole corpus, but I think it’s more useful to use the data to draw distinctions between types of books we know to be interesting.
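
One crude way around the first problem, sketched here with hypothetical paths and cutoffs, is simply to cap the vocabulary at the N most common words before building any book-by-word matrix:

```python
# Illustrative only: cap the vocabulary at the N most common words so that a
# book-by-word matrix fits in memory. Paths and the cutoff are hypothetical.
import glob
import re
from collections import Counter

counts = Counter()
for path in glob.glob("books/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        counts.update(re.findall(r"[a-z]+", f.read().lower()))

VOCAB_SIZE = 10_000
vocab = [word for word, _ in counts.most_common(VOCAB_SIZE)]
print(f"Kept {len(vocab)} of {len(counts)} distinct words")
```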

Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There’s just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there’s no reason to use the ngrams data rather than just downloading the original books, because:

Second Principals Dec 23 2010

Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I’m going to try again. This post is largely a test of whether I can explain principal components analysis to people who don’t know about it, so: correct me if you already understand PCA, and let me know what’s unclear if you don’t. (Or, it goes without saying, skip it.)
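
To make those dimensions concrete, here is a toy version of the reduction, not my actual code: each book becomes a vector of word counts over a capped vocabulary, and PCA projects the vectors down to two dimensions. The corpus path and vocabulary size are placeholders.

```python
# A toy version of the reduction: treat each book as a vector of word counts
# and project onto the first two principal components. The corpus path and
# vocabulary size are placeholders, not the real settings.
import glob

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

paths = sorted(glob.glob("books/*.txt"))
texts = [open(p, encoding="utf-8", errors="ignore").read() for p in paths]

# Rows are books, columns are the 5,000 most common words in the corpus.
X = CountVectorizer(max_features=5000).fit_transform(texts).toarray()

# Each book drops from 5,000 dimensions to 2.
coords = PCA(n_components=2).fit_transform(X)
for path, (x, y) in zip(paths, coords):
    print(f"{path}\t{x:.3f}\t{y:.3f}")
```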

I wrote yesterday about how well the filters applied to remove some books from ngrams work to improve the quality of year information and OCR compared to Google Books.

As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you’re pretty much guaranteed an explosion of theories and methods.

Missing humanists Dec 17 2010

(First in a series on yesterday’s Google/Harvard paper in Science and its reception.)

Culturomics Dec 16 2010

Days from when I said “Google Trends for historical terms might be worse than nothing” to the release of “Google ngrams:” 12. So: we’ll get to see!

We all know that the OCR on our digital resources is pretty bad. I’ve often wondered if part of the reason Google doesn’t share its OCR is simply that it would show so much ugliness. (A common misreading, ‘tlie’ for ‘the’, gets about 4.6m results in Google Books.) So how bad is the Internet Archive OCR, which I’m using? I’ve started rebuilding my database, and I put in a few checks to get a better idea. Allen already asked some questions in the comments about this, so I thought I’d dump it on to the internet, since there doesn’t seem to be that much out there.
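
For what it’s worth, here is the sort of quick check I mean, sketched against a hypothetical directory of OCR’d text files. It is not how the Google Books number above was produced; it just estimates how often one known misreading turns up locally.

```python
# A rough check on one well-known OCR error: how often does 'tlie' show up
# relative to 'the' in a local set of OCR'd text files? Paths are hypothetical,
# and this is not how the Google Books figure above was produced.
import glob
import re

the_count = tlie_count = 0
for path in glob.glob("ia_texts/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    the_count += words.count("the")
    tlie_count += words.count("tlie")

print(f"'tlie' per 10,000 'the': {10_000 * tlie_count / max(the_count, 1):.1f}")
```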

Avoidance tactics Dec 14 2010

Can historical events suppress use of words? Usage of the word ‘panic’ seems to spike down around the bank panics of 1873 and 1893, and maybe 1837 too. I’m pretty confident this is just an artifact of me plugging a lot of words in to test out how fast the new database is and finding some random noise. There are too many reasons to list: 1857 and 1907 don’t have the pattern, the rebound in 1894 is too fast, etc. It’s only 1873 that really looks abnormal. What do you think:

Capitalist lackeys Dec 12 2010

I’m interested in the ways different words are tied together. That’s sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for “scientific method,” but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I’m going to think through this staying on “capitalist” as the word of the day. Fair warning: this post is a rambler.

A commenter asked why I don’t improve the metadata instead of doing this clustering stuff, which seems merely to reproduce, poorly, the work of generations of librarians in classifying books. I’d like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I’m going to think through what I know, but I’d love any advice on this, because it’s really outside my expertise.

First Principals Dec 08 2010

Let me get ahead of myself a little.

For reasons related to my metadata, I had my computer assemble some data on the frequencies of the most common words (I explain why at the end of the post). But it raises some exciting possibilities for clustering and principal components analysis (PCA); I can’t resist speculating a little bit about what else it can do to help explore the ways different languages intersect. Some charts are at the bottom.

Back to the Future Dec 06 2010

Maybe this is just Patricia Cohen’s take, but it’s interesting to note that she casts both of the text mining projects she’s put on the Times site this week (Victorian books and the Stanford Literature Lab) as attempts to use modern tools to address questions similar to vast, comprehensive tomes written in the 1950s. There are good reasons for this. Those books are some of the classics that informed the next generation of scholarship in their field; they offer an appealing opportunity to find people who should have read more than they did; and, more than some recent scholarship, they contribute immediately to questions that are of interest outside narrow disciplinary communities. (I think I’ve seen the phrase ‘public intellectuals’ more times in the four days I’ve been on Twitter than in the month before). One of the things that the Times articles highlight is how this work can re-engage a lot of the general public with current humanities scholarship.

Dan asks for some numbers on “capitalism” and “capitalist” similar to the ones on “Darwinism” and “Darwinist” I ran for Hank earlier. That seems like a nice, big question I can use to warm up the new database I set up this week and to get some basic functionality written into it.

This verges on unreflective data-dumping: but because it’s easy and I think people might find it interesting, I’m going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers, on the same terms for which the Times published Cohen’s charts of title word counts. I’ve tossed in a couple of extra words where it seems interesting—including some alternate word-forms that tell a story, using a perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren’t many American books from before then, and even the data from the 30s is a little screwy) to 1922 (the date that digital history ends–thank you, Sonny Bono). In some cases (that 1874 peak for science, for instance), the American and British trends are surprisingly close. Sometimes, they aren’t.
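
The stemmer I used is a Perl script, but for anyone who wants to try the same trick, something like NLTK’s Porter stemmer in Python is a rough equivalent (an assumption on my part, not the original code): map each word to its stem, then sum the counts of everything sharing a stem.

```python
# Not the original Perl script: a rough equivalent using NLTK's Porter stemmer
# (pip install nltk). Mapping words to stems lets related word-forms be summed
# together before charting.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["count", "counts", "counting", "counted", "capitalism", "capitalists"]:
    print(word, "->", stemmer.stem(word))
```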

Patricia Cohen’s new article about the digital humanities doesn’t come with the rafts of crotchety comments the first one did, so unlike last time I’m not in a defensive crouch. To the contrary: I’m thrilled and grateful that Dan Cohen, the main subject of the article, took the time in his moment in the sun to link to me. The article itself is really good, not just because the Cohen-Gibbs Victorian project is so exciting, but because P. Cohen gets some thoughtful comments and the NYT graphic designers, as always, do a great job. So I just want to focus on the Google connection for now, and then I’ll post my versions of the charts the Times published.

Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher’s control. I’ve noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.

Dan Cohen, the hub of all things digital history, in the news and on his blog.

I have my database finally running in a way that lets me quickly select data about books. So now I can start to ask questions that are more interesting than just how overall vocabulary shifted in American publishers. The question is, what sort of questions? I’ll probably start to shift to some of my dissertation stuff, about shifts in verbs modifying “attention”, but there are all sorts of things we can do now. I’m open to suggestions, but here are some random examples:

So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting–how does the publishing industry focus in on certain figures to create news or resurgences of interest in them? I love the way we get excited about the Civil War sesquicentennial now, or the Darwin/Lincoln year last year.

I was starting to write about the implicit model of historical change behind loess curves, which I’ll probably post soon, when I started to think some more about a great counterexample to the gradual change I’m looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.

Do it yourself Dec 02 2010

Jamie’s been asking for some thoughts on what it takes to do this–statistics backgrounds, etc. I should say that I’m doing this, for the most part, the hard way, because 1) my database is too large to start out using most tools I know of, including, I think, the R text-mining package, and 2) I want to understand how it works better. I don’t think I’m going to do the software review thing here, but there are what look like a _lot_ of promising leads at an American Studies blog.

I’ve had “digital humanities” in the blog’s subtitle for a while, but it’s a terribly off-putting term. I guess it’s supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn’t appeal to most humanists of the tweed- and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining it.

Jamie asked about assignments for students using digital sources. It’s a difficult question.

A couple of weeks ago someone referred an undergraduate to me who was interested in using some sort of digital maps for a project on a Cuban emigre writer, like the ones I did of Los Angeles German emigres a few years ago. Like most history undergraduates, she didn’t have any programming background, and she didn’t have a really substantial pile of data to work with from the start. For her to do digital history, she’d have to type hundreds of addresses and dates off of letters from the archives, and then learn some sort of GIS software or the Google Maps API, without any clear payoff. No one would get much out of forcing her to spend three days playing with databases when she’s really looking at the contents of letters.

Mostly a note to myself:

I think genre data would be helpful in all sorts of ways–tracking evolutionary language through different sciences, say, or finding which discourses are the earliest to use certain constructions like “focus attention.” The Internet Archive books have no genre information in their metadata, for the most part. The genre data I think I want to use would be Library of Congress call numbers–they divide up books in all sorts of ways, at various levels that I could parse. It’s tricky to get from one to the other, though. I could try to hit the LOC catalog with a script that searches for title, author, and year from the metadata I do have, but that would miss a lot and maybe produce false positives; plus, the LOC catalog is sort of tough to machine-query. Or I could try to run a completely statistical clustering, but I don’t trust that it would come out with categories that correspond to ones in common use. Some sort of hybrid method might be best–just a quick sketch below.
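
One small piece of the hybrid idea is easy to illustrate on its own: once a call number is in hand, its leading letters give a coarse subject class that could stand in for genre. The regex and the heavily abbreviated class table below are illustrative assumptions about typical LCC formatting, not a finished mapping.

```python
# A small, illustrative piece of the "hybrid" idea: the leading letters of an
# LC call number give a coarse subject class. The regex and the abbreviated
# class table are assumptions about typical LCC formatting.
import re

LCC_CLASSES = {
    "B": "Philosophy, Psychology, Religion",
    "D": "World History",
    "E": "History of the Americas",
    "P": "Language and Literature",
    "Q": "Science",
    "T": "Technology",
}

def lcc_class(call_number):
    """Return the top-level LCC letter and label, if the string looks like an LCC number."""
    m = re.match(r"\s*([A-Z]{1,3})\s*\d", call_number)
    if not m:
        return None
    letter = m.group(1)[0]          # the first letter is the broadest class
    return letter, LCC_CLASSES.get(letter, "other")

print(lcc_class("QH365 .O2 1859"))   # -> ('Q', 'Science')
print(lcc_class("PS1305 .A1 1884"))  # -> ('P', 'Language and Literature')
```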

Top ten authors Nov 28 2010

Most intensive text analysis is done on heavily maintained sources. I’m using a mess, by contrast, but a much larger one. Partly, I’m doing this tendentiously–I think it’s important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.

In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it’s possible to create a dendrogram, which is a sort of family tree. Looking around the literary-studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven’t seen many done for words themselves, though; it turns out to work fairly well. I thought it might help answer Hank’s question about the difference between evolutionism and darwinism, but, as you’ll see, that distinction seems to be a little too fine for now.
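
Mechanically, the tree is simple to draw once the distances exist; here is a minimal sketch with scipy, using a small made-up distance matrix in place of the real one.

```python
# A minimal version of the dendrogram step: given a square distance matrix
# among words, run hierarchical clustering and draw the family tree. The
# distances below are made up for illustration.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

words = ["darwinism", "evolutionism", "socialism", "capitalism"]
distances = np.array([
    [0.0, 0.2, 0.7, 0.8],
    [0.2, 0.0, 0.6, 0.7],
    [0.7, 0.6, 0.0, 0.3],
    [0.8, 0.7, 0.3, 0.0],
])

# squareform converts the square matrix to the condensed form linkage() expects.
Z = linkage(squareform(distances), method="average")
dendrogram(Z, labels=words)
plt.show()
```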

What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between “Darwinism” and evolutionism, so:

Abraham Lincoln invented Thanksgiving. And I suppose this might be a good way to prove to more literal-minded students that the Victorian invention of tradition really happened. Other than that, I don’t know what this means.

This is the second post on ways to measure connections—or more precisely, distance—between words by looking at how often they appear together in books. These posts are a little dry, and the payoff doesn’t come for a while, so let me state the payoff up front (after which you can bail on this post). I’m trying to create some simple methods that will work well with historical texts to see relations between words—what words are used in similar semantic contexts, what groups of words tend to appear together. First I’ll apply them to the isms, and then we’ll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence “How often, compared to what we would expect, does a given word appear with any other given word?” into different components. Now let’s look at the central, and maybe most important, part of the question—how often do we expect words to appear together?
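
The arithmetic behind that expectation is worth spelling out. If two words were scattered independently across books, the expected number of books containing both would be the size of the corpus times the product of the two words’ rates. The sketch below runs the calculation on made-up counts:

```python
# The "expected" half of the question: if two words were spread independently
# across books, how many books would contain both? All counts here are
# hypothetical, purely to show the arithmetic.

n_books = 30_000                                       # size of the corpus
books_with = {"darwinism": 900, "evolution": 6_000}    # books containing each word

p_darwinism = books_with["darwinism"] / n_books
p_evolution = books_with["evolution"] / n_books

# Under independence, the chance a book contains both words is the product
# of the two rates.
expected_together = n_books * p_darwinism * p_evolution
print(f"Expected books with both: {expected_together:.0f}")       # 180

# Comparing an observed count against the expectation gives an
# over- or under-representation ratio.
observed_together = 400                                # hypothetical observation
print(f"Observed / expected: {observed_together / expected_together:.2f}")
```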

Links between words Nov 23 2010

Ties between words are one of the most important things computers can tell us about language. I already looked at one way of measuring connections between words when talking about the phrase “scientific method”--the percentage of occurrences of a word that occur with another phrase. I’ve been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated–I never posted anything from Russia because I couldn’t get the whole system in order in my time here. So instead, I want to take a couple of posts to break down a simple sentence and think about how we could statistically measure each component. Here’s the sentence:

More on Grafton Nov 18 2010

One more note on that Grafton quote, which I’ll post below.

“The digital humanities do fantastic things,” said the eminent Princeton historian Anthony Grafton. “I’m a believer in quantification. But I don’t believe quantification can do everything. So much of humanistic scholarship is about interpretation.”

Moscow and NyTimes Nov 17 2010

I’m in Moscow now. I still have a few things to post from my layover, but there will be considerably lower volume through Thanksgiving.

I don’t want to comment too much on yesterday’s (today’s? I can’t tell anymore) article about digital humanities in the New York Times, but a couple of people e-mailed about it. So a couple of random points:

Lumpy words Nov 17 2010


What are the most distinctive words in the nineteenth century? That’s an impossible question, of course. But as I started to say in my first post about bookcounts, [link] we can find something useful–the words that are most concentrated in specific texts. Some words appear at about the same rate in all books, while some are more highly concentrated in particular books. And historically, the words that are more highly concentrated may be more specific in their meanings–at the very least, they might help us to analyze genre or other forms of contextual distribution.
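
One crude way to put a number on that kind of concentration (not necessarily the measure I settle on) is to compare words with similar total counts and see how many books each appears in; the counts below are invented for illustration.

```python
# A crude concentration measure: among words with similar total counts, the
# one squeezed into fewer books has more occurrences per containing book.
# All counts below are hypothetical.

stats = {
    # word: (total occurrences, number of books containing it)
    "january":  (50_000, 20_000),   # spread thinly across many books
    "polygamy": (50_000, 1_500),    # concentrated in a few books
}

for word, (wordcount, bookcount) in stats.items():
    print(f"{word:10s} {wordcount / bookcount:6.1f} occurrences per containing book")
```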

Isms and ists Nov 15 2010

Hank asked for a couple of charts in the comments, so I thought I’d oblige. Since I’m starting to feel they’re better at tracking the permeation of concepts, we’ll use appearances per 1000 books as the y axis:

I’m going to keep looking at the list of isms, because a) they’re fun; and b) the methods we use on them can be used on any group of words–for example, ones that we find are highly tied to evolution. So, let’s use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?

Here’s a fun way of using this dataset to convey a lot of historical information. I took all the 414 words that end in ism in my database, and plotted them by the year in which they peaked,* with the size proportional to their use at peak. I’m going to think about how to make it flashier, but it’s pretty interesting as it is. Sample below, and full chart after the break.
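
The mechanics are simple enough to sketch, with tiny made-up series standing in for the real yearly counts: find each word’s peak year and plot it there, sized by the rate at that peak.

```python
# A sketch of the peak-year chart: for each ism, find the year its usage
# peaked and plot it there, with marker area tracking the rate at the peak.
# The series below are tiny hypothetical stand-ins for the real counts.
import matplotlib.pyplot as plt

# word -> {year: appearances per 1000 books}   (hypothetical data)
isms = {
    "transcendentalism": {1840: 3.0, 1850: 2.0, 1900: 0.5},
    "darwinism":         {1860: 0.2, 1880: 4.0, 1900: 2.5},
    "imperialism":       {1880: 0.5, 1900: 6.0, 1910: 4.0},
}

for i, (word, series) in enumerate(isms.items()):
    peak_year = max(series, key=series.get)
    peak_rate = series[peak_year]
    plt.scatter(peak_year, i, s=peak_rate * 60)
    plt.annotate(word, (peak_year, i), xytext=(5, 5), textcoords="offset points")

plt.yticks([])
plt.xlabel("Year of peak usage")
plt.show()
```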

Infrastructure Nov 13 2010

It’s time for another bookkeeping post. Read below if you want to know about changes I’m making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton’s supercomputer time, and why I didn’t just create a text scatter chart for evolution like the one I made for scientific method. This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.

Back to Darwin Nov 13 2010

Henry asks in the comments whether the decline in evolutionary thought in the 1890s is the “‘Eclipse of Darwinism,’ rise or prominence of neo-Lamarckians and saltationism and kooky discussions of hereditary mechanisms?” Let’s take a look, with our new and improved data (and better charts, too, compared to earlier in the week–any suggestions on design?). First, three words very closely tied to the theory of natural selection.

All right, let’s put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching (something Jamie asked for more about). Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term ‘scientific method.’ I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays we can use. This post is a sort of catalog of some of the low-hanging fruit in text analysis.

Bookcounts are in Nov 11 2010

I now have counts for the number of books a word appears in, as well as the number of times it appears. Just as I hoped, this gives a new perspective on a lot of the questions we have looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I’m just going to call them bookcounts–I hope that’s a clear enough phrase.)
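
The distinction is easy to compute in a single pass; here is a minimal sketch over a hypothetical directory of plain-text books, tallying both counts at once.

```python
# A minimal sketch of the wordcount/bookcount distinction: one pass over a
# hypothetical directory of plain-text books, tallying the total occurrences
# of each word and the number of distinct books that contain it.
import glob
import re
from collections import Counter

wordcounts = Counter()   # total occurrences across the corpus
bookcounts = Counter()   # number of books containing the word at least once

for path in glob.glob("books/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    wordcounts.update(words)
    bookcounts.update(set(words))

for w in ("telephone", "telegraph", "railroad"):
    print(f"{w}: {wordcounts[w]} occurrences in {bookcounts[w]} books")
```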

Obviously, I like charts. But I’ve periodically been presenting data as a number of random samples as well. It’s a technique that can be important for digital humanities analysis, and one that draws more on the skills of humanistic training, so it might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own–it’s just a set of coordinates. Even in the big education datasets I used to work with, the core facts I was aggregating up from were generally very dull–one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there’s real meaning embodied in every point, and we’re far better equipped than the computer to understand it. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can’t read everything ourselves, but it’s good to check up periodically–that’s why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample look like.

Here’s what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster’s); but including technical vocabulary, a million, give or take a few hundred thousand. But for my poor computer, that’s too many, for reasons too technical to go into here. Suffice it to say that I’m asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though–I’ll put the only big thought I have about it in another post later tonight.

I can’t resist making a few more comments on that technologies graph that I laid out. I’m going to add a few thousand more books to the counts overnight, so I won’t make any new charts until tomorrow, but look at this one again.

An anonymous correspondent says:

You mention in the post about evolution & efficiency that “Offhand, the evolution curve looks more the ones I see for technologies, while the efficiency curve resembles news events.”

That’s a very interesting observation, and possibly a very important one if it’s original to you and can be substantiated. Do you have an example of a tech vs news event graph? Something like lightbulbs or batteries vs the Spanish-American War might provide a good test case.

Also, do you think there might be changes in how these graphs play out over a century? That is, do news events remain separate from tech stuff? Tech changes these days are often news events themselves, and distributed similarly across media.

I think another way to put the tech vs news event could be in terms of the kind of event it is: structural change vs superficial, mid-range event vs short-term.

Anyhow, a very interesting idea, of using the visual pattern to recognize and characterize a change. While I think your emphasis on the teaching angle (rather than research) is spot on, this could be one application of these techniques where it’d be more useful in research.

Back to Basics Nov 08 2010

I’ve rushed straight into applications here without taking much time to look at the data I’m working with. So let me take a minute to describe the set and how I’m trimming it.

The Internet Archive has copies of some unspecified percentage of the public-domain books for which Google Books has released PDFs. They have done OCR (their own, I think, not Google’s) for most of them. The metadata isn’t great, but it’s usable–the same goes for the OCR. In total, they have 900,000 books from the Google collection–Dan Cohen claims to have 1.2 million from English publishers alone, so we’re looking at some sort of a sample. The physical books are from major research libraries–Harvard, Michigan, etc.

Collocation Nov 07 2010

A collection as large as the Internet Archive’s OCR database means I have to think through what I want well in advance of doing it. I’m only using a small subset of their 900,000 Google-scanned books, but that’s still 16 gigabytes–it takes a couple hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I’ll still have to figure out what sort of relations are worth looking for. So what are some?

Taylor vs. Darwin Nov 07 2010

Let’s start with just some of the basic wordcount results. Dan Cohen posted some similar things for the Victorian period on his blog, and used the numbers mostly to test hypotheses about change over time. I can give you a lot more like that (I confirmed for someone, though not as neatly as he’d probably like, that ‘business’ became a much more prevalent word through the 19C). But as Cohen implies, such charts can be cooler than they are illuminating.

Intro Nov 06 2010

I’m going to start using this blog to work through some issues in finding useful applications for digital history. (Interesting applications? Applications at all?)

Right now, that means trying to figure out how to use large amounts of textual data to draw conclusions or refine questions. I currently have the Internet Archive’s OCRed text files for about 30,000 books by large American publishers from 1830 to 1920. I’ve done this partly to help with my own research, and partly to try a different way of thinking about history and the texts we read.

If only it were that easy.
(Norfolk Journal and Guide, 1953)