Posts with tag Featured
Note: this post is part I of my series on whaling logs and digital history. For the full overview, click here.
[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]
Digital humanists like to talk about what insights big data can bring about the past. So in that spirit, let me talk about Downton Abbey for a minute. The show’s popularity has led many nitpickers to draft up lists of mistakes. Language Loggers Mark Liberman and Ben Zimmer have looked at some idioms that don’t belong (for Language Log, NPR, and the Boston Globe). In the best British tradition, the Daily Mail even managed to cast the errors as a sort of scandal. But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off and checking the online sources for earliest use. This resembles what historians do nowadays: go fishing in the online resources to confirm hypotheses, but never ever start from the digital sources. That would be, as the dowager countess might say, untoward.
Though I usually work with the Bookworm database of Open Library texts, I’ve been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there’s also a lot more that could be coming out of the Ngrams set than what I’ve seen in the last year.
[This is not what I’ll be saying at the AHA on Sunday morning, since I’m participating in a panel discussion with Stefan Sinclair, Tim Sherrat, and Fred Gibbs, chaired by Bill Turkel. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I’d start with to show how much data we have, and how little things can have different meanings at big scales…]
I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I’ve been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.
Let’s start with two self-evident facts about how print culture changes over time:
- The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.
- The writers using words change. Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.
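To make the first of those points concrete, here is a minimal sketch of the basic counting involved: per-year relative frequencies of a word across a hypothetical list of (year, text) records. This is purely illustrative Python, not the actual Bookworm or Ngrams pipeline.

```python
from collections import Counter, defaultdict

def yearly_frequencies(records, word):
    """Relative frequency of `word` per year, from (year, text) records.

    A toy illustration: real corpora need tokenization, OCR cleanup,
    and metadata handling far beyond lowercase-and-split.
    """
    counts = defaultdict(Counter)
    for year, text in records:
        counts[year].update(text.lower().split())
    out = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        if total:
            out[year] = counts[year][word] / total
    return out
```

Given two toy records, `yearly_frequencies(records, "whale")` returns the share of each year's tokens that are "whale", which is the quantity an Ngrams-style chart plots over time.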
I’ve been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They’re occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.
One of the most important services a computer can provide for us is a different way of reading. It’s fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.
And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress classifications are probably the best hierarchical classification of books we’ll ever get. Certainly they’re the best human-made hierarchical classification. It’s literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they’re also a little foreign at times, and it’s not clear how well they’ll correspond to machine-centric ways of categorizing books. I’ve been playing around with some of the data on LCC classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.
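As a toy illustration of how machine-readable those classifications are, here is a sketch that splits a call number into its letter class and numeric subclass range. This is hypothetical Python for illustration only; real LCC numbers also carry cutters, dates, and volume information that this ignores.

```python
import re

def lcc_class(call_number):
    """Split a Library of Congress call number into its letter class
    and leading subclass number, e.g. 'QH365.O2' -> ('QH', 365).

    A toy parser: the letter prefix gives the broad class (Q = Science,
    P = Language and Literature, etc.), the number the subclass range.
    """
    m = re.match(r"([A-Z]{1,3})(\d+)?", call_number.strip())
    if not m:
        return None
    letters, digits = m.groups()
    return letters, int(digits) if digits else None
```

Grouping books by the tuple this returns is enough to start counting genres per year, which is the kind of pattern in intellectual history the post has in mind.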
I’m changing several things about my data, so I’m going to describe my system again in case anyone is interested, and so I have a page to link to in the future.
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That’s mostly because the number and size of files I’m dealing with are so large that I don’t trust an existing program to handle them, and because the existing packages don’t necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn’t necessarily think to implement on my own. Maybe I’ll find some way to get on board with all that later. First, a quick note on the programs:
In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I’ve called 1922 the year digital history ends before; for the kind of work I want to see, it’s nearly an insuperable barrier, and it’s one I think not enough non-tech-savvy humanists think about. So let me dig in a little.
More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I’m thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I’ve been working with can help improve this sort of search. I’ll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?
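One simple version of "patterns of correlations across books," sketched here as an illustration (this is not the algorithm the post describes): compute the Pearson correlation between two words' per-book relative frequencies, so that words which rise and fall together across the library score near 1.

```python
from math import sqrt

def word_correlation(books, w1, w2):
    """Pearson correlation between two words' per-book relative
    frequencies; `books` is a list of token lists. Illustrative only."""
    def freqs(w):
        return [b.count(w) / len(b) for b in books if b]
    x, y = freqs(w1), freqs(w2)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

Ranking every word in the vocabulary by its correlation with, say, "slavery" across Civil War era books is the sort of query this enables, and it separates word association from raw word frequency.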
I’ve started thinking that there’s a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I’d call them:
- Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low level things like counting mentions, etc.
- Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.
This verges on unreflective data-dumping: but because it’s easy and I think people might find it interesting, I’m going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers, on the same terms for which the Times published Cohen’s charts of title word counts. I’ve tossed in a couple extra words where it seems interesting, including some alternate word-forms that tell a story, using a Perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren’t many American books from before, and even the data from the 30s is a little screwy) to 1922 (the date that digital history ends; thank you, Sonny Bono). In some cases (that 1874 peak for science), the American and British trends are surprisingly close. Sometimes, they aren’t.
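The stemmer mentioned above was written in Perl (and was probably Porter-style); as a hypothetical illustration of the idea, here is a crude suffix-stripping stemmer in Python, just enough to merge forms like "sciences"/"science" or "studies"/"study" when counting:

```python
def stem(word):
    """A crude suffix-stripping stemmer, illustrative only: strip a
    handful of common English suffixes, leaving at least a 3-letter stem.
    Not the Porter algorithm, and not the Perl stemmer the post used."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stemmed = word[: -len(suffix)]
            return stemmed + "y" if suffix == "ies" else stemmed
    return word
```

Running every token through a function like this before counting is what lets alternate word-forms tell one story instead of several fragmented ones.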
I was starting to write about the implicit model of historical change behind loess curves, which I’ll probably post soon, when I started to think some more about a great counterexample to the gradual change I’m looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.
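One way to separate those commemoration spikes from the gradual trends a loess curve models, sketched here as an illustration rather than the post's actual method: flag years whose counts tower over a local moving average by several local standard deviations.

```python
import statistics

def find_spikes(series, window=5, threshold=3.0):
    """Flag years whose count exceeds the mean of nearby years by
    `threshold` local standard deviations; `series` maps year -> count.
    A rough spike detector, illustrative only."""
    years = sorted(series)
    spikes = []
    for i, year in enumerate(years):
        lo, hi = max(0, i - window), min(len(years), i + window + 1)
        neighbors = [series[years[j]] for j in range(lo, hi) if j != i]
        mean = statistics.fmean(neighbors)
        sd = statistics.pstdev(neighbors)
        # max(sd, tiny) keeps a flat neighborhood from masking a spike
        if series[year] > mean + threshold * max(sd, 1e-9):
            spikes.append(year)
    return spikes
```

An anniversary commemoration shows up as exactly this kind of one-year excursion above an otherwise smooth baseline, which is why it breaks the gradual-change model a loess fit assumes.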
I’ve had “digital humanities” in the blog’s subtitle for a while, but it’s a terribly offputting term. I guess it’s supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn’t appeal to most humanists of the tweed– and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining it.
What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between “Darwinism” and “evolutionism,” so:
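That "appears most often with" measure can be as simple as counting shared-book occurrences. A minimal Python sketch, assuming books are token lists (illustrative only, not the post's actual tabulation):

```python
from collections import Counter

def cooccurring_words(books, target, top=10):
    """Count which words appear in the same books as `target`:
    the simplest possible co-occurrence measure. `books` is a
    list of token lists; returns the `top` most frequent companions."""
    counts = Counter()
    for tokens in books:
        if target in tokens:
            counts.update(w for w in set(tokens) if w != target)
    return counts.most_common(top)
```

Comparing the top companions of "Darwinism" against those of "evolutionism" is exactly the kind of contrast the question in the comments calls for.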
Here’s a fun way of using this dataset to convey a lot of historical information. I took all the 414 words that end in ism in my database, and plotted them by the year in which they peaked,* with the size proportional to their use at peak. I’m going to think about how to make it flashier, but it’s pretty interesting as it is. Sample below, and full chart after the break.
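The datum plotted for each -ism is just its peak, which takes almost no computation. A sketch in Python, shown purely as an illustration of how little machinery the chart requires:

```python
def peak(series):
    """Return (peak_year, peak_value) for a {year: frequency} series:
    the two numbers plotted for each -ism (position and dot size)."""
    year = max(series, key=series.get)
    return year, series[year]
```

Applied across all 414 -isms, this yields one (year, size) pair per word, and the chart is just a scatter of those pairs.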
All right, let’s put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term ‘scientific method.’ I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays we can use. This post is a sort of catalog of some of the low-hanging fruit in text analysis.
I’ve rushed straight into applications here without taking much time to look at the data I’m working with. So let me take a minute to describe the set and how I’m trimming it.
The Internet Archive has copies of some unspecified percentage of the public domain books for which Google Books has released PDFs. They have done OCR (their own, I think, not Google’s) for most of them. The metadata isn’t great, but it’s usable; the same goes for the OCR. In total, they have 900,000 books from the Google collection. Dan Cohen claims to have 1.2 million from English publishers alone, so we’re looking at some sort of a sample. The physical books are from major research libraries: Harvard, Michigan, etc.
Let’s start with just some of the basic wordcount results. Dan Cohen posted some similar things for the Victorian period on his blog, and used the numbers mostly to test hypotheses about change over time. I can give you a lot more like that (I confirmed for someone, though not as neatly as he’d probably like, that ‘business’ became a much more prevalent word through the 19C). But as Cohen implies, such charts can be cooler than they are illuminating.