Posts with tag Building a Corpus
Shane Landrum (UNLISTED CITATION) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don’t mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.
I’ve been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They’re occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.
I’m changing several things about my data, so I’m going to describe my system again in case anyone is interested, and so I have a page to link to in the future.
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That’s mostly because the number and size of files I’m dealing with are so large that I don’t trust an existing program to handle them, and because the existing packages don’t necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn’t necessarily think to implement on my own. Maybe I’ll find some way to get on board with all that later. First, a quick note on the programs:
Open Library has pretty good metadata. I’m using it to assemble a couple of new corpuses that I hope will allow better analysis than I can do now, but even the raw data is interesting (although, since the best way to interact with it is a single 25 GB text file, it’s not always convenient). While I’m waiting for some indexes to build, I have a good chance to figure out just what’s in these digital sources.
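A file that size is easier to explore one line at a time than to load whole. The following is a minimal sketch, not the setup used here (which is MySQL, Perl, and R). It assumes each line is tab-separated with a JSON record in the last column, and the field names (`lc_classifications`, `subjects`, `publish_date`) and the function name `count_fields` are illustrative assumptions rather than anything confirmed above.

```python
import io
import json
from collections import Counter


def count_fields(lines, fields=("lc_classifications", "subjects", "publish_date")):
    """Stream over dump lines and count how many records carry each field."""
    total = 0
    seen = Counter()
    for line in lines:
        # Assumed layout: tab-separated columns with a JSON record in the last one.
        payload = line.rstrip("\n").split("\t")[-1]
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort a long run
        if not isinstance(record, dict):
            continue
        total += 1
        for field in fields:
            if record.get(field):  # present and non-empty
                seen[field] += 1
    return total, seen


# Tiny in-memory stand-in for the real file; these records are invented.
sample = io.StringIO(
    '/type/edition\t/books/A\t1\t2010-01-01\t{"title": "One", "lc_classifications": ["QH366 .D2"]}\n'
    '/type/edition\t/books/B\t1\t2010-01-01\t{"title": "Two", "subjects": []}\n'
    'not a valid record\n'
)
total, seen = count_fields(sample)
print(total, dict(seen))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of `sample`, so memory use stays flat however large the file is.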
I’m trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I’ve been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I’ve avoided blogging the really boring stuff, but I’m going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.
A commenter asked why I don’t improve the metadata instead of doing this clustering stuff, which seems just to reproduce, poorly, the work of generations of librarians in classifying books. I’d like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I’m going to think through what I know, but I’d love any advice on this because it’s really outside my expertise.
Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher’s control. I’ve noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.
Mostly a note to myself:
I think genre data would be helpful in all sorts of ways: tracking evolutionary language through different sciences, say, or finding what discourses are the earliest to use certain constructions like “focus attention.” The Internet Archive books have no genre information in their metadata, for the most part. The genre data I think I want to use would be Library of Congress call numbers, which divide up books in all sorts of ways at various levels that I could parse. It’s tricky to get from one to the other, though. I could try to hit the LOC catalog with a script that searches for title, author and year from the metadata I do have, but that would miss a lot and maybe have false positives, plus the LOC catalog is sort of tough to machine-query. Or I could try to run a completely statistical clustering, but I don’t trust that that would come out with categories that correspond to ones in common use. Some sort of hybrid method might be best; just a quick sketch below.
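As an illustration of the call-number side only (not the hybrid method itself), here is a minimal sketch of how a call number might be split into levels. The pattern is a deliberate simplification: it reads the leading letters and class number and ignores Cutter numbers and dates. The names `CALL_NUMBER` and `parse_call_number` and the sample call numbers are hypothetical.

```python
import re
from collections import Counter

# One to three leading letters, then a class number with an optional decimal part.
# Requiring the number avoids treating ordinary words as call numbers.
CALL_NUMBER = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)")


def parse_call_number(call_number):
    """Split a call number into main class, subclass, and class number."""
    match = CALL_NUMBER.match(call_number.upper())
    if match is None:
        return None  # not recognizable as a call number under this simple pattern
    letters, number = match.groups()
    return {"class": letters[0], "subclass": letters, "number": float(number)}


# Invented examples, including one string that should be rejected.
examples = ["QH366 .D2", "E457.905", "PS 1305 .A1", "not a call number"]
parsed = [parse_call_number(c) for c in examples]

# Tally at the coarsest level; the same could be done per subclass.
by_class = Counter(p["class"] for p in parsed if p is not None)
print(parsed[0], dict(by_class))
```

Counting at the `class` level gives broad groupings, while `subclass` and `number` allow finer cuts, so the same parsed record supports segmenting at more than one level.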
Most intensive text analysis is done on heavily maintained sources. I’m using a mess, by contrast, but a much larger one. Partly, I’m doing this tendentiously–I think it’s important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.
It’s time for another bookkeeping post. Read below if you want to know about changes I’m making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton’s supercomputer time, and why I didn’t just create a text scatter chart for evolution like the one I made for scientific method. This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.
Obviously, I like charts. But I’ve periodically been presenting data as a number of random samples, as well. It’s a technique that can be important for digital humanities analysis. And it’s one that can draw more on the skills in humanistic training, so might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own–it’s just a set of coordinates. Even in the big education datasets I used to work with, the core facts that I was aggregating up from were generally very dull–one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there’s real meaning embodied in every point, that we’re far better equipped to understand than the computer. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can’t read everything ourselves, but it’s good to check up periodically–that’s why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample are.
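Both kinds of spot check are cheap to script. This is a minimal sketch under stated assumptions: titles are held in a plain list, word counts in a `collections.Counter`, and the function names `random_titles` and `words_at_rank` are hypothetical. It is not the code behind the samples shown on this blog.

```python
import random
from collections import Counter


def random_titles(titles, k=20, seed=None):
    """Draw k titles without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    titles = list(titles)
    return rng.sample(titles, min(k, len(titles)))


def words_at_rank(counts, rank, window=5):
    """Return the words starting at a frequency rank (1 = most common)."""
    ranked = [word for word, _ in counts.most_common()]
    start = max(rank - 1, 0)
    return ranked[start:start + window]


# Invented stand-ins for the real title list and word counts.
titles = ["Title %d" % i for i in range(100)]
counts = Counter("the the the of of cat".split())

print(random_titles(titles, k=5, seed=1))
print(words_at_rank(counts, rank=2, window=2))
```

With real counts, `words_at_rank(counts, 300000)` would show what sits at the 300,000th position, and the seed keeps a sample stable if it needs to be revisited later.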
I’ve rushed straight into applications here without taking much time to look at the data I’m working with. So let me take a minute to describe the set and how I’m trimming it.
The Internet Archive has copies of some unspecified percentage of the public domain books for which Google Books has released PDFs. They have done OCR (their own, I think, not Google’s) for most of them. The metadata isn’t great, but it’s usable, and the same goes for the OCR. In total, they have 900,000 books from the Google collection; Dan Cohen claims to have 1.2 million from English publishers alone, so we’re looking at some sort of a sample. The physical books are from major research libraries: Harvard, Michigan, etc.