Posts with tag MetadataBack to all posts
All the cool kids are talking about shortcomings in digitized text databases. I don’t have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it’s not just at the margins we’re missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here’s an example.
I’m changing several things about my data, so I’m going to describe my system again in case anyone is interested, and so I have a page to link to in the future.
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That’s mostly because the number and size of files I’m dealing with are so large that I don’t trust an existing program to handle them, and because the existing packages don’t necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn’t necessarily think to implement on my own. Maybe I’ll find some way to get on board with all that later. First, a quick note on the programs:
I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.
The HathiTrust Bibliographic API is great. What a resource. There are a few odd tricks I had to put in to account for their integrating various catalogs together (Michigan call numbers are filed under MARC 050 (Library of Congress catalog), while California ones are filed under MARC 090 (local catalog), for instance, although they both seem to be basically an LCC scheme). But the openness is fantastic–you just plug in OCLC or LCCN identifiers into a url string to get an xml record. It’s possible to get a lot of OCLCs, in particular, by scraping Internet Archive pages. I haven’t yet found a good way to go the opposite direction, though: from a large number of specially chosen Hathi catalogue items to IA books.
A commenter asked about why I don’t improve the metadata instead of doing this clustering stuff, which seems just poorly to reproduce the work of generations of librarians in classifying books. I’d like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I’m going to think through what I know, but I’d love any advice on this because it’s really outside my expertise.