Posts with tag Ngrams
Here’s a little irony I’ve been meaning to post. Large-scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a “book” is in ways that make the selection criteria for Ngrams—if it made it into print, it must have _some_ significance—completely meaningless.
Starting this month, I’m moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I’m very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:
All the cool kids are talking about shortcomings in digitized text databases. I don’t have anything as detailed as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it’s not just at the margins we’re missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here’s an example.
In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I’ve called 1922 the year digital history ends before; for the kind of work I want to see, it’s nearly an insuperable barrier, and it’s one I think not enough non-tech-savvy humanists think about. So let me dig in a little.
The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I’ll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.
I’ve started thinking that there’s a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I’d call them:
- Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low level things like counting mentions, etc.
- Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms, like charts of word use or graphics of information exchange, that themselves require a sort of historical reading.
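The distinction above can be sketched in a few lines of code. This is a toy illustration, not anything from the actual projects discussed: `passages_mentioning` stands in for assisted reading (the computer points a human at texts worth reading), while `word_use_by_year` stands in for text mining (the texts are discarded and only a chart-ready table survives). The corpus and function names here are my own invented examples.

```python
from collections import Counter
import re

# Toy corpus of (year, text) pairs standing in for a digitized library.
corpus = [
    (1850, "The railroad changed everything. The railroad reached the coast."),
    (1870, "Telegraph lines followed the railroad across the plains."),
]

def passages_mentioning(term, corpus):
    """Assisted reading: return the passages a human should go read,
    with a mention count as a low-level aid."""
    hits = []
    for year, text in corpus:
        n = len(re.findall(re.escape(term), text, flags=re.IGNORECASE))
        if n:
            hits.append((year, n, text))
    return hits

def word_use_by_year(corpus):
    """Text mining: chop the texts up entirely and keep only a
    word-frequency table, suitable for charting."""
    table = {}
    for year, text in corpus:
        table[year] = Counter(re.findall(r"[a-z']+", text.lower()))
    return table
```

The first function hands texts back to a reader; the second produces data that itself has to be read historically, which is the divide the two labels are trying to mark.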
Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There’s just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there’s no reason to use the ngrams data rather than just downloading the original books, because:
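To make the pre-copyright point concrete: once you have the full text of a public-domain book (from, say, a library scan), computing your own 1-gram counts is trivial, so the aggregated ngrams file adds nothing. This is a minimal sketch under my own assumptions about tokenization; the regex-based tokenizer is a stand-in, not how Google actually tokenized the corpus.

```python
from collections import Counter
import re

def unigram_counts(text):
    """Count 1-grams in the plain text of a downloaded book.
    Tokenization here (lowercased runs of letters/apostrophes) is an
    illustrative choice, not Google's actual pipeline."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# With full text in hand you can also do everything ngram counts
# cannot support: collocations, passage retrieval, arbitrary n.
book = "Whale, whale, and again the whale."
counts = unigram_counts(book)
```

With the original text you keep every option the ngrams release forecloses; for post-1922 books, the counts are all anyone outside Google gets.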
I wrote yesterday about how well the filters applied to remove some books from ngrams work to improve the quality of year metadata and OCR compared to Google Books.
As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you’re pretty much guaranteed an explosion of theories and methods.