Posts with tag Digital Humanities
I just saw that various Digital Humanists on Twitter were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only echo my general agreement with most of the comments.
I. The new USIH blogger LD Burnett has a post up expressing ambivalence about the digital humanities because the field is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from New York Times comment threads. It’s a rhetorically appealing position–to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there’s some mystification involved–conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties in nicely with the big event in DH in the last week–the announcement of the first issue of the ambitiously all-digital Journal of Digital Humanities. So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books.
Let me get back into the blogging swing with a (too long—this is why I can’t handle Twitter, folks) reflection on an offhand comment. Don’t worry, there’s some data stuff in the pipe, maybe including some long-delayed playing with topic models.
Even at the NEH’s Digging into Data conference last weekend, one commenter brought out one of the standard criticisms of digital work—that it doesn’t tell us anything we didn’t know before. The context was some of Gregory Crane’s work in describing shifting word use patterns in Latin over very long time spans (2000 years) at the Perseus Project: Cynthia Damon, from Penn, worried that “being able to represent this as a graph instead by traditional reading is not necessarily a major gain.” That is to say, we already know this; having a chart restate the things any classicist could tell you is less than useful. I might have written down the quote wrong; it doesn’t really matter, because this is a pretty standard response from humanists to computational work, and Damon didn’t press the point as forcefully as others do. Outside the friendly confines of the digital humanities community, we have to deal with it all the time.
All the cool kids are talking about shortcomings in digitized text databases. I don’t have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it’s not just at the margins we’re missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here’s an example.
Shane Landrum (UNLISTED CITATION) says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don’t mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.
I’ve been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They’re occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.
I wanted to see how well the vector space model of documents I’ve been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you’re sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab’s Pamphlet One, made me suspect individual books would be sloppier. There are a couple different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. For example, we can color all the books in LCC subclass “BF” (psychology) blue and those in “QE” (geology) red, overlaying them on a chart of the first two principal components like the ones I’ve been using for the last two posts:
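The overlay I’m describing can be sketched in a few lines. This is only a minimal illustration, assuming scikit-learn; the document-term matrix, subclass labels, and colors here are all hypothetical stand-ins for the real data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for a document-term matrix: rows are books,
# columns are word frequencies (hypothetical data).
rng = np.random.default_rng(0)
X = rng.random((10, 50))

# Project each book onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)

# Hypothetical LCC subclass labels: BF (psychology) blue, QE (geology) red.
labels = ["BF"] * 5 + ["QE"] * 5
colors = ["blue" if lab == "BF" else "red" for lab in labels]

for (x, y), c in zip(coords, colors):
    print(f"{c}: PC1={x:.2f}, PC2={y:.2f}")
```

The same coordinates would normally go straight into a scatterplot; printing them just keeps the sketch self-contained.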
I’ve spent a lot of the last week trying to convince Princeton undergrads it’s OK to occasionally disagree with each other, even if they’re not sure they’re right. So let me make a note on one of the places I’ve felt a little bit of skepticism as I try to figure out what’s going on with the digital humanities.
In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I’ve called 1922 the year digital history ends before; for the kind of work I want to see, it’s nearly an insuperable barrier, and it’s one I think not enough non-tech-savvy humanists think about. So let me dig in a little.
The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I’ll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.
Patricia Cohen’s new article about the digital humanities doesn’t come with the rafts of crotchety comments the first one did, so unlike last time I’m not in a defensive crouch. To the contrary: I’m thrilled and grateful that Dan Cohen, the main subject of the article, took the time in his moment in the sun to link to me. The article itself is really good, not just because the Cohen-Gibbs Victorian project is so exciting, but because P. Cohen gets some thoughtful comments and the NYT graphic designers, as always, do a great job. So I just want to focus on the Google connection for now, and then I’ll post my versions of the charts the Times published.
Jamie’s been asking for some thoughts on what it takes to do this–statistics backgrounds, etc. I should say that I’m doing this, for the most part, the hard way, because 1) my database is too large to start out using most tools I know of, including I think the R text-mining package, and 2) I want to understand how it works better. I don’t think I’m going to do the software review thing here, but there are what look like a *lot* of promising leads at an American Studies blog.
I’ve had “digital humanities” in the blog’s subtitle for a while, but it’s a terribly offputting term. I guess it’s supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn’t appeal to most humanists of the tweed- and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining them.
Jamie asked about assignments for students using digital sources. It’s a difficult question.
A couple weeks ago someone referred an undergraduate to me who was interested in using some sort of digital maps for a project on a Cuban emigre writer, like the maps I made of Los Angeles German emigres a few years ago. Like most history undergraduates, she didn’t have any programming background, and she didn’t have a really substantial pile of data to work with from the start. For her to do digital history, she’d have to type hundreds of addresses and dates off of letters from the archives, and then learn some sort of GIS software or the Google Maps API, without any clear payoff. No one would get much out of forcing her to spend three days playing with databases when she’s really looking at the contents of letters.
Most intensive text analysis is done on heavily maintained sources. I’m using a mess, by contrast, but a much larger one. Partly, I’m doing this tendentiously–I think it’s important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.
One more note on that Grafton quote, which I’ll post below.
“The digital humanities do fantastic things,” said the eminent Princeton historian Anthony Grafton. “I’m a believer in quantification. But I don’t believe quantification can do everything. So much of humanistic scholarship is about interpretation.”
I’m in Moscow now. I still have a few things to post from my layover, but there will be considerably lower volume through Thanksgiving.
I don’t want to comment too much on yesterday’s (today’s? I can’t tell anymore) article about digital humanities in the New York Times, but a couple of people e-mailed about it. So a couple random points:
All right, let’s put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching, which Jamie asked for more about. Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term ‘scientific method.’ I assume he was asking for a chart showing its usage over time, but with the data in hand I already have a lot of other interesting displays we can use. This post is a sort of catalog of some of the low-hanging fruit in text analysis.
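The simplest of those displays, the one Henry’s question calls for, is just a count of a phrase’s usage per year. A toy sketch, with a three-item made-up corpus standing in for the real database (everything below, corpus included, is hypothetical):

```python
from collections import Counter

# Hypothetical corpus: (year, text) pairs standing in for scanned books.
corpus = [
    (1850, "the scientific method was not yet a common phrase"),
    (1900, "the scientific method appears in scientific method textbooks"),
    (1950, "scientific method everywhere: the scientific method rules"),
]

def usage_by_year(corpus, phrase):
    """Count occurrences of a phrase in each year's text."""
    counts = Counter()
    for year, text in corpus:
        counts[year] += text.count(phrase)
    return dict(counts)

print(usage_by_year(corpus, "scientific method"))
# → {1850: 1, 1900: 2, 1950: 2}
```

On a real corpus you would normalize by the total words per year before charting, since the number of books digitized varies wildly across time.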
Obviously, I like charts. But I’ve periodically been presenting data as a number of random samples, as well. It’s a technique that can be important for digital humanities analysis. And it’s one that draws more on the skills of humanistic training, so it might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own–it’s just a set of coordinates. Even in the big education datasets I used to work with, the core facts I was aggregating up from were generally very dull–one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there’s real meaning embodied in every point, meaning we’re far better equipped than the computer to understand it. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can’t read everything ourselves, but it’s good to check up periodically–that’s why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample are.
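Pulling those 20 random titles takes nothing beyond the standard library. A minimal sketch, with an invented catalog standing in for the real sample of books:

```python
import random

# Hypothetical catalog: stand-ins for the book titles in the sample.
titles = [f"Book {i}" for i in range(1, 1001)]

# Draw 20 titles without replacement to eyeball by hand.
random.seed(42)  # fixed seed so the spot-check is reproducible
sample = random.sample(titles, 20)
for title in sample:
    print(title)
```

Sampling without replacement matters here: the point is to read 20 distinct books' worth of metadata, not to estimate a statistic, so duplicates would just waste reading time.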