You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

I periodically write about Google Books here, so I thought I’d point out something that I’ve noticed recently that should be concerning to anyone accustomed to treating it as the largest collection of books: it appears that when you use a year constraint on book search, the search index has dramatically constricted to the point of being, essentially, broken.

I did a slightly deeper dive into data about salaries by college major while working on my new Atlantic article on the humanities crisis. As I say there, the quality of data about salaries by college major has improved dramatically in the last 8 years. I linked to others’ analysis of the ACS data rather than run my own, but I did some preliminary exploration of the salary data that may be useful to see.

NOTE 8/23: I’ve written a more thoughtful version of this argument for the Atlantic. They’re not the same, but if you only read one piece, you should read that one.

Back in 2013, I wrote a few blog posts arguing that the media was hyperventilating about a “crisis” in the humanities, when, in fact, the long-term trends were not especially alarming. I made two claims then: 1) The biggest drop in humanities degrees relative to other degrees in the last 50 years happened between 1970 and 1985, and the share was steady from 1985 to 2011; as a proportion of the population, humanities majors exploded. 2) The entirety of the long-term decline from 1950 to 2010 had to do with the changing majors of women, while men’s humanities interest did not change.

Historians generally acknowledge that both undergraduate and graduate methods training need to teach students how to navigate and understand online searches. See, for example, this recent article in Perspectives.  Google Books is the most important online resource for full-text search; we should have some idea what’s in it.

Matthew Lincoln recently put up a Twitter bot that walks through chains of historical artwork by vector space similarity. https://twitter.com/matthewdlincoln/status/1003690836150792192.
The idea comes from a Google project looking at paths that traverse similar paintings.

This is a blog post I’ve had sitting around in some form for a few years; I wanted to post it today because:

1) It’s about peer review, and it’s peer review week! I just read this nice piece by Ken Wissoker in its defense.
2) There’s a conference on argumentation in Digital History this weekend at George Mason which I couldn’t attend for family reasons but wanted to resonate with at a distance. 

Digging through old census data, I realized that Wikipedia has some really amazing town-level historical population data, particularly for the Northeast, thanks to one editor in particular typing up old census reports by hand. (And also for French communes, but that’s neither here nor there.) I’m working on pulling it into shape for the whole country, but this is the most interesting part.

I’ve been doing a lot of reading about population density cartography recently. With election-map cartography remaining a major issue, there’s been lots of discussion of such maps: and the “Joy Plot” is currently getting lots of attention.

Robert Leonard has an op-ed in the Times today that includes the following anecdote:

The Library of Congress has released MARC records that I’ll be doing more with over the next several months to understand the books and their classifications. As a first stab, though, I wanted to simply look at the history of how the Library digitized card catalogs to begin with.

One of the interesting things about contemporary data visualization is that the field has a deep sense of its own history, but that “professional” historians haven’t paid a great deal of attention to it yet. That’s changing. I attended a conference at Columbia last weekend about the history of data visualization and data visualization as history. One of the most important strands that emerged was about the cultural conditions necessary to read data visualization. Dancing around many mentions of the canonical figures in the history of datavis (Playfair, Tukey, Tufte) were questions about the underlying cognitive apparatus with which humans absorb data visualization. What makes the designers of visualizations think that some forms of data visualization are better than others? Does that change?

I want to post a quick methodological note on diachronic (and other forms of comparative) word2vec models.

This is a really interesting field right now. Hamilton et al have a nice paper that shows how to track changes using procrustean transformations: as the grad students in my DH class will tell you with some dismay, the web site is all humanists really need to get the gist.
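
If you want the gist of the procrustean step itself in code rather than prose, here’s a minimal R sketch, assuming two hypothetical matrices `early` and `late` whose rows are the vectors for a shared vocabulary. This is the general idea, not Hamilton et al.’s exact code.

```r
# Orthogonal Procrustes: find the rotation Q that best maps the early space
# onto the late one, so the two models become directly comparable.
align_embeddings <- function(early, late) {
  s <- svd(t(early) %*% late)
  Q <- s$u %*% t(s$v)        # the optimal rotation
  early %*% Q                # early vectors, rotated into the late space
}

# Semantic change for a word is then just the cosine distance between its
# rotated early vector and its late vector.
cosine_distance <- function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))
```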

This is a quick digital-humanities public service post with a few sketchy questions about OCR as performed by Google.

When I started working intentionally with computational texts in 2010 or so, I spent a while worrying about the various ways that OCR–optical character recognition–could fail.

Like everyone else, I’ve been churning over the election results all month. Setting aside the important stuff, understanding election results temporally presents an interesting challenge for visualization.

Geographical realignments are common in American history, but they’re difficult to get an aggregate handle on. You can animate a map, but that makes comparison through time difficult. (One with snappy music is here). You can make a bunch of small multiple maps for every given election, but that makes it quite hard to compare a state to itself across periods. You can make a heatmap, but there’s no ability to look regionally if states are in alphabetical order.

I’m pulling this discussion out of the comments thread on Scott Enderle’s blog, because it’s fun. This is the formal statement of what will forever be known as the efficient plot hypothesis for plot arceology. Nobel Prize in culturomics, here I come.

Word embedding models are kicking up some interesting debates at the confluence of ethics, semantics, computer science, and structuralism. Here I want to lay out some of the elements of one recent place where that debate has been unfolding: inside computer science.

I’ve been chewing on this paper out of Princeton and Bath on bias and word embedding algorithms. (Link is to a blog post description that includes the draft). It stands in an interesting relation to this paper out of BU and Microsoft Research, which presents many similar findings but also a debiasing algorithm similar to (but better than) the one I’d used to find “gendered synonyms” in a gender-neutralized model. (I’ve since gotten a chance to talk in person to the second team, so I’m reflecting primarily on the first paper here).
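
The core operation both papers build on is easy to sketch. Here’s a minimal, hypothetical R version of the simplest move: estimate a “gender direction” from a single seed pair and project it out of every word vector. Neither paper’s actual algorithm is this crude; this is just the shape of the thing.

```r
# `vecs` is a hypothetical matrix of word vectors, one row per word.
debias <- function(vecs) {
  g <- vecs["he", ] - vecs["she", ]        # a crude one-pair estimate of the direction
  g <- g / sqrt(sum(g^2))                  # normalize to unit length
  vecs - (vecs %*% g) %*% t(g)             # remove each word's component along g
}
```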

Debates in the Digital Humanities 2016 is now online, and includes my contribution, “Do Digital Humanists Need to Understand Algorithms?” (As well as a pretty snazzy cover image…) In it I lay out a distinction between transformations, which are about states of texts, and algorithms, which are about processes. Put briefly:

Some scientists came up with a list of the 6 core story types. On the surface, this is extremely similar to Matt Jockers’s work from last year. Like Jockers, they use a method for disentangling plots that is based on sentiment analysis, justify it mostly with reference to Kurt Vonnegut, and choose a method for extracting ur-shapes that naturally but opaquely produces harmonic-shaped curves. (Jockers uses the Fourier transform; the authors here use SVD.) I started writing up some thoughts on this two weeks ago, stopped, and then got a media inquiry about the paper, so thought I’d post my concerns here. These ramp up from the basic but important (only about 40% of the texts they are using are actually fictional stories) to the big one that ties back into Jockers’s original work: why use sentiment analysis at all? This leads back into a sort of defense of my method of topic trajectories for describing plots and some bigger requests for others working in the field.
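
For concreteness, here’s roughly what the SVD move amounts to, in a minimal R sketch: stack each story’s sentiment trajectory (interpolated to a common length) into a row of a hypothetical matrix, and let the decomposition hand back a few basis shapes. Smooth inputs plus an orthogonal decomposition is exactly the recipe that tends to produce harmonic-looking “fundamental” curves.

```r
extract_arcs <- function(sentiment_matrix, k = 6) {
  centered <- scale(sentiment_matrix, center = TRUE, scale = FALSE)
  s <- svd(centered)
  s$v[, 1:k]    # the candidate "core story shapes"
}

# matplot(extract_arcs(sentiment_matrix), type = "l")  # eyeball the six shapes
```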

I usually keep my mouth shut in the face of the many hilarious errors that crop up in the burgeoning world of datasets for cultural analytics, but this one is too good to pass up. Nature has just published a dataset description paper that appears to devote several paragraphs to describing “center of population” calculations made on the basis of a flat earth.
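
For anyone wondering what the flat-earth version gets wrong: a weighted average of raw latitudes and longitudes treats the coordinates as if they lay on a plane. Here’s a sketch of both calculations, on hypothetical vectors of coordinates (in degrees) and population weights; the spherical one goes through 3-D space and back.

```r
flat_earth_center <- function(lat, lon, pop) {
  c(lat = weighted.mean(lat, pop), lon = weighted.mean(lon, pop))
}

spherical_center <- function(lat, lon, pop) {
  rad <- pi / 180
  # Convert to 3-D Cartesian coordinates, take the weighted mean, project back.
  x <- weighted.mean(cos(lat * rad) * cos(lon * rad), pop)
  y <- weighted.mean(cos(lat * rad) * sin(lon * rad), pop)
  z <- weighted.mean(sin(lat * rad), pop)
  c(lat = atan2(z, sqrt(x^2 + y^2)) / rad, lon = atan2(y, x) / rad)
}
```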

I started this post with a few digital-humanities posturing paragraphs: if you want to read them, you’ll encounter them eventually. But instead let me just get to the point: here’s a trite new category of analysis that wouldn’t be possible without distant reading techniques, and that produces sometimes charmingly serendipitous results.

A heads-up for those with this blog on their RSS feeds: I’ve just posted a couple things of potential interest on one of the two other blogs (errm) I’m running on my own site.

One, “Vector Space Models for the digital humanities,” describes how a newly improved class of algorithms known as word embedding models work and showcases some of their potential applications for digital humanities researchers.

Mitch Fraas and I have put together a two-part interactive for the Atlantic using Bookworm as a backend to look at the changing language in the State of the Union. Yoni Appelbaum, who just took over this week, spearheaded a great team over there including Chris Barna, Libby Bawcombe, Noah Gordon, Betsy Ebersole, and Jennie Rothenberg Gritz, who took some of the Bookworm prototypes and built them into a navigable, attractive overall package. Thanks to everyone.

Far and away the most interesting idea of the new government college ratings emerges toward the end of the report. It doesn’t quite square the circle of competing constituencies for the rankings I worried about in my last post, but it gets close. Lots of weight is placed on a single magic model that will predict outcomes regardless of all the confounding factors they raise (differing pay by gender, sex, possibly even degree composition). As an inveterate modeler and data hound, I can see the appeal here. The federal government has far better data than US News and World Report, in the guise of the student loan repayment forms; this data will enable all sorts of useful studies on the effects of everything from home-schooling to early-marriage. I don’t know that anyone is using it yet for the sort of studies it makes possible (do you?), but it sounds like they’re opening the vault just for these college ranking purposes.

Before the holiday, the Department of Education circulated a draft prospectus of the new college rankings they hope to release next year. That afternoon, I wrote a somewhat dyspeptic post on the way that these rankings, like all rankings, will inevitably be gamed. But it’s probably better to set that aside and instead point out a couple of looming problems with the system we may be working under soon. The first is that the audience for these rankings is unresolved in a very problematic way; the second is that altogether too much weight is placed on a regression model solving every objection that has been raised. Finally, I’ll lay out my “constructive” solution for salvaging something out of this: rather than use a three-tiered “excellent”-“adequate”-“needs improvement” scale, everyone would be better served if we switched to a two-tiered “Good”/“Needs Improvement” system. Since this is sort of long, I’ll break it up into three posts: the first is below.

Sometimes it takes time to make a data visualization, and sometimes they just fall out of the data practically by accident. Probably the most viewed thing I’ve ever made, of shipping lines as spaghetti strings, is one of the latter. I’m working to build one of the former for my talk at the American Historical Association out of the Newberry Library’s remarkable Atlas of Historical County Boundaries. But my second ggplot with the set, which I originally did just to make sure the shapefiles were working, was actually interesting. So I thought I’d post it. Here’s the graphic: then the explanation. Click to enlarge.

Note: a somewhat more complete and slightly less colloquial, but eminently more citeable, version of this work is in the Proceedings of the 2015 IEEE International Conference on Big Data. Plus, it was only there that I came around to calling the whole endeavor “plot arceology.”
It’s interesting to look, as I did in my last post, at the plot structure of typical episodes of a TV show as derived through topic models. But while it may help in understanding individual TV shows, the method also shows some promise on a more ambitious goal: understanding the general structural elements that most TV shows and movies draw from. TV and movie scripts are carefully crafted structures: I wrote earlier about how the Simpsons moves away from the school after its first few minutes, for example, and with this larger corpus even individual words frequently show a strong bias towards the front or end of scripts. This crafting shows up in the ways language is distributed through them in time.

The most interesting element of the Bookworm browser for movies I wrote about in my last post here is the possibility of delving into the episodic structure of different TV shows by dividing them up by minutes. On my website, I previously wrote about story structures in the Simpsons and a topic model of movies I made using the general-purpose bookworm topic modeling extension. For a description of the corpus or of topic modeling, see those links.

Screen time! Sep 15 2014

Here’s a very fun, and for some purposes, perhaps, a very useful thing: a Bookworm browser that lets you investigate onscreen language in about 87,000 movies and TV shows, encompassing together over 600 million words. (Go follow that link if you want to investigate yourself).

An FYI, mostly for people following this feed on RSS: I just put up on my home web site a post about applications for the Simpsons Bookworm browser I made. It touches on a bunch of stuff that would usually lead me to post it here. (Really, it hits the Sapping Attention trifecta: a discussion of the best ways of visualizing Dunning Log-Likelihood, cryptic allusions to critical theory, and overly serious discussions of popular TV shows.) But it’s even less proofread and edited than what I usually put here, and I’ve lately been more and more reluctant to post things on a Google site like this, particularly as Blogger gets folded more and more into Google Plus. That’s one of the big reasons I don’t post here as much as I used to, honestly. (Another is that I don’t want to worry about embedded javascript). So, head over there if you want to read it.

Right now people in data visualization tend to be interested in their field’s history, and people in digital humanities tend to be fascinated by data visualization. Doing some research in the National Archives in Washington this summer, I came across an early set of rules for graphic presentation by the Bureau of the Census from February 1915. Given those interests, I thought I’d put that list online.

People love to talk about how “practical” different college majors are: and practicality is usually measured in dollars. But those measurements can be very problematic, in ways that might have bad implications for higher education. That’s what this post is about.

Here’s a little irony I’ve been meaning to post. Large scale book digitization makes tools like Ngrams possible; but it also makes tools like Ngrams obsolete for the future. It changes what a “book” is in ways that make the selection criteria for Ngrams—if it made it into print, it must have _some_ significance—completely meaningless.

A map I put up a year and a half ago went viral this winter; it shows the paths taken by ships in the US Maury collection of the ICOADS database. I’ve had several requests for higher-quality versions: I had some up already, but I just put up on Flickr a basically comparable high resolution version. US Maury is “Deck 701” in the ICOADS collection: I also put up charts for all of the other decks with fewer than 3,000,000 points. You can page through them below, or download the high quality versions from Flickr directly. (At the time of posting, you have to click on the three dots to get through to the summaries).

OK: one last post about enrollments, since the statistic that humanities degrees have dropped by half since 1970 has been all over the news for the last two weeks. This is going to be a bit of a data dump: but there’s a shortage of data on the topic out there, so forgive me.

In my last two posts, I made two claims about that aspect of the humanities “crisis:”

A quick addendum to my post on long-term enrollment trends in the humanities. (This topic seems to have legs, and I have lots more numbers sitting around I find useful, but they’ve got to wait for now).

There was an article in the Wall Street Journal about low enrollments in the humanities yesterday. The heart of the story is that the humanities resemble the late Roman Empire, teetering on a collapse precipitated by their inability to get jobs like those computer scientists can provide. (Never mind that the news hook is a Harvard report about declining enrollments in the humanities, which makes pretty clear that the real problem is students who are drawn to the social sciences, not competition from computer scientists.)

What are the major turning points in history? One way to think about that is to simply look at the most frequent dates used to start or end dissertation periods.* That gives a good sense of the general shape of time.

*For a bit more about how that works, see my post on the years covered by history dissertations: I should note I’m using a better metric now that correctly gets the end year out of text strings like “1848-61.”
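
The “better metric” is mostly just careful string handling. A minimal, hypothetical sketch of the kind of fix involved:

```r
end_year <- function(title) {
  m <- regmatches(title, regexpr("[0-9]{4}-[0-9]{2,4}", title))
  if (length(m) == 0) return(NA)
  parts <- as.numeric(strsplit(m, "-")[[1]])
  start <- parts[1]; end <- parts[2]
  if (end < 100) end <- (start %/% 100) * 100 + end   # borrow the century from the start year
  end
}

end_year("Reform politics, 1848-61")   # 1861, not 61
```
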
Here’s what that list looks like: the most common years used in dissertation titles. It’s extremely spiky–some years are a lot more common than others.

Here’s some inside baseball: the trends in periodization in history dissertations since the beginning of the American historical profession. A few months ago, Rob Townsend, who until recently kept everyone extremely well informed about professional trends at the American Historical Association,* sent me the list of all dissertation titles in history the American Historical Association knows about from the last 120 years. (It’s incomplete in some interesting ways, but that’s a topic for another day). It’s textual data. But sometimes the most interesting textual data to analyze quantitatively are the numbers that show up. Using a Bookworm database, I just pulled out from the titles any years mentioned: that lets us see what periods of the past historians have been most interested in, and what sorts of periods they’ve described.

The new issue of the Journal of Digital Humanities is up as of yesterday: it includes an article of mine, “Words Alone,” on the limits of topic modeling. In true JDH practice, it draws on my two previous posts on topic modeling, here and here. If you haven’t read those, the JDH article is now the place to go. (Unless you love reading prose chock full’ve contractions and typos. Then you can stay in the originals.) If you have read them, you might want to know what’s new or why I asked the JDH editors to let me push those two articles together. In the end, the changes ended up being pretty substantial.

Patchwork Libraries Mar 29 2013

The hardest thing about studying massive quantities of digital texts is knowing just what texts you have. This is knowledge that we haven’t been particularly good at collecting, or at valuing.


The post I wrote two years ago about the Google Ngram chart for 02138 (the zip code for Cambridge, MA) remains a touchstone for me because it shows the ways that materiality, copyright, and institutional decisions can produce data artifacts that are at once inscrutable and completely understandable. (Here’s the chart–go read the original post for the explanation.)



Since then, I’ve talked a lot about the need to understand both the classification schemes for individual libraries and the complicated historical provenance of the digital sources we use.

My last post had the aggregate statistics about which parts of the library have more female characters. (Relatively). But in some ways, it’s more interesting to think about the ratio of male and female pronouns in terms of authors whom we already know. So I thought I’d look for the ratios of gendered pronouns in the most-collected authors of the late 19th and early twentieth centuries, to see what comes out.
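
The measurement itself is simple enough to sketch. Assuming a hypothetical data frame `tokens` with one row per word and columns `author` and `word`, something like this gives the share of gendered pronouns that are female, by author:

```r
pronoun_share_female <- function(tokens) {
  male   <- c("he", "him", "his", "himself")
  female <- c("she", "her", "hers", "herself")
  tapply(tokens$word, tokens$author, function(w) {
    sum(w %in% female) / (sum(w %in% female) + sum(w %in% male))
  })
}

# sort(pronoun_share_female(tokens))  # authors ordered from most male to most female pronoun use
```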

Now back to some texts for a bit. Last spring, I posted a few times about the possibilities for reading genders in large collections of books. I didn’t follow up because I have some concerns about just what to do with this sort of pronoun data. But after talking about it to Ryan Cordell’s class at Northeastern last week, I wanted to think a little bit more about the representation of male and female subjects in late-19th century texts. Further spurs: Matt Jockers recently posted the pronoun usage in his corpus of novels, and Jeana Jorgensen pointed to recent research by Kathleen Ragan that suggests that editorial and teller effects have a massive effect on the gender of protagonists in folk tales. Bookworm gives a great platform for looking at this sort of question.

I’m cross-posting here a piece from my language anachronisms blog, Prochronisms.

It won’t appear on the language blog for a week or two, to keep the posting schedule there more regular. But I wanted to put it here now, because it ties directly into the conversation in my last post about whether words are the atomic units of languages. The presumption of some physics-inflected linguistics research is that they are. I was putting forward the claim that the atomic units are actually Ngrams of any length. This question is closely tied to the definition of what a ‘word’ is (although as I said in the comments, I think statistical regularities tend to happen at a level that no one would ever call a ‘word,’ however broad a definition they take).

My last post was about how the frustrating imprecisions of language drive humanists towards using statistical aggregates instead of words: this one is about how they drive scientists to treat words as fundamental units even when their own models suggest they should be using something more abstract.

Crossroads Jan 10 2013

Just a quick post to point readers of this blog to my new Atlantic article on anachronisms in Kushner/Spielberg’s Lincoln; and to direct Atlantic readers interested in more anachronisms over to my other blog, Prochronisms, which is currently churning on through the new season of Downton Abbey. (And to stick around here; my advanced market research shows you might like some of the posts about mapping historical shipping routes.)

Following up on my previous topic modeling post, I want to talk about one thing humanists actually do with topic models once they build them, most of the time: chart the topics over time. Since I think that, although Topic Modeling can be very useful, there’s too little skepticism about the technique, I’m venturing to provide it (even with, I’m sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat ‘topics’ in topic modeling as stable abstractions, and argue for a much greater attention to the granular words that make up a topic model.

A stray idea left over from my whaling series: just how much should digital humanists be flocking to military history? Obviously the field is there a bit already: the Digital Scholarship lab at Richmond in particular has a number of interesting Civil War projects, and the Valley of the Shadow is one of the archetypal digital history projects. But it’s possible someone could get a lot of mileage out of doing a lot more.

[Temporary note, March 2015: those arriving from reddit may also be interested in this post, which has a bit more about the specific image and a few more like it.]

Note: this post is part 5 of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part 4, section 2 of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part 4 of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part I of my series on whaling logs and digital history. For the full overview, click here.

Here’s a special post from the archives of my ‘too-boring for prime time’ files. I wrote this a few months ago but didn’t know if anyone needed it: but now I’ll pull it out just for Scott Weingart, since I saw him estimating word counts using ‘the,’ which is exactly what this post is about. If that sounds boring to you: for heaven’s sake, don’t read any further.
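
The whole trick fits in a few lines: ‘the’ makes up a roughly stable share of running English text, so its count lets you back out a total length. The 5–6% figure below is an assumption for illustration, not a measured constant.

```r
estimate_word_count <- function(the_count, the_rate = 0.055) {
  the_count / the_rate
}

estimate_word_count(5500)   # roughly 100,000 words
```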

Melville Plots Oct 18 2012

Note: this post is part III of my series on whaling logs and digital history. For the full overview, click here.

Note: this post is part II of my series on whaling logs and digital history. For the full overview, click here.

I’ve been thinking more than usual lately about spatially representing the data in the various Bookworm browsers.

So in this post, I want to do two things:

First, give a quick overview of the geography of the ArXiv. This is interesting in itself–the ArXiv is the most comprehensive source of scientific papers for physics and mathematics, and plays a substantial role in some other fields. And it’s good for me going forward, as a way to build up some code that can be used on other collections.

A follow up on my post from yesterday about whether there’s more history published in times of revolution. I was saying that I thought the dataset Google uses must be counting documents of historical importance as history: because libraries tend to shelve in a way that conflates things that are about history and things that are history.

A quick post about other people’s data, when I should be getting mine in order:

[Edit–I have a new post here with some concrete examples from the US Civil War of the pattern described in this post]

It’s pretty obvious that one of the many problems in studying history by relying on the print record is that writers of books are disproportionately male.

Data can give some structure to this view. Not in the complicated, archival-silences filling way–that’s important, but hard–but just in the most basic sense. How many women were writing books? Do projects on big digital archives only answer, as Katherine Harris asks, “how do men write?” Where were gender barriers strongest, and where weakest? Once we know these sorts of things, it’s easy to do what historians do: read against the grain of archives. It doesn’t matter if they’re digital or not.

We just rolled out a new version of Bookworm (now going under the name “Bookworm Open Library”) that works on the same codebase as the ArXiv Bookworm released last month. The most noticeable changes are a cleaner and more flexible UI (mostly put together for the ArXiv by Neva Cherniavksy and Martin Camacho, and revamped by Neva to work on the OL version), coupled with some behind-the-scenes tweaks that should make it easy to add new Bookworms on other sets of texts in the future. But as a little bonus, there’s an additional metadata category in the Open Library Bookworm we’re calling “author gender.”

[The American Antiquarian Society conference in Worcester last weekend had an interesting rider on the conference invitation–they wanted 500 words from each participant on the prospects for independent research libraries. I’m posting that response here.]

Here’s the basic idea:

I saw some historians talking on Twitter about a very nice data visualization of shipping routes in the 18th and 19th centuries on Spatial Analysis. (Which is a great blog–looking through their archives, I think I’ve seen every previous post linked from somewhere else before).

Turning off the TV Apr 04 2012

I’m starting up a new blog, Prochronism (an obscure near-synonym for ‘anachronism’), for anything I want to post about TV- and movie-related anachronisms and historical language. There are two new posts up there right now: on the season premiere of Mad Men and Sunday night’s episode.

[The following is a revised version of my talk on the ‘collaboration’ panel at a conference about “Needs and Opportunities for the Research Library in the Digital Age” at the American Antiquarian Society in Worcester last week. Thanks to Paul Erickson for the invitation to attend, and everyone there for a fascinating weekend.]

[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

I’ve got an article up today on the Atlantic’s web site about how Mad Men stacks up against historical language usage. So if you’re reading this blog, go read that.

A quick follow-up on this issue of author gender.

In my last post, I looked at first names as a rough gauge of author gender to see who is missing from libraries. This method has two obvious failings as a way of finding gender:

I just saw that various Digital Humanists on Twitter were talking about representativeness, exclusion of women from digital archives, and other Big Questions. I can only echo my general agreement about most of the comments.

I wanted to try to replicate and slightly expand Ted Underwood’s recent discussion of genre formation over time using the Bookworm dataset of Open Library books. I couldn’t, quite, but I want to just post the results and code up here for those who have been following that discussion. Warning: this is a rather dry post.

[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

It’s Monday, so let’s run last night’s episode of Downton Abbey through the anachronism machine. I looked for Downton Abbey anachronisms for the first time last week: using the Google Ngram dataset, I can check every two-word phrase in an episode to see if it’s more common today than then. This 1) lets us find completely anachronistic phrases, which is fun; and 2) lets us see how the language has evolved, and what shows do the best job at it. [Since some people care about this–don’t worry, no plot spoilers below].
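
Mechanically, the check is nothing fancy. A minimal sketch, assuming a hypothetical data frame `ngrams` with per-million frequencies for each two-word phrase around 1917 and today (columns `phrase`, `freq_1917`, `freq_now`):

```r
bigrams <- function(words) paste(head(words, -1), tail(words, -1))

flag_anachronisms <- function(script_words, ngrams, threshold = 10) {
  m <- merge(data.frame(phrase = bigrams(script_words)), ngrams, by = "phrase")
  m$ratio <- (m$freq_now + 1e-9) / (m$freq_1917 + 1e-9)   # "more common today than then"
  suspicious <- m[m$ratio > threshold, c("phrase", "ratio")]
  suspicious[order(-suspicious$ratio), ]
}
```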

I. The new USIH blogger LD Burnett has a post up expressing ambivalence about the digital humanities because it is too eager to reject books. This is a pretty common argument, I think, familiar to me in less eloquent forms from New York Times comment threads. It’s a rhetorically appealing position–to set oneself up as a defender of the book against the philistines who not only refuse to read it themselves, but want to take your books away and destroy them. I worry there’s some mystification involved–conflating corporate publishers with digital humanists, lumping together books with codices with monographs, and ignoring the tension between reader and consumer. This problem ties up nicely into the big event in DH in the last week–the announcement of the first issue of the ambitiously all-digital Journal of Digital Humanities. So let me take a minute away from writing about TV shows to sort out my preliminary thoughts on books.

[Update: I’ve consolidated all of my TV anachronisms posts at a different blog, Prochronism, and new ones on Mad Men, Deadwood, Downton Abbey, and the rest are going there.]

Digital humanists like to talk about what insights about the past big data can bring. So in that spirit, let me talk about Downton Abbey for a minute. The show’s popularity has led many nitpickers to draft up lists of mistakes. Language Loggers Mark Liberman and Ben Zimmer have looked at some idioms that don’t belong, for Language Log, NPR, and the Boston Globe. In the best British tradition, the Daily Mail even managed to cast the errors as a sort of scandal. But all of these have relied, so far as I can tell, on finding a phrase or two that sounds a bit off, and checking the online sources for earliest use. This resembles what historians do nowadays: go fishing in the online resources to confirm hypotheses, but never ever start from the digital sources. That would be, as the dowager countess might say, untoward.

Though I usually work with the Bookworm database of Open Library texts, I’ve been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there’s also a lot more that could be coming out of the Ngrams set than what I’ve seen in the last year.

Another January, another set of hand-wringing about the humanities job market. So, allow me a brief departure from the digital humanities. First, in four paragraphs, the problem with our current understanding of the history job market; and then, in several more, the solution.

Tony Grafton and Jim Grossman launched the latest exchange with what they call a “modest proposal” for expanding professional opportunities for historians. Jesse Lemisch counters that we need to think bigger and mobilize political action. There’s a big and productive disagreement there, but also a deep similarity: both agree there isn’t funding inside the academy for history PhDs to find work, but think we ought to be able to get our hands on money controlled by someone else. Political pressure and encouraging words will unlock vast employment opportunities in the world of museums, archives, and other public history (Grafton) or government funded jobs programs (Lemisch). These are funny places to look for growth in a 21st-century OECD country (perhaps Bill Cronon could take the more obvious route, and make his signature initiative as AHA president creating new tenure-track jobs in the BRICs?) but the higher levels of the profession don’t see much choice but to change the world.

[This is not what I’ll be saying at the AHA on Sunday morning, since I’m participating in a panel discussion with Stefan Sinclair, Tim Sherrat, and Fred Gibbs, chaired by Bill Turkel. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I’d start with to show how much data we have, and how little things can have different meanings at big scales…]

Genre similarities Dec 16 2011

When data exploration produces Christmas-themed charts, that’s a sign it’s time to post again. So here’s a chart and a problem.

First, the problem. One of the things I like about the posts I did on author age and vocabulary change in the spring is that they have two nice dimensions we can watch changes happening in. This captures the fact that language as a whole doesn’t just up and change–things happen among particular groups of people, and the change that results has shape not just in time (it grows, it shrinks) but across those other dimensions as well.

Ted Underwood has been talking up the advantages of the Mann-Whitney test over Dunning’s Log-likelihood, which is currently more widely used. I’m having trouble getting M-W running on large numbers of texts as quickly as I’d like, but I’d say that his basic contention–that Dunning log-likelihood is frequently not the best method–is definitely true, and there’s a lot to like about rank-ordering tests.
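
For anyone who wants to try the comparison, a minimal sketch: for each word, compare its per-text rates in two corpora with a Wilcoxon rank-sum (Mann-Whitney) test. `rates_a` and `rates_b` are hypothetical matrices here, one row per text and one column per word, with matching columns.

```r
mann_whitney_words <- function(rates_a, rates_b) {
  sapply(colnames(rates_a), function(w) {
    wilcox.test(rates_a[, w], rates_b[, w])$p.value
  })
}

# sort() the result to surface the words whose per-text distributions differ most
# reliably between the corpora; running one rank test per vocabulary item is
# exactly what makes this slow on large collections.
```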

I may (or may not) be about to dash off a string of corpus-comparison posts to follow up the ones I’ve been making the last month. On the surface, I think, this comes across as less interesting than some other possible topics. So I want to explain why I think this matters now. This is not quite my long-promised topic-modeling post, but getting closer.

Dunning Amok Nov 10 2011

A few points following up my two posts on corpus comparison using Dunning Log-Likelihood last month. Just a little bit of technique.

Theory First Nov 03 2011

Natalie Cecire recently started an important debate about the role of theory in the digital humanities. She’s rightly concerned that the THATcamp motto–“more hack, less yack”–promotes precisely the wrong understanding of what digital methods offer:

As promised, some quick thoughts broken off my post on Dunning Log-likelihood. There, I looked at _big_ corpuses–two history classes of about 20,000 books each. But I also wonder how we can use algorithmic comparison on a much smaller scale: particularly, at the level of individual authors or works. English dept. digital humanists tend to rely on small sets of well-curated TEI texts, but even the ugly wilds of machine OCR might be able to offer them some insights. (Sidenote–interesting post by Ted Underwood today on the mechanics of creating a middle group between these two poles).

Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning’s Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.
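
Since the formula is short, here’s a minimal sketch of the statistic itself for a single word: `a` and `b` are its counts in the two corpora, `total_a` and `total_b` the corpus sizes in words. Big scores mean the word is far more attached to one corpus than chance would predict.

```r
dunning_g2 <- function(a, b, total_a, total_b) {
  e1 <- total_a * (a + b) / (total_a + total_b)   # expected count in corpus A
  e2 <- total_b * (a + b) / (total_a + total_b)   # expected count in corpus B
  ll <- function(obs, exp) ifelse(obs == 0, 0, obs * log(obs / exp))
  2 * (ll(a, e1) + ll(b, e2))
}

dunning_g2(a = 400, b = 100, total_a = 1e6, total_b = 1e6)   # a word overused in corpus A
```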

We just launched a new website, Bookworm, from the Cultural Observatory. I might have a lot to say about it from different perspectives; but since it was submitted to the DPLA beta sprint, let’s start with the way it helps you find library books.

We’ve been working on making a different type of browser using the Open Library books I’ve been working with to date, and it’s raised an interesting question I want to think through here.

I think many people looking at word counts on a large scale right now (myself included) have tended to make a distinction between wordcount data on the one hand, and catalog metadata on the other. (I know I have the phrase “catalog metadata” burned into my reflex vocabulary at this point–I’ve had to edit it out of this very post several times.) The idea is that we’re looking at the history of words or phrases, and the information from library catalogs can help to split or supplement that. So for example, my big concern about the ngrams viewer when it came out was that it included only one form of metadata (publication year) to supplement the word-count data, when it should really have titles, subjects, and so on. But that still assumes that word data vs. catalog metadata is a useful binary.

Hank wants me to post more, so here’s a little problem I’m working on. I think it’s a good example of how quantitative analysis can help to remind us of old problems, and possibly reveal new ones, with library collections.

My interest in texts as a historian is particularly focused on books in libraries. Used carefully, an academic library is sufficient to answer many important historical questions. (That statement might seem too obvious to utter, but it’s not–the three most important legs of historical research are books, newspapers, and archives, and the archival leg has been lengthening for several decades in a way that tips historians farther into irrelevance.) A fair concern about studies of word frequency is that they can ignore the particular histories of library acquisition patterns–although I think Anita Guerrini takes that point a bit too far in her recent article on culturomics in Miller-McCune. (By the way, the Miller-McCune article on science PhDs is my favorite magazine article of the last couple of years). A corollary benefit, though, is that they help us to start understanding better just what is included in our libraries, both digital and brick.

I mentioned earlier I’ve been rebuilding my database; I’ve also been talking to some of the people here at Harvard about various follow-up projects to ngrams. So this seems like a good moment to rethink a few pretty basic things about different ways of presenting historical language statistics. For extensive interactions, nothing is going to beat a database or direct access to text files in some form. But for quick interactions, which includes a lot of pattern searching and public outreach, we have some interesting choices about presentation.

Moving Jul 15 2011

Starting this month, I’m moving from New Jersey to do a fellowship at the Harvard Cultural Observatory. This should be a very interesting place to spend the next year, and I’m very grateful to JB Michel and Erez Lieberman Aiden for the opportunity to work on an ongoing and obviously ambitious digital humanities project. A few thoughts on the shift from Princeton to Cambridge:

What's new? Jun 16 2011

Let me get back into the blogging swing with a (too long—this is why I can’t handle Twitter, folks) reflection on an offhand comment. Don’t worry, there’s some data stuff in the pipe, maybe including some long-delayed playing with topic models.

Even at the NEH’s Digging into Data conference last weekend, one commenter brought out one of the standard criticisms of digital work—that it doesn’t tell us anything we didn’t know before. The context was some of Gregory Crane’s work in describing shifting word use patterns in Latin over very long time spans (2000 years) at the Perseus Project: Cynthia Damon, from Penn, worried that “being able to represent this as a graph instead by traditional reading is not necessarily a major gain.” That is to say, we already know this; having a chart restate the things any classicist could tell you is less than useful. I might have written down the quote wrong; it doesn’t really matter, because this is a pretty standard response from humanists to computational work, and Damon didn’t press the point as forcefully as others do. Outside the friendly confines of the digital humanities community, we have to deal with it all the time.

Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn’t happen evenly across all authors. Instead, it can happen generationally; older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like “outside” more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.

The 1940 election Apr 18 2011

A couple weeks ago, I wrote about how ancestry.com structured census data for genealogy, not history, and how that limits what historians can do with it. Last week, I got an interesting e-mail from IPUMS, at the Minnesota population center on just that topic:

All the cool kids are talking about shortcomings in digitized text databases. I don’t have anything so detailed to say as what Goose Commerce or Shane Landrum have gone into, but I do have one fun fact. Those guys describe ways that projects miss things we might think are important but that lie just outside the most mainstream interests—the neglected Early Republic in newspapers, letters to the editor in journals, etc. They raise the important point that digital resources are nowhere near as comprehensive as we sometimes think, which is a big caveat we all need to keep in mind. I want to point out that it’s not just at the margins we’re missing texts: omissions are also, maybe surprisingly, lurking right at the heart of the canon. Here’s an example.

Let’s start with two self-evident facts about how print culture changes over time:

  1. The words that writers use change. Some words flare into usage and then back out; others steadily grow in popularity; others slowly fade out of the language.
  2. The writers using words change. Some writers retire or die, some hit mid-career spurts of productivity, and every year hundreds of new writers burst onto the scene. In the 19th-century US, median author age stays within a few years of 49: that constancy, year after year, means the supply of writers is constantly being replenished from the next generation.

Shane Landrum says my claim that historians have different digital infrastructural needs than other fields might be provocative. I don’t mean this as exceptionalism for historians, particularly not compared to other humanities fields. I do think historians are somewhat exceptional in the volume of texts they want to process—at Princeton, they often gloat about being the heaviest users of the library. I do think this volume is one important reason English has a more advanced field of digital humanities than history does. But the needs are independent of the volume, and every academic field has distinct needs. Data, though, is often structured for either one set of users, or for a mushy middle.

When I first thought about using digital texts to track shifts in language usage over time, the largest reliable repository of e-texts was Project Gutenberg. I quickly found out, though, that they didn’t have publication years for their works, somewhat to my surprise. (It’s remarkable how much metadata holds this sort of work back, rather than data itself). They did, though, have one kind of year information: author birth dates. You can use those to create the same type of charts of word use over time that people like me, the Victorian Books project, or the Culturomists have been doing, but in a different dimension: we can see how all the authors born in a year use language rather than looking at how books published in a year use language.

Cronon's politics Mar 28 2011

Let me step away from digital humanities for just a second to say one thing about the Cronon affair.
(Despite the professor-blogging angle, and that Cronon’s upcoming AHA presidency will probably have the same pro-digital history agenda as Grafton’s, I don’t think this has much to do with DH). The whole “we are all Bill Cronon” sentiment misses what’s actually interesting. Cronon’s playing a particular angle: one that gets missed if we think about him as either a naïve professor, stumbling into the public sphere, or as a liberal ideologue trying to score some points.

Author Ages Mar 24 2011

Back from Venice (which is plastered with posters for “Mapping the Republic of Letters,” making a DH-free vacation that much harder), done grading papers, MAW paper presented. That frees up some time for data. So let me start off by looking, for a little while, at a new pool of book data that I think is really interesting.

I’ve been thinking for a while about the transparency of digital infrastructure, and what historians need to know that currently is only available to the digitally curious. They’re occasionally stirred by a project like ngrams to think about the infrastructure, but when that happens they only see the flaws. But those problems—bad OCR, inconsistent metadata, lack of access to original materials—are present to some degree in all our texts.

Genres in Motion Feb 22 2011

Here’s an animation of the PCA numbers I’ve been exploring this last week.

I wanted to see how well the vector space model of documents I’ve been using for PCA works at classifying individual books. [Note at the outset: this post swings back from the technical stuff about halfway through, if you’re sick of the charts.] While at the genre level the separation looks pretty nice, some of my earlier experiments with PCA, as well as some of what I read in the Stanford Literature Lab’s Pamphlet One, made me suspect individual books would be sloppier. There are a couple of different ways to ask this question. One is to just drop the books as individual points on top of the separated genres, so we can see how they fit into the established space. By the first two principal components, for example, we can make all the books in LCC subclass “BF” (psychology) blue, and use red for “QE” (geology), overlaying them on a chart of the first two principal components like I’ve been using for the last two posts:
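
In code, a minimal sketch of that overlay looks something like this, assuming hypothetical word-rate matrices `genre_rates` (one row per LCC subclass) and `book_rates` (one row per book, with each book’s subclass in a vector `book_class`), with matching columns:

```r
pca <- prcomp(genre_rates, scale. = TRUE)
books_projected <- predict(pca, book_rates)   # drop individual books into the genre space

plot(pca$x[, 1:2], pch = 3)
points(books_projected[, 1:2],
       col = ifelse(book_class == "BF", "blue", "red"))
```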

PCA on years Feb 17 2011

I used principal components analysis at the end of my last post to create a two-dimensional plot of genre based on similarities in word usage. As a reminder, here’s an improved (using all my data on the 10,000 most common words) version of that plot:

Fresh set of eyes Feb 14 2011

One of the most important services a computer can provide for us is a different way of reading. It’s fast, bad at grammar, good at counting, and generally provides a different perspective on texts we already know in one way.

And though a text can be a book, it can also be something much larger. Take library call numbers. Library of Congress classifications are probably the best hierarchical classification of books we’ll ever get. Certainly they’re the best human-done hierarchical classification. It’s literally taken decades for librarians to amass the card catalogs we have now, with their classifications of every book in every university library down to several degrees of specificity. But they’re also a little foreign, at times, and it’s not clear how well they’ll correspond to machine-centric ways of categorizing books. I’ve been playing around with some of the data on LCC classes and subclasses with some vague ideas of what it might be useful for and how we can use categorized genre to learn about patterns in intellectual history. This post is the first part of that.

Going it alone Feb 11 2011

I’ve spent a lot of the last week trying to convince Princeton undergrads it’s OK to occasionally disagree with each other, even if they’re not sure they’re right. So let me make one of my notes on one of the places I’ve felt a little bit of skepticism as I try to figure out what’s going on with the digital humanities.

Genre information is important and interesting. Using the smaller of my two book databases, I can get some pretty good genre information about some fields I’m interested in for my dissertation by using the Library of Congress classifications for the books. I’m going to start with the difference between psychology and philosophy. I’ve already got some more interesting stuff than these basic charts, but I think a constrained comparison like this should be somewhat more clear.

Technical notes Feb 01 2011

I’m changing several things about my data, so I’m going to describe my system again in case anyone is interested, and so I have a page to link to in the future.

Platform
Everything is done using MySQL, Perl, and R. These are all general computing tools, not the specific digital humanities or text processing ones that various people have contributed over the years. That’s mostly because the number and size of files I’m dealing with are so large that I don’t trust an existing program to handle them, and because the existing packages don’t necessarily have implementations for the patterns of change over time I want as a historian. I feel bad about not using existing tools, because the collaboration and exchange of tools is one of the major selling points of the digital humanities right now, and something like Voyeur or MONK has a lot of features I wouldn’t necessarily think to implement on my own. Maybe I’ll find some way to get on board with all that later. First, a quick note on the programs:

Open Library has pretty good metadata. I’m using it to assemble a couple new corpuses that I hope should allow some better analysis than I can do now, but just the raw data is interesting. (Although, with a single 25 GB text file as the best way to interact with it, it’s not always convenient). While I’m waiting for some indexes to build, that will give a good chance to figure out just what’s in these digital sources.

I’m trying to get a new group of texts to analyze. We already have enough books to move along on certain types of computer-assisted textual analysis. The big problems are OCR and metadata. Those are a) probably correlated somewhat, and b) partially superable. I’ve been spending a while trying to figure out how to switch over to better metadata for my texts (which actually means an almost all-new set of texts, based on new metadata). I’ve avoided blogging the really boring stuff, but I’m going to stay with pretty boring stuff for a little while (at least this post and one more later, maybe more) to get this on the record.

In writing about openness and the ngrams database, I found it hard not to reflect a little bit about the role of copyright in all this. I’ve called 1922 the year digital history ends before; for the kind of work I want to see, it’s nearly an insuperable barrier, and it’s one I think not enough non-tech-savvy humanists think about. So let me dig in a little.

The Culturomics authors released a FAQ last week that responds to many of the questions floating around about their project. I should, by trade, be most interested in their responses to the lack of humanist involvement. I’ll get to that in a bit. But instead, I find myself thinking more about what the requirements of openness are going to be for textual research.

Cluster Charts Jan 18 2011

I’ll end my unannounced hiatus by posting several charts that show the limits of the search-term clustering I talked about last week before I respond to a couple things that caught my interest in the last week.

Because of my primitive search engine, I’ve been thinking about some of the ways we can better use search data to a) interpret historical data, and b) improve our understanding of what goes on when we search. As I was saying then, there are two things that search engines let us do that we usually don’t get:

More access to the connections between words makes it possible to separate word-use from language. This is one of the reasons that we need access to analyzed texts to do any real digital history. I’m thinking through ways to use patterns of correlations across books as a way to start thinking about how connections between words and concepts change over time, just as word count data can tell us something (fuzzy, but something) about the general prominence of a term. This post is about how the search algorithm I’ve been working with can help improve this sort of search. I’ll get back to evolution (which I talked about in my post introducing these correlation charts) in a day or two, but let me start with an even more basic question that illustrates some of the possibilities and limitations of this analysis: What was the Civil War fought about?

Basic Search Jan 06 2011

To my surprise, I built a search engine as a consequence of trying to quantify information about word usage in the books I downloaded from the Internet Archive. Before I move on with the correlations I talked about in my last post, I need to explain a little about that.

Correlations Jan 05 2011

How are words linked in their usage? In a way, that’s the core question of a lot of history. I think we can get a bit of a picture of this, albeit a hazy one, using some numbers. This is the first of two posts about how we can look at connections between discourses.

Any word has a given prominence for any book. Basically, that’s the number of times it appears. (The numbers I give here are my TF-IDF scores, but for practical purposes, they’re basically equivalent to the rate of incidence per book when we look at single words. Things only get tricky when looking at multiple word correlations, which I’m not going to use in this post.) To explain graphically: here’s a chart. Each dot is a book, the x axis is the book’s score for “evolution”, and the y axis is the book’s score for “society.”
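
For the curious, the scores behind a chart like that can be sketched quickly, assuming a hypothetical document-term matrix `counts` with one row per book and one column per word:

```r
tf_idf <- function(counts) {
  tf  <- counts / rowSums(counts)                  # term frequency within each book
  idf <- log(nrow(counts) / colSums(counts > 0))   # downweight words found everywhere
  sweep(tf, 2, idf, `*`)
}

scores <- tf_idf(counts)
plot(scores[, "evolution"], scores[, "society"])
cor(scores[, "evolution"], scores[, "society"])    # one rough measure of how linked the two words are
```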

I’ve started thinking that there’s a useful distinction to be made in two different ways of doing historical textual analysis. First stab, I’d call them:

  1. Assisted Reading: Using a computer as a means of targeting and enhancing traditional textual reading—finding texts relevant to a topic, doing low level things like counting mentions, etc.
  2. Text Mining: Treating texts as data sources to be chopped up entirely and recast into new forms like charts of word use or graphics of information exchange that, themselves, require a sort of historical reading.

Call numbers Dec 27 2010

I finally got some call numbers. Not for everything, but for a better portion than I thought I would: about 7,600 records, or c. 30% of my books.

The HathiTrust Bibliographic API is great. What a resource. There are a few odd tricks I had to put in to account for the way they integrate various catalogs (Michigan call numbers are filed under MARC 050, the Library of Congress call number field, while California ones are filed under MARC 090, a local call number field, although both seem to be basically an LCC scheme). But the openness is fantastic–you just plug an OCLC or LCCN identifier into a URL string to get an XML record. It’s possible to get a lot of OCLC numbers, in particular, by scraping Internet Archive pages. I haven’t yet found a good way to go in the opposite direction, though: from a large number of specially chosen Hathi catalogue items to IA books.
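
For anyone who wants to try the same lookup, here is a rough sketch in Python. The URL pattern and response fields follow the current HathiTrust Bibliographic API documentation as I understand it (which returns JSON wrappers rather than bare XML), so the details are assumptions that may not match the version I describe above; the OCLC number is just an example.

```python
# A rough sketch of the lookup described above. The endpoint form and the
# JSON field names are assumptions based on HathiTrust's current Bib API
# documentation; the OCLC number is only an example.
import requests

def hathi_record(id_type, id_value):
    """Fetch a full bibliographic record by identifier type ('oclc', 'lccn', ...)."""
    url = f"https://catalog.hathitrust.org/api/volumes/full/{id_type}/{id_value}.json"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

data = hathi_record("oclc", "424023")          # example identifier
for record in data.get("records", {}).values():
    print(record.get("titles"), record.get("publishDates"))
```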

Finding keywords Dec 26 2010

Before Christmas, I spelled out a few ways of thinking about historical texts as related to other texts based on their use of different words, and ran a couple of examples using months and some abstract nouns. Two of the problems I’ve had with getting useful data out of this approach are:

  1. What words to use? I have 200,000, and processing those would take at least 10 times more RAM than I have (2GB, for the record). (A rough workaround is sketched just below this list.)
  2. What books to use? I can—and will—apply them across the whole corpus, but I think it’s more useful to use the data to draw distinctions between types of books we know to be interesting.
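
One crude way around the first problem, sketched here with hypothetical paths and cutoffs, is simply to cap the vocabulary at the N most common words before building any book-by-word matrix:

```python
# Illustrative only: cap the vocabulary at the N most common words so that a
# book-by-word matrix fits in memory. Paths and the cutoff are hypothetical.
import glob
import re
from collections import Counter

counts = Counter()
for path in glob.glob("books/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        counts.update(re.findall(r"[a-z]+", f.read().lower()))

VOCAB_SIZE = 10_000
vocab = [word for word, _ in counts.most_common(VOCAB_SIZE)]
print(f"Kept {len(vocab)} of {len(counts)} distinct words")
```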

Dan Cohen gives the comprehensive Digital Humanities treatment on Ngrams, and he mostly gets it right. There’s just one technical point I want to push back on. He says the best research opportunities are in the multi-grams. For the post-copyright era, this is true, since they are the only data anyone has on those books. But for pre-copyright stuff, there’s no reason to use the ngrams data rather than just downloading the original books, because:

Second Principals Dec 23 2010

Back to my own stuff. Before the Ngrams stuff came up, I was working on ways of finding books that share similar vocabularies. I said at the end of my second ngrams post that we have hundreds of thousands of dimensions for each book: let me explain what I mean. My regular readers were unconvinced, I think, by my first foray here into principal components, but I’m going to try again. This post is largely a test of whether I can explain principal components analysis to people who don’t know about it, so: correct me if you already understand PCA, and let me know what’s unclear if you don’t. (Or, it goes without saying, skip it.)
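
To make those dimensions concrete, here is a toy version of the reduction, not my actual code: each book becomes a vector of word counts over a capped vocabulary, and PCA projects the vectors down to two dimensions. The corpus path and vocabulary size are placeholders.

```python
# A toy version of the reduction: treat each book as a vector of word counts
# and project onto the first two principal components. The corpus path and
# vocabulary size are placeholders, not the real settings.
import glob

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

paths = sorted(glob.glob("books/*.txt"))
texts = [open(p, encoding="utf-8", errors="ignore").read() for p in paths]

# Rows are books, columns are the 5,000 most common words in the corpus.
X = CountVectorizer(max_features=5000).fit_transform(texts).toarray()

# Each book drops from 5,000 dimensions to 2.
coords = PCA(n_components=2).fit_transform(X)
for path, (x, y) in zip(paths, coords):
    print(f"{path}\t{x:.3f}\t{y:.3f}")
```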

I wrote yesterday about how well the filters applied to remove some books from ngrams work to improve the quality of year information and OCR compared to Google Books.

As I said: ngrams represents the state of the art for digital humanities right now in some ways. Put together some smart Harvard postdocs, a few eminent professors, the Google Books team, some undergrad research assistants for programming, then give them access to Google computing power and proprietary information to produce the datasets, and you’re pretty much guaranteed an explosion of theories and methods.

Missing humanists Dec 17 2010

(First in a series on yesterday’s Google/Harvard paper in Science and its reception.)

Culturomics Dec 16 2010

Days from when I said “Google Trends for historical terms might be worse than nothing” to the release of “Google ngrams:” 12. So: we’ll get to see!

We all know that the OCR on our digital resources is pretty bad. I’ve often wondered if part of the reason Google doesn’t share its OCR is simply that it would show so much ugliness. (A common misreading, ‘tlie’ for ‘the’, gets about 4.6m results in Google Books.) So how bad is the Internet Archive OCR, which I’m using? I’ve started rebuilding my database, and I put in a few checks to get a better idea. Allen already asked some questions in the comments about this, so I thought I’d dump it on to the internet, since there doesn’t seem to be that much out there.
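
For what it’s worth, here is the sort of quick check I mean, sketched against a hypothetical directory of OCR’d text files. It is not how the Google Books number above was produced; it just estimates how often one known misreading turns up locally.

```python
# A rough check on one well-known OCR error: how often does 'tlie' show up
# relative to 'the' in a local set of OCR'd text files? Paths are hypothetical,
# and this is not how the Google Books figure above was produced.
import glob
import re

the_count = tlie_count = 0
for path in glob.glob("ia_texts/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    the_count += words.count("the")
    tlie_count += words.count("tlie")

print(f"'tlie' per 10,000 'the': {10_000 * tlie_count / max(the_count, 1):.1f}")
```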

Avoidance tactics Dec 14 2010

Can historical events suppress use of words? Usage of the word ‘panic’ seems to spike down around the bank panics of 1873 and 1893, and maybe 1837 too. I’m pretty confident this is just an artifact of me plugging a lot of words in to test out how fast the new database is and finding some random noise. There are too many reasons to list: 1857 and 1907 don’t have the pattern, the rebound in 1894 is too fast, etc. It’s only 1873 that really looks abnormal. What do you think:

Capitalist lackeys Dec 12 2010

I’m interested in the ways different words are tied together. That’s sort of the universal feature of this project, so figuring out ways to find them would be useful. I already looked at some ways of finding interesting words for “scientific method,” but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I’m going to think through this staying on “capitalist” as the word of the day. Fair warning: this post is a rambler.

A commenter asked why I don’t improve the metadata instead of doing this clustering stuff, which seems merely to reproduce, poorly, the work of generations of librarians in classifying books. I’d like to. The biggest problem right now for text analysis for historical purposes is metadata (followed closely by OCR quality). What are the sources? I’m going to think through what I know, but I’d love any advice on this, because it’s really outside my expertise.

First Principals Dec 08 2010

Let me get ahead of myself a little.

For reasons related to my metadata, I had my computer assemble some data on the frequencies of the most common words (I explain why at the end of the post). But it raises some exciting possibilities for clustering and principal components analysis (PCA); I can’t resist speculating a little bit about what else it can do to help explore the ways different languages intersect. Some charts are at the bottom.

Back to the Future Dec 06 2010

Maybe this is just Patricia Cohen’s take, but it’s interesting to note that she casts both of the text mining projects she’s put on the Times site this week (Victorian books and the Stanford Literature Lab) as attempts to use modern tools to address questions similar to vast, comprehensive tomes written in the 1950s. There are good reasons for this. Those books are some of the classics that informed the next generation of scholarship in their field; they offer an appealing opportunity to find people who should have read more than they did; and, more than some recent scholarship, they contribute immediately to questions that are of interest outside narrow disciplinary communities. (I think I’ve seen the phrase ‘public intellectuals’ more times in the four days I’ve been on Twitter than in the month before). One of the things that the Times articles highlight is how this work can re-engage a lot of the general public with current humanities scholarship.

Dan asks for some numbers on “capitalism” and “capitalist” similar to the ones on “Darwinism” and “Darwinist” I ran for Hank earlier. That seems like a nice, big question I can use to warm up the new database I set up this week and to get some basic functionality written into it.

This verges on unreflective data-dumping: but because it’s easy and I think people might find it interesting, I’m going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers, on the same terms for which the Times published Cohen’s charts of title word counts. I’ve tossed in a couple of extra words where it seems interesting—including some alternate word-forms that tell a story, using a perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren’t many American books from before then, and even the data from the 30s is a little screwy) to 1922 (the date that digital history ends–thank you, Sonny Bono). In some cases (that 1874 peak for science, for instance), the American and British trends are surprisingly close. Sometimes, they aren’t.
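
The stemmer I used is a Perl script, but for anyone who wants to try the same trick, something like NLTK’s Porter stemmer in Python is a rough equivalent (an assumption on my part, not the original code): map each word to its stem, then sum the counts of everything sharing a stem.

```python
# Not the original Perl script: a rough equivalent using NLTK's Porter stemmer
# (pip install nltk). Mapping words to stems lets related word-forms be summed
# together before charting.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["count", "counts", "counting", "counted", "capitalism", "capitalists"]:
    print(word, "->", stemmer.stem(word))
```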

Patricia Cohen’s new article about the digital humanities doesn’t come with the rafts of crotchety comments the first one did, so unlike last time I’m not in a defensive crouch. To the contrary: I’m thrilled and grateful that Dan Cohen, the main subject of the article, took the time in his moment in the sun to link to me. The article itself is really good, not just because the Cohen-Gibbs Victorian project is so exciting, but because P. Cohen gets some thoughtful comments and the NYT graphic designers, as always, do a great job. So I just want to focus on the Google connection for now, and then I’ll post my versions of the charts the Times published.

Lexical analysis widens the hermeneutic circle. The statistics need to be kept close to the text to keep any work sufficiently under the researcher’s control. I’ve noticed that when I ask the computer to do too much work for me in identifying patterns, outliers, and so on, it frequently responds with mistakes in the data set, not with real historical data. So as I start to harness this new database, one of the big questions is how to integrate what the researcher already knows into the patterns he or she is analyzing.

Dan Cohen, the hub of all things digital history, in the news and on his blog.

I have my database finally running in a way that lets me quickly select data about books. So now I can start to ask questions that are more interesting than just how overall vocabulary shifted in American publishers. The question is, what sort of questions? I’ll probably start to shift to some of my dissertation stuff, about shifts in verbs modifying “attention”, but there are all sorts of things we can do now. I’m open to suggestions, but here are some random examples:

So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting–how does the publishing industry focus in on certain figures to create news or resurgences of interest in them? I love the way we get excited about the Civil War sesquicentennial now, or the Darwin/Lincoln year last year.

I was starting to write about the implicit model of historical change behind loess curves, which I’ll probably post soon, when I started to think some more about a great counterexample to the gradual change I’m looking for: the patterns of commemoration for anniversaries. At anniversaries, as well as news events, I often see big spikes in wordcounts for an event or person.

Do it yourself Dec 02 2010

Jamie’s been asking for some thoughts on what it takes to do this–statistics backgrounds, etc. I should say that I’m doing this, for the most part, the hard way, because 1) my database is too large to start out using most tools I know of, including, I think, the R text-mining package, and 2) I want to understand how it works better. I don’t think I’m going to do the software review thing here, but there are what look like a _lot_ of promising leads at an American Studies blog.

I’ve had “digital humanities” in the blog’s subtitle for a while, but it’s a terribly off-putting term. I guess it’s supposed to evoke future frontiers and universal dissemination of humanistic work, but it carries an unfortunate implication that the analog humanities are something completely different. It makes them sound older, richer, more subtle—and scheduled for demolition. No wonder a world of online exhibitions and digital texts doesn’t appeal to most humanists of the tweed- and dust-jacket crowd. I think we need a distinction that better expresses how digital technology expands the humanities, rather than constraining it.

Jamie asked about assignments for students using digital sources. It’s a difficult question.

A couple of weeks ago someone referred an undergraduate to me who was interested in using some sort of digital maps for a project on a Cuban emigre writer, like the ones I did of Los Angeles German emigres a few years ago. Like most history undergraduates, she didn’t have any programming background, and she didn’t have a really substantial pile of data to work with from the start. For her to do digital history, she’d have to type hundreds of addresses and dates off of letters from the archives, and then learn some sort of GIS software or the Google Maps API, without any clear payoff. No one would get much out of forcing her to spend three days playing with databases when she’s really looking at the contents of letters.

Mostly a note to myself:

I think genre data would be helpful in all sorts of ways–tracking evolutionary language through different sciences, say, or finding which discourses are the earliest to use certain constructions like “focus attention.” The Internet Archive books have no genre information in their metadata, for the most part. The genre data I think I want to use would be Library of Congress call numbers–they divide up books in all sorts of ways, at various levels that I could parse. It’s tricky to get from one to the other, though. I could try to hit the LOC catalog with a script that searches for title, author, and year from the metadata I do have, but that would miss a lot and maybe produce false positives; plus, the LOC catalog is sort of tough to machine-query. Or I could try to run a completely statistical clustering, but I don’t trust that it would come out with categories that correspond to ones in common use. Some sort of hybrid method might be best–just a quick sketch below.
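
One small piece of the hybrid idea is easy to illustrate on its own: once a call number is in hand, its leading letters give a coarse subject class that could stand in for genre. The regex and the heavily abbreviated class table below are illustrative assumptions about typical LCC formatting, not a finished mapping.

```python
# A small, illustrative piece of the "hybrid" idea: the leading letters of an
# LC call number give a coarse subject class. The regex and the abbreviated
# class table are assumptions about typical LCC formatting.
import re

LCC_CLASSES = {
    "B": "Philosophy, Psychology, Religion",
    "D": "World History",
    "E": "History of the Americas",
    "P": "Language and Literature",
    "Q": "Science",
    "T": "Technology",
}

def lcc_class(call_number):
    """Return the top-level LCC letter and label, if the string looks like an LCC number."""
    m = re.match(r"\s*([A-Z]{1,3})\s*\d", call_number)
    if not m:
        return None
    letter = m.group(1)[0]          # the first letter is the broadest class
    return letter, LCC_CLASSES.get(letter, "other")

print(lcc_class("QH365 .O2 1859"))   # -> ('Q', 'Science')
print(lcc_class("PS1305 .A1 1884"))  # -> ('P', 'Language and Literature')
```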

Top ten authors Nov 28 2010

Most intensive text analysis is done on heavily maintained sources. I’m using a mess, by contrast, but a much larger one. Partly, I’m doing this tendentiously–I think it’s important to realize that we can accept all the errors due to poor optical character recognition, occasional duplicate copies of works, and so on, and still get workable materials.

In addition to finding the similarities in use between particular isms, we can look at their similarities in general. Since we have the distances, it’s possible to create a dendrogram, which is a sort of family tree. Looking around the literary-studies text-analysis blogs, I see these done quite a few times to classify works by their vocabulary. I haven’t seen many done for words themselves, though; it turns out to work fairly well. I thought it might help answer Hank’s question about the difference between evolutionism and darwinism, but, as you’ll see, that distinction seems to be a little too fine for now.
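
Mechanically, the tree is simple to draw once the distances exist; here is a minimal sketch with scipy, using a small made-up distance matrix in place of the real one.

```python
# A minimal version of the dendrogram step: given a square distance matrix
# among words, run hierarchical clustering and draw the family tree. The
# distances below are made up for illustration.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

words = ["darwinism", "evolutionism", "socialism", "capitalism"]
distances = np.array([
    [0.0, 0.2, 0.7, 0.8],
    [0.2, 0.0, 0.6, 0.7],
    [0.7, 0.6, 0.0, 0.3],
    [0.8, 0.7, 0.3, 0.0],
])

# squareform converts the square matrix to the condensed form linkage() expects.
Z = linkage(squareform(distances), method="average")
dendrogram(Z, labels=words)
plt.show()
```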

What can we do with this information we’ve gathered about unexpected occurrences? The most obvious thing is simply to look at what words appear most often with other ones. We can do this for any ism given the data I’ve gathered. Hank asked earlier in the comments about the difference between “Darwinism” and evolutionism, so:

Abraham Lincoln invented Thanksgiving. And I suppose this might be a good way to prove to more literal-minded students that the Victorian invention of tradition really happened. Other than that, I don’t know what this means.

This is the second post on ways to measure connections—or more precisely, distance—between words by looking at how often they appear together in books. These posts are a little dry, and the payoff doesn’t come for a while, so let me state the payoff up front (after which you can bail on this post). I’m trying to create some simple methods that will work well with historical texts to see relations between words—what words are used in similar semantic contexts, what groups of words tend to appear together. First I’ll apply them to the isms, and then we’ll put them in the toolbox to use for later analysis.
I said earlier I would break up the sentence “How often, compared to what we would expect, does a given word appear with any other given word?” into different components. Now let’s look at the central, and maybe most important, part of the question—how often do we expect words to appear together?
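
The arithmetic behind that expectation is worth spelling out. If two words were scattered independently across books, the expected number of books containing both would be the size of the corpus times the product of the two words’ rates. The sketch below runs the calculation on made-up counts:

```python
# The "expected" half of the question: if two words were spread independently
# across books, how many books would contain both? All counts here are
# hypothetical, purely to show the arithmetic.

n_books = 30_000                                       # size of the corpus
books_with = {"darwinism": 900, "evolution": 6_000}    # books containing each word

p_darwinism = books_with["darwinism"] / n_books
p_evolution = books_with["evolution"] / n_books

# Under independence, the chance a book contains both words is the product
# of the two rates.
expected_together = n_books * p_darwinism * p_evolution
print(f"Expected books with both: {expected_together:.0f}")       # 180

# Comparing an observed count against the expectation gives an
# over- or under-representation ratio.
observed_together = 400                                # hypothetical observation
print(f"Observed / expected: {observed_together / expected_together:.2f}")
```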

Links between words Nov 23 2010

Ties between words are one of the most important things computers can tell us about language. I already looked at one way of measuring connections between words when talking about the phrase “scientific method”--the percentage of occurrences of a word that occur with another phrase. I’ve been trying a different tack, however, in looking at the interrelations among the isms. The whole thing has been so complicated–I never posted anything from Russia because I couldn’t get the whole system in order in my time here. So instead, I want to take a couple of posts to break down a simple sentence and think about how we could statistically measure each component. Here’s the sentence:

More on Grafton Nov 18 2010

One more note on that Grafton quote, which I’ll post below.

“The digital humanities do fantastic things,” said the eminent Princeton historian Anthony Grafton. “I’m a believer in quantification. But I don’t believe quantification can do everything. So much of humanistic scholarship is about interpretation.”

Moscow and NyTimes Nov 17 2010

I’m in Moscow now. I still have a few things to post from my layover, but there will be considerably lower volume through Thanksgiving.

I don’t want to comment too much on yesterday’s (today’s? I can’t tell anymore) article about digital humanities in the New York Times, but a couple of people e-mailed about it. So a couple of random points:

Lumpy words Nov 17 2010


What are the most distinctive words in the nineteenth century? That’s an impossible question, of course. But as I started to say in my first post about bookcounts, [link] we can find something useful–the words that are most concentrated in specific texts. Some words appear at about the same rate in all books, while some are more highly concentrated in particular books. And historically, the words that are more highly concentrated may be more specific in their meanings–at the very least, they might help us to analyze genre or other forms of contextual distribution.
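
One crude way to put a number on that kind of concentration (not necessarily the measure I settle on) is to compare words with similar total counts and see how many books each appears in; the counts below are invented for illustration.

```python
# A crude concentration measure: among words with similar total counts, the
# one squeezed into fewer books has more occurrences per containing book.
# All counts below are hypothetical.

stats = {
    # word: (total occurrences, number of books containing it)
    "january":  (50_000, 20_000),   # spread thinly across many books
    "polygamy": (50_000, 1_500),    # concentrated in a few books
}

for word, (wordcount, bookcount) in stats.items():
    print(f"{word:10s} {wordcount / bookcount:6.1f} occurrences per containing book")
```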

Isms and ists Nov 15 2010

Hank asked for a couple of charts in the comments, so I thought I’d oblige. Since I’m starting to feel they’re better at tracking the permeation of concepts, we’ll use appearances per 1000 books as the y axis:

I’m going to keep looking at the list of isms, because a) they’re fun; and b) the methods we use on them can be used on any group of words–for example, ones that we find are highly tied to evolution. So, let’s use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?

Here’s a fun way of using this dataset to convey a lot of historical information. I took all the 414 words that end in ism in my database, and plotted them by the year in which they peaked,* with the size proportional to their use at peak. I’m going to think about how to make it flashier, but it’s pretty interesting as it is. Sample below, and full chart after the break.
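
The mechanics are simple enough to sketch, with tiny made-up series standing in for the real yearly counts: find each word’s peak year and plot it there, sized by the rate at that peak.

```python
# A sketch of the peak-year chart: for each ism, find the year its usage
# peaked and plot it there, with marker area tracking the rate at the peak.
# The series below are tiny hypothetical stand-ins for the real counts.
import matplotlib.pyplot as plt

# word -> {year: appearances per 1000 books}   (hypothetical data)
isms = {
    "transcendentalism": {1840: 3.0, 1850: 2.0, 1900: 0.5},
    "darwinism":         {1860: 0.2, 1880: 4.0, 1900: 2.5},
    "imperialism":       {1880: 0.5, 1900: 6.0, 1910: 4.0},
}

for i, (word, series) in enumerate(isms.items()):
    peak_year = max(series, key=series.get)
    peak_rate = series[peak_year]
    plt.scatter(peak_year, i, s=peak_rate * 60)
    plt.annotate(word, (peak_year, i), xytext=(5, 5), textcoords="offset points")

plt.yticks([])
plt.xlabel("Year of peak usage")
plt.show()
```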

Infrastructure Nov 13 2010

It’s time for another bookkeeping post. Read below if you want to know about changes I’m making and contemplating to software and data structures, which I ought to put in public somewhere. Henry posted questions in the comments earlier about whether we use Princeton’s supercomputer time, and why I didn’t just create a text scatter chart for evolution like the one I made for scientific method. This answers those questions. It also explains why I continue to drag my feet on letting us segment counts by some sort of genre, which would be very useful.

Back to Darwin Nov 13 2010

Henry asks in the comments whether the decline in evolutionary thought in the 1890s is the “‘Eclipse of Darwinism,’ rise or prominence of neo-Lamarckians and saltationism and kooky discussions of hereditary mechanisms?” Let’s take a look, with our new and improved data (and better charts, too, compared to earlier in the week–any suggestions on design?). First, three words very closely tied to the theory of natural selection.

All right, let’s put this machine into action. A lot of digital humanities is about visualization, which has its place in teaching (something Jamie asked for more about). Before I do that, though, I want to show some more about how this can be a research tool. Henry asked about the history of the term ‘scientific method.’ I assume he was asking for a chart showing its usage over time, but I already have, with the data in hand, a lot of other interesting displays we can use. This post is a sort of catalog of some of the low-hanging fruit in text analysis.

Bookcounts are in Nov 11 2010

I now have counts for the number of books a word appears in, as well as the number of times it appears. Just as I hoped, this gives a new perspective on a lot of the questions we have looked at already. That telephone-telegraph-railroad chart, in particular, has a lot of interesting differences. But before I get into that, probably later today, I want to step back and think about what we can learn from the contrast between wordcounts and bookcounts. (I’m just going to call them bookcounts–I hope that’s a clear enough phrase.)
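
The distinction is easy to compute in a single pass; here is a minimal sketch over a hypothetical directory of plain-text books, tallying both counts at once.

```python
# A minimal sketch of the wordcount/bookcount distinction: one pass over a
# hypothetical directory of plain-text books, tallying the total occurrences
# of each word and the number of distinct books that contain it.
import glob
import re
from collections import Counter

wordcounts = Counter()   # total occurrences across the corpus
bookcounts = Counter()   # number of books containing the word at least once

for path in glob.glob("books/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        words = re.findall(r"[a-z]+", f.read().lower())
    wordcounts.update(words)
    bookcounts.update(set(words))

for w in ("telephone", "telegraph", "railroad"):
    print(f"{w}: {wordcounts[w]} occurrences in {bookcounts[w]} books")
```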

Obviously, I like charts. But I’ve periodically been presenting data as a number of random samples as well. It’s a technique that can be important for digital humanities analysis, and one that draws more on the skills of humanistic training, so it might help make this sort of work more appealing. In the sciences, an individual data point often has very little meaning on its own–it’s just a set of coordinates. Even in the big education datasets I used to work with, the core facts I was aggregating up from were generally very dull–one university awarded three degrees in criminal science in 1984, one faculty member earned $55,000 a year. But with language, there’s real meaning embodied in every point, and we’re far better equipped than the computer to understand it. The main point of text processing is to act as a sort of extraordinarily stupid and extraordinarily perseverant research assistant, who can bring patterns to our attention but is terrible at telling which patterns are really important. We can’t read everything ourselves, but it’s good to check up periodically–that’s why I do things like see what sort of words are the 300,000th in the language, or what 20 random book titles from the sample look like.

Here’s what googling that question will tell you: about 400,000 words in the big dictionaries (OED, Webster’s); but including technical vocabulary, a million, give or take a few hundred thousand. But for my poor computer, that’s too many, for reasons too technical to go into here. Suffice it to say that I’m asking this question for mundane reasons, but the answer is kind of interesting anyway. No real historical questions in this post, though–I’ll put the only big thought I have about it in another post later tonight.

I can’t resist making a few more comments on that technologies graph that I laid out. I’m going to add a few thousand more books to the counts overnight, so I won’t make any new charts until tomorrow, but look at this one again.

An anonymous correspondent says:

You mention in the post about evolution & efficiency that “Offhand, the evolution curve looks more the ones I see for technologies, while the efficiency curve resembles news events.”

That’s a very interesting observation, and possibly a very important one if it’s original to you and can be substantiated. Do you have an example of a tech vs news event graph? Something like lightbulbs or batteries vs the Spanish-American War might provide a good test case.

Also, do you think there might be changes in how these graphs play out over a century? That is, do news events remain separate from tech stuff? Tech changes these days are often news events themselves, and distributed similarly across media.

I think another way to put the tech vs news event could be in terms of the kind of event it is: structural change vs superficial, mid-range event vs short-term.

Anyhow, a very interesting idea, of using the visual pattern to recognize and characterize a change. While I think your emphasis on the teaching angle (rather than research) is spot on, this could be one application of these techniques where it’d be more useful in research.

Back to Basics Nov 08 2010

I’ve rushed straight into applications here without taking much time to look at the data I’m working with. So let me take a minute to describe the set and how I’m trimming it.

The Internet Archive has copies of some unspecified percentage of the public-domain books for which Google Books has released PDFs. They have done OCR (their own, I think, not Google’s) for most of them. The metadata isn’t great, but it’s usable–the same goes for the OCR. In total, they have 900,000 books from the Google collection–Dan Cohen claims to have 1.2 million from English publishers alone, so we’re looking at some sort of a sample. The physical books are from major research libraries–Harvard, Michigan, etc.

Collocation Nov 07 2010

A collection as large as the Internet Archive’s OCR database means I have to think through what I want well in advance of doing it. I’m only using a small subset of their 900,000 Google-scanned books, but that’s still 16 gigabytes–it takes a couple hours just to get my baseline count of the 200,000 most common words. I could probably improve a lot of my search time through some more sophisticated database management, but I’ll still have to figure out what sort of relations are worth looking for. So what are some?

Taylor vs. Darwin Nov 07 2010

Let’s start with just some of the basic wordcount results. Dan Cohen posted some similar things for the Victorian period on his blog, and used the numbers mostly to test hypotheses about change over time. I can give you a lot more like that (I confirmed for someone, though not as neatly as he’d probably like, that ‘business’ became a much more prevalent word through the 19C). But as Cohen implies, such charts can be cooler than they are illuminating.

Intro Nov 06 2010

I’m going to start using this blog to work through some issues in finding useful applications for digital history. (Interesting applications? Applications at all?)

Right now, that means trying to figure out how to use large amounts of textual data to draw conclusions or refine questions. I currently have the Internet Archive’s OCRed text files for about 30,000 books by large American publishers from 1830 to 1920. I’ve done this partly to help with my own research, and partly to try a different way of thinking about history and the texts we read.

If only it were that easy.
(Norfolk Journal and Guide, 1953)