You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Genres in Motion

Feb 22 2011

Here's an animation of the PCA numbers I've been exploring this last week.

There's quite a bit of data built in here, and just what it means is up for grabs. But it shows some interesting possibilities. As a reminder: at the end of my first post on categorizing genres, I arranged all the genres in the Library of Congress Classification in two-dimensional space using the first two principal components. PCA basically finds the combinations of variables that most define the differences within a group. (Read more by me here or generally here.) The first dimension roughly corresponded to science vs. non-science; the second separated social science from the humanities. It did, I think, a pretty good job at showing which fields were close to each other. But since I do history, I wanted to know: do those relations change? Here's that same data, but arranged to show how those positions shift over time. I made this along the same lines as the great Rosling/Gapminder bubble charts, created with this via this. To get it started, I'm highlighting psychology.
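For readers who want to see the mechanics, here is a minimal sketch of placing genres in two dimensions with the first two principal components. It is not the code behind the charts: the toy matrix, the genre labels, and the function name `first_two_components` are all made up for illustration, and it assumes each row is one genre and each column the relative frequency of one word.

```python
import numpy as np

def first_two_components(freqs):
    """Project the rows of a (genres x words) matrix onto its first two
    principal components. Hypothetical helper, for illustration only."""
    # Center each word column so PCA measures differences between genres.
    centered = freqs - freqs.mean(axis=0)
    # The right singular vectors of the centered matrix are the principal axes.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:2].T            # one (x, y) position per genre
    explained = s ** 2 / np.sum(s ** 2)     # share of variance per component
    return scores, explained[:2]

# Toy data (invented): four genres, four words.
genres = ["chemistry", "physics", "literature", "law"]
freqs = np.array([
    [0.9, 0.8, 0.1, 0.2],
    [0.8, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.9, 0.3],
    [0.2, 0.1, 0.3, 0.9],
])

scores, explained = first_two_components(freqs)
for name, (pc1, pc2) in zip(genres, scores):
    print(f"{name:12s} PC1={pc1:+.3f} PC2={pc2:+.3f}")
```

The sign of each component is arbitrary, so which end of an axis counts as "science" depends on the run; only the relative positions of the genres carry meaning.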
[Embedded interactive chart: you need iframes and JavaScript to display this content.]

[If this doesn't load, you can click through to the file here]. What in the world does this mean?
That's partly up to you to decide, but here's a rough guide to reading it. [Note: this doesn't seem to work in Google Reader; you may need to head to the web site to see the graphics.] The chart begins in 1863 for psychology. (That's the year at the bottom: it goes back to 1861 for some fields.) At that point, psychology is closest to the general philosophy/psychology LC classification B; its word-use patterns also align it with education, social pathology, and American history. You can look at the general space to get a sense of what PCA regards as the basic characteristics to separate on. The first principal component, for example, runs from south to north. All of the high scorers are science and technology disciplines; chemistry is the most distinctive from other subclasses, and biology the least. In the south are the social sciences and humanities; three religion classes are the least "scientific" works in the sample. (Scare quotes necessary because PCA word-choice differs from semantics in important ways.) The second component runs from right to left, and separates out the humanities (literature above all) from the law and the emerging social sciences at the bottom; it also finds a set of words that create a similar split in the north from medicine to manufacturing. I like to think of that axis as being from "personal" to "social" in some way, but that's probably an exaggeration. (Although there are nice touches to support it: for example, Public Aspects of Medicine is the most "social" of the medical fields, and engineering the most "social" of the sciences.)

So that's the landscape. Press play to see how it shifts over the period 1861-1922, if you haven't yet. You'll see psychology track steadily [*See important disclaimer about smoothing at the bottom] towards medicine in the upper left until about 1879. (Which is the date of Wundt's lab, traditionally the origin of scientific psychology. Spooky!) The dots are quite small: hover over, and you'll see it only has a few hundred thousand words a year. That means only a couple books annually at first (rule of thumb: 100,000 words = one or two books), and none in a few. Once the field actually gets established, it settles into a relatively constrained area that it occupies until 1922. The size of the dots grows, indicating that more books are published in psychology.

What other fields might be interesting? I like LB, theory and practice of education, which makes a late dash towards the social sciences after crossing paths with psychology. Rather than go through them all myself, I'm just throwing it out there. Google did a great job making the charts fully interactive; you can change any aspect of them: turn on and off trails, highlight other groups, change the color classification scheme to work by LC class instead of my made-up higher classes, etc. Click away, something's bound to happen. I've included a few additional principal components to substitute out for the axes, though I don't pretend to know just what they mean. In the psychology graph, for instance, click on the axis name to switch out the second principal component separating years for the fourth, and you'll see a metric that finds a logic in the B classification I'm somewhat blind to.

One thing that I find particularly interesting is the possibility of using a completely different set of criteria besides the PCA weights to chart on. I described earlier a separate PCA analysis that found axes of change over time: I've put that chart in here, too. You can change the options on the above chart to generate it in the window above, but let me give you another example.

[Embedded interactive chart: you need iframes and JavaScript to display this content.]

[Original version hosted here.]

The essential point to this chart is that all the classes drift to the right over time. That's because the left is finding language typical of the 1850s, and the right language typical of the 1910s. But within that overall drift, there is variation. Some genres are ahead of the times, and some behind; some travel against the current. I highlighted three interesting ones here: Chemistry, US Law, and US History. They show different possible paths. Chemistry begins ahead of the pack in 1861: it is an advanced science, it doesn't republish old books much, etc. [See the caveat at the bottom about re-publication.] But as time goes on, its lead erodes. After about 1908, the language of chemistry is as modern as it will be before 1920. Engineering fields are using more modern language, physics catches up from well behind, and so forth. A lot of this probably reflects the percentage of reprinted books, but some of it is also about what vocabularies have currency.

The law/history example shows something similar. Law moves back and forth, well behind the times in its vocabulary, while American history moves more steadily forward, with a number of notable pauses: in the '60s/'70s it moves backwards, and in the '90s it stays still long enough for law to catch up for a moment. Again, I'm not sure what's driving it (I'd need to build in a bit more processing to see what sort of vocabulary is causing it to take the course it does), but it's suggestive.

If you want to play around more, here is a wider version of the charts for free clicking that I'll keep up for a little while. These charts are really great for exploration: they let you change the axes and the arbitrary categories I use for colors, they let you look at moving bar graphs of the shifting distributions over time, etc. In addition to the cool animations, they allow a lot more exploration of somewhat complex data sets than flat files do.

I said this is "suggestive." That's a classic weasel word and raises the question: What's the point here? I admit I mostly just wanted to see what this would look like, and I'm going to stop with all the PCA for a bit. This is about as far as I can get on my own in a week while teaching. Still, let me think for a minute about what it's good for.

The point I want to make is not necessarily that this particular PCA weighting gives us vast new insights, although there might be some that are good for spurring critical thinking. (The lack of any separation at all between social sciences/humanities along the axis that parcels off science is quite interesting, for example.) Certainly it's not that these particular weightings on a relatively small fraction of books are an end in themselves.

Rather, it's another demonstration of the sort of objects and movements we can study now that weren't possible even five years ago. Historians often want to write about Big Topics, but feel that they can't do so responsibly. Instead, we hunker down into particular archives, study small movements, etc. We often want to write about concepts, discourses, genres as subjects, but find ourselves forced to write about individuals instead because of the constraints of what we can read.

There's a route into the big questions by aggregation: we can really make genres and other big groups our subjects. (That the dots breathe in and out certainly helps further the illusion.) We get to talk about structures, relations between genres and between discourses while still having a way to keep our feet on the ground. Or maybe to plant our feet on the ceiling for a change: we work down from the aggregates to find the individual cases, instead of vice versa. I think there are better ways to talk about, say, the scientization of psychology than these charts: as Allen's been trying to convince me, topic modeling is probably one.

I like this the most, then, as a sort of demonstration of the new perspectives that statistical techniques can offer on our sources. We can analyze in new ways, now, the big networks of terms and words and discourses all the humanists got so excited about in the '80s. The structure of ideas and the flow of knowledge on the largest levels is more accessible than it's ever been.

I recently had an interesting talk with Erez Lieberman-Aiden, one of the authors of the culturomics paper in Science and creator of ngrams, that I'm still mulling over. He thinks one of the big results of data will be in allowing a new sort of research in the humanities that focuses on falsifiability, reproducible claims, and so forth. If that does happen, though, it will at the same time open up the field for much more traditional humanistic research. When we find new angles that take big shifts as facts principally to be explained, not just unearthed, we'll find more and more interesting things to write about even within the traditional framing of humanistic research.

~~~~~~~~~~

Fine Print: First, the components data is smoothed by a loess regression against year, span = .5, weighted by the number of books in each year. There simply aren't enough books in each year in my 45,000 book sample to use the raw data and get any sense of overall trends outside of the top 3 or 4 classes. Here's what it looks like: too much popcorn maker. I could do a decade-by-decade plot, but that's just a cruder form of smoothing. I figured I'd go the pretty way and keep the year results, but it's worth noting that the future is somewhat embedded in each year's point. The data on number of words in each genre is not smoothed, and it provides an important corrective to assigning too much importance to the beginning or end of a discipline's trail.
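To make the smoothing step concrete, here is a rough sketch of a loess-style smoother with a span of 0.5 and per-year weights. It is an illustration under assumptions, not the code used for the charts: it fits a local straight line with a tricube kernel (standard loess implementations often fit a local quadratic and differ in other details), and the function name `weighted_loess` and the sample numbers are invented.

```python
import numpy as np

def weighted_loess(x, y, weights, span=0.5):
    """Loess-style smoother: at each x, fit a weighted straight line to the
    nearest `span` fraction of points. Hypothetical helper; assumes the x
    values (years) are distinct."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(span * n)))      # points in each local window
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        bandwidth = np.sort(dist)[k - 1]    # distance to the k-th nearest point
        # Tricube kernel: weight falls to zero at the edge of the window.
        u = np.clip(dist / bandwidth, 0.0, 1.0)
        w = (1.0 - u ** 3) ** 3 * weights   # kernel weight times book count
        # Weighted least squares for a line centered on x[i].
        design = np.column_stack([np.ones(n), x - x[i]])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]                 # intercept = fitted value at x[i]
    return fitted

# Invented example: noisy component scores for one genre, 1861-1870.
years = np.arange(1861, 1871)
raw_scores = np.array([0.1, 0.9, 0.2, 0.4, 1.1, 0.5, 0.6, 1.4, 0.8, 0.9])
books = np.array([2, 1, 5, 8, 1, 12, 15, 2, 20, 25])  # books per year

smoothed = weighted_loess(years, raw_scores, books, span=0.5)
print(np.round(smoothed, 3))
```

Because each fitted value draws on years both before and after it, this kind of smoothing is exactly where "the future is somewhat embedded in each year's point" comes from; years with few books pull the curve only weakly.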

I dropped one subclass: AP (Periodicals). Periodicals aren't in any particular genre, and they aren't supposed to be in the Open Library to begin with. There are a few, though, and they come from very different fields in large numbers at increments of a few decades. The net result is that the dot sweeps around all across the screen, while not contributing to an understanding of genre relations. It's kind of neat to watch, though, and gives a little insight into how the smoothing works in the long stretches where it tries to find a point while there is no data. Check it out if you like.

Finally, it's worth mentioning again that my data includes reprints. Shakespeare, for example, may be an important part of the reason that British literature lags behind American. Open Library data makes it possible to use creation dates, but I'm not sure that data is so reliable that it's worth porting over to it. Plus, it says something important about fields if some continue to reprint lots of books from decades earlier while others do not.

Comments:

Anonymous - Feb 2, 2011

I'm getting a window size here too small to see the whole animation at once. Using Firefox. Would you recommend Chrome instead?

Ben - Feb 2, 2011

Ahah, I changed the setup and I think it might work better? I chose an unfortunately kludgy way to display this.

Anonymous - Feb 3, 2011

It works for me too now. Very nice. I'm intrigued by the way Languages and Lit seems to lead the parade in the second animation.

I know these are based on LoC classifications. In practice, does that turn out to be mostly fiction/poetry/drama or mostly critical nonfiction? In this period, I would guess the former.

This is pretty cool not just technically as a visualization, but I think the underlying result is also interesting. I've got some results that might be relevant, at least to the literary part of this. I think I'll have to write them up as a blog post.

Anonymous - Feb 3, 2011

I've looked at the bubbles more carefully, and it appears that all fiction may be filed under PZ, along with juvenile belles lettres. Am I getting that right?

Ben - Feb 3, 2011

So the standard is the LC classification descriptions, and past that I just have to scan titles. It looks to me like PZ is all fiction: here is a full list of every book title (no authors, alas) that I'm using with links to the IA pages. Ugly, but it can give you an idea. Let me know if you want any other genres to get an idea what's in them; that's easy to do.

Looking at the components, I think the key to PZ's lead is more casual language, in various ways: contractions, for example, really jump out.

Ben - Feb 3, 2011

I should say: PR and PS both have a good deal of fiction too. Here's the list for them. There's an artificial distinction between fiction and literature that's hard to nail down: is Huckleberry Finn literature, juvenile belles-lettres, or fiction? I've seen it categorized within my sample under at least two of them.

Anonymous - Feb 3, 2011

Thanks for those clarifications. I suspect you're right about "more casual language, in various ways." From a literary-historical perspective, though, it gets interesting to specify exactly what we mean by "casual." I'm working on a hypothesis about the changing relationship between oral and written language 1700-1900, and I'm about to try to find out whether literary texts turn out to be outliers on the trend curves I'm working with. This looks very much like a clue that they will be.

I suspect PS probably leads PR just because the volume of American prose overall is growing, making PS more forward looking in a sort of tautological way.

Great visualization, though; there's so much in there to think about.

Anonymous - Mar 5, 2011

Here's the promised post about one metric of "casualness" in English diction 1700-1900:
http://tedunderwood.wordpress.com/2011/03/17/a-selection-of-the-language-really-spoken-by-men/