You are looking at content from Sapping Attention, which was my primary blog from 2010 to 2015; I am republishing all items from there on this page, but for the foreseeable future you should be able to read them in their original form at sappingattention.blogspot.com. For current posts, see here.

Similar Trends in Words

Nov 15 2010

I'm going to keep looking at the list of isms, because a) they're fun; and b) the methods we use on them can be used on any group of words: for example, ones that we find are highly tied to evolution. So, let's use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?

Well, let's take a few things for granted. First, we're interested in relative trends: we want to compare all words, say, that get twice as popular during World War I, regardless of how popular they were before. It's easy to normalize for that, but we're still left with a mess of different curves. Below the jump is a very ugly chart, with the adjusted loess curves for the 300 most popular isms, where 1 marks a year in which a word is used at its average rate.
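The normalization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `normalize_to_average` and the sample numbers are made up, and the loess smoothing applied to the real curves is left out.

```python
from statistics import mean

def normalize_to_average(rates):
    """Rescale yearly usage rates so that 1.0 means 'an average year' for the word."""
    avg = mean(rates)
    if avg == 0:
        raise ValueError("word never appears; cannot normalize")
    return [r / avg for r in rates]

# Invented yearly rates for two hypothetical words of very different popularity
common_word = [200.0, 200.0, 400.0, 800.0]
rare_word = [2.0, 2.0, 4.0, 8.0]

# Both words quadruple over the period, so their normalized curves coincide
print(normalize_to_average(common_word))
print(normalize_to_average(rare_word))
```

After this rescaling, a word that is a hundred times more common than another but follows the same relative trend ends up with the same curve, which is the property the comparison needs.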

This chart is unimportant, so I won't try to make it pretty. You can see, though, that a lot of words have dramatic peaks after 1910, a surprising number have peaks around 1880 (that's probably some sort of data artifact I need to work out), and that otherwise a lot of words just meander around. Let's say we pick one of these words: unionism, say. I've already marked that above, as the thicker green line that stands out. What if we want to find other words with similar patterns: a spike in the 1890s, and then a larger one in the teens that temporarily dips during the Great War? (The Civil War spike, to do with saving the union, is probably too small to get picked up.)

There are a lot of ways to do this, but I'm going to try just one now. Intuitively, it works like this. Say we ranked each of the isms by how often they appear in 1914. That would give us a rank ordering with some information about the pattern, but not much. Now imagine (I'm not going to do this) a scatterplot with the 1914 occurrence rate on one axis, and the 1896 occurrence rate on the other. That would give us a little bit more information, and allow us to cluster some words together: those that don't appear much in either year, those that appear a lot in 1914 and a little in 1896, those that appear a lot in 1896 and a lot in 1914, etc. That would be more valuable. We could even add another year (say the base, 1830) and make a three-dimensional plot; you might be able to visualize that, though it's trickier.

But what would be best is if we could plot all the years at once, in 93 dimensions. You can't visualize it, but you can do it on a computer. Basically, we're seeing how far apart any given two words are at each year, and working all that information together into a single distance. If we do it for unionism, the closest words, with their normalized distance, are:

collectivism cosmopolitanism     medievalism    emotionalism 
       4.104188        5.958098        7.230171        7.385482

So, collectivism shows the most similar trend to unionism over time. That's great: it shows that semantically similar words show similar changes in use patterns, which is what we'd expect if this were working. The other words make less sense semantically, but that's OK too; there will just be coincidences. Here they are in one of the charts I've been making. (Which uses a slightly different kind of loess fitting, so the curves will look different.)

So far, so good. Those loess curves don't quite show it, but all of those have a mini-spike around 1895, and then a maxi-spike around the war, perhaps a little lower during it. (By the way, these are all using wordcounts, not bookcounts.)
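For anyone who wants to see the many-dimensional distance idea concretely, here is a minimal Python sketch. It assumes plain Euclidean distance between curves, which the post doesn't actually specify; `closest_words` here is a hypothetical stand-in for the function shown in the console output, and the curves are invented rather than real data.

```python
import math

def distance(curve_a, curve_b):
    """Euclidean distance between two curves, treating each year as one dimension."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must cover the same years")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)))

def closest_words(target, curves, n=4):
    """Return the n words whose curves lie nearest to the target's, nearest first."""
    others = [(distance(curves[target], curve), word)
              for word, curve in curves.items() if word != target]
    return [(word, d) for d, word in sorted(others)[:n]]

# Invented, already-normalized curves over five "years" (not real data)
curves = {
    "alphaism": [0.5, 0.5, 1.0, 2.0, 1.0],
    "betaism":  [0.5, 0.6, 1.0, 1.9, 1.0],
    "gammaism": [1.5, 1.5, 1.0, 0.5, 0.5],
    "deltaism": [1.0, 1.0, 1.0, 1.0, 1.0],
}

for word, d in closest_words("alphaism", curves, n=2):
    print(f"{word}: {d:.3f}")
```

With 93 years per word the lists simply get longer; nothing else changes. Euclidean distance weights every year equally, so a different metric (correlation, for instance) could rank the neighbors differently.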

An admission: I cheated in choosing unionism, because I knew it was closely tied to collectivism already. So let's see the closest linked words for a few other isms:

> closest.words(darwinism)
 evolutionism      nihilism dilettanteism        theism 
     6.209610      6.922990      7.293343      7.416601

> closest.words(protestantism)
catholicism   dogmatism    stoicism      truism 
   2.243458    2.679069    2.897427    2.912831

> closest.words(militarism)
     infantilism internationalism         feminism      prussianism 
        7.482263         7.557075         7.952049         9.253820

> closest.words(bimetallism)
    bohemianism       anarchism   particularism ecclesiasticism 
       7.685904        8.121062        9.964955       10.562600

> closest.words(monometallism)
bimetallism   anarchism  deafmutism bohemianism 
   10.74706    13.37249    13.78629    13.92116

Not so bad. A mix of close synonyms, suggestive combinations, and probably random coincidences. It doesn't always work like we'd hope (I thought bimetallism and monometallism would be much closer than they are, although that seems to be about monometallism having a quite unique curve), but it certainly works. (It's been pointed out that not everyone may know those words: they're shorthands for the gold or silver standard in the populist era.) Remember, we're not using any form of collocation here to see what words appear together; the only input is a smoothed version of the years in which each word appears. (It would probably find more closely related concepts if I didn't loess smooth first, but that would tell us less about trends and more about temporary spikes, which we probably don't want for this purpose.)

The two words that have the closest curves are, by far, heroism and patriotism. That's pretty interesting. The two least similar words, a much less useful insight, are phenomenism and bolshevism.
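Finding the most and least similar pair means comparing every word against every other word. A rough Python sketch of that search, again assuming Euclidean distance and using invented curves (`extreme_pairs` is a hypothetical name, not something from the original analysis):

```python
import math
from itertools import combinations

def extreme_pairs(curves):
    """Return (closest_pair, farthest_pair) among all pairs of words."""
    scored = [(math.dist(curves[a], curves[b]), a, b)
              for a, b in combinations(sorted(curves), 2)]
    closest = min(scored)
    farthest = max(scored)
    return closest[1:], farthest[1:]

# Invented, already-normalized curves over five "years" (not real data)
curves = {
    "alphaism": [0.5, 0.5, 1.0, 2.0, 1.0],
    "betaism":  [0.5, 0.6, 1.0, 1.9, 1.0],
    "gammaism": [1.5, 1.5, 1.0, 0.5, 0.5],
    "deltaism": [1.0, 1.0, 1.0, 1.0, 1.0],
}

closest, farthest = extreme_pairs(curves)
print("closest:", closest)
print("farthest:", farthest)
```

For 300 words that is 300 × 299 / 2 = 44,850 pairs, which is small enough to check exhaustively.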

What to do with this tool? Well, we can apply it inside other groups of words (the ones that we identified as interesting in their relations to evolution, say) to find groups that trend together around the time we're interested in. (We identified heredity, say, as one word that had a suggestive pattern; we could find the words that accompanied heredity in its rise, either in general or specifically in books about Darwin.)

It also may be interesting to compare the results from this sort of distance analysis to other groupings of words, like those that appear in the same books. That would let us do a better job at finding words that are actually semantically linked, rather than just words that appear in books from the same year.

~~~~
I have the MySQL database I mentioned earlier running, but a) it's too slow, and b) it has trouble with books the Internet Archive identifies in a certain way in their catalog. This has already been too much work, but I may keep at it. Database optimization and query optimization both seem to be as much art as science, which makes just throwing things together a pain. Is there someone at Princeton whose job it would be to answer my questions about database design?