You are looking at archived content from my "Bookworm" blog, an experiment that ran from 2014-2016. Not all content may work. For current posts, see here.

Hansard

Dec 15 2015

A first pass at understanding the potential of the Hansard corpus through a Bookworm browser.

I've split the native XML into individual speeches using its intrinsic speaker tag.
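As a rough illustration of that splitting step, here is a minimal Python sketch. The tag names (`debate`, `speech`, `speaker`, `p`) and the sample text are assumptions made up for the example; the real Hansard XML uses its own schema, and this is not the script that built the Bookworm.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tag and attribute names here are assumptions for
# illustration only, not the actual Hansard schema.
SAMPLE = """
<debate date="1950-03-01" house="commons">
  <speech><speaker>Mr. Churchill</speaker><p>I beg to move.</p></speech>
  <speech><speaker>Mr. Eden</speaker><p>I second the motion.</p><p>Briefly.</p></speech>
</debate>
"""

def split_speeches(xml_text):
    """Return one record per <speech> element, carrying speaker, house and date."""
    root = ET.fromstring(xml_text.strip())
    speeches = []
    for speech in root.iter("speech"):
        speaker = speech.findtext("speaker", default="").strip()
        # Join the text of every paragraph inside the speech.
        text = " ".join("".join(p.itertext()).strip() for p in speech.iter("p"))
        speeches.append({
            "speaker": speaker,
            "house": root.get("house"),
            "date": root.get("date"),
            "text": text,
        })
    return speeches

speeches = split_speeches(SAMPLE)
for s in speeches:
    print(s["speaker"], len(s["text"].split()))
```

Each record then becomes one "text" in the Bookworm, with speaker, house, and date as its metadata.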

A speech can be very short; on average, each one in the Hansard corpus is 225 words.

There are, though, a lot of them: 6.2 million overall. That gets us just over the billion-word mark for this set.
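A quick back-of-the-envelope check on those two figures, in Python. Both inputs are the rounded numbers quoted above, so the product is only a rough estimate of the corpus size.

```python
speeches = 6_200_000   # speeches in the corpus (rounded figure from the post)
mean_words = 225       # average words per speech (rounded figure from the post)

# Estimated corpus size implied by the two rounded figures.
total_words = speeches * mean_words
print(f"{total_words:,}")
```

Taken at face value, the rounded figures put the corpus comfortably past a billion words.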

Here's the number of speeches by year and by house of Parliament.

Number of speeches per year by house.

{   "database": "hansard",
    "plotType": "linechart",
    "search_limits": {
        "date_year":{"$gte":1776,"$lte":2020}
    },"words_collation":"Case_Sensitive",
    "aesthetic": {
        "x": "date_year",
        "y" : "TextCount","color":"house"
    }}

There are a few years where a lot of speeches, in the Commons in particular, seem to go missing. I think this is partly because the XML structure doesn't include ids for all speeches in all years.

A Bookworm browser for any individual term.

{   "database": "hansard",
    "plotType": "linechart",
    "search_limits": {"word":["Europe"],"house":["commons"],
        "date_year":{"$gte":1776,"$lte":2020}
    },"words_collation":"Case_Sensitive",
    "aesthetic": {
        "x": "date_year",
        "y" : "WordsPerMillion","color":"house"
    }}

The most interesting data in here is the individual speaker attributions.

Here are the 25 most frequent speakers in the corpus, with the x axis showing their years.

The top 25 speakers by year.

{   "database": "hansard",
    "plotType": "heatmap",
    "search_limits": {"speaker__id":{"$lte":25},
        "date_year":{"$gte":1776,"$lte":2020}
    },"words_collation":"Case_Sensitive",
    "aesthetic": {
        "x": "date_year",
        "y" : "speaker","color":"TextCount"
    }}

Although this is the most interesting portion of the corpus, it's also the one that would take the most work.

In the XML, a speaker is identified differently on first appearance in a day than on subsequent ones. Titles are used in preference to actual names. (Note that the numbers drop greatly for MR. CHURCHILL and Mr. Eden when they are in office.)

Better disambiguation might make it possible to do some interesting in-browser comparisons: how did the prime ministers' rhetoric change between the 19th and 20th centuries? How did Brown's speech change at various points? I would really like to have this information for the US Congress, where I know the personalities well enough to have some interesting questions.

Excluding most titles of office, here's a rank-ordered browser for any word by frequency (i.e., not by total number of uses) for the most common speakers in the set.

{
"database": "hansard",
"plotType": "barchart",
"method": "return_json",
"search_limits": {
"gender":["Male","Female"],
"word": ["we"],
"speaker__id": {
"$lte": 50
}
},
"aesthetic": {
"x": "WordsPerMillion",
"y": "speaker"
},
"counttype": ["WordsPerMillion"],
"groups": ["speaker"]
}
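
To make the difference between ranking by frequency and ranking by total uses concrete, here is a small Python sketch. The speaker labels and counts are invented for illustration, and `words_per_million` is a hypothetical helper that mirrors what the `WordsPerMillion` count type reports. It is not Bookworm's own code.

```python
def words_per_million(word_count, total_words):
    """Uses of a word per million words spoken by the same speaker."""
    return 1_000_000 * word_count / total_words

# Hypothetical data: speaker -> (uses of "we", total words spoken).
counts = {
    "Speaker A": (12_000, 2_000_000),
    "Speaker B": (9_000, 1_000_000),
    "Speaker C": (500, 40_000),
}

# Rank by raw number of uses, then by relative frequency.
by_total = sorted(counts, key=lambda s: counts[s][0], reverse=True)
by_rate = sorted(counts, key=lambda s: words_per_million(*counts[s]), reverse=True)

print("by total uses:", by_total)
print("by frequency: ", by_rate)
```

With these made-up numbers the two orderings are exact opposites: the most prolific speaker uses the word most often in absolute terms but least often per million words.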

Even more valuable would be party affiliations. I think this may be in the XML headers; if so, it may not be too difficult to tag by party (or even ideology, assuming British political scientists have vote data on a par with what Americans have). Name disambiguation may be a problem, though.

Gender

One thing we do have, easily, is gender. A quick regular expression can pull out whether a speaker is Mr, Ms, Mrs, or Madam, and it does a much better job of identifying gender than more standard methods.
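
A minimal sketch of what such a regular expression might look like in Python. The pattern and the `gender_from_speaker` helper are my illustration here, not the exact expression used to build the corpus; the "Male"/"Female" labels match the values used in the queries on this page.

```python
import re

# Match a leading title, with or without a trailing period, in any case.
# The word boundary keeps "Mr" from matching the start of a longer name.
TITLE_RE = re.compile(r"^\s*(Mrs|Mr|Ms|Madam)\b\.?", re.IGNORECASE)

GENDER_BY_TITLE = {
    "mr": "Male",
    "mrs": "Female",
    "ms": "Female",
    "madam": "Female",
}

def gender_from_speaker(name):
    """Return 'Male', 'Female', or None when the name carries no gendered title."""
    match = TITLE_RE.match(name)
    if match is None:
        return None
    return GENDER_BY_TITLE[match.group(1).lower()]

for name in ["MR. CHURCHILL", "Mrs. Thatcher", "Madam Speaker", "Lord Whateverhampton"]:
    print(name, "->", gender_from_speaker(name))
```

Speakers identified only by a peerage or an office come back as `None`, so they stay untagged rather than being guessed at.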

Here's the number of words (not speeches) by gender.

Percentage of words per year by gender, post-1940

{   "database": "hansard",
    "plotType": "linechart",
    "search_limits": {
        "date_year":{"$gte":1940,"$lte":2020}
    },"words_collation":"Case_Sensitive",
    "aesthetic": {
        "x": "date_year",
        "y" : "TextPercent",
        "color":"*gender"
    }}

A broad gender-to-gender comparison will get us nothing, but maybe just looking at differing language by gender since 1980 can produce something useful.

The chart is ugly. (I'm including the House of Lords, where everyone is Lord Whateverhampton and so not gender-tagged, just to make the x axis sit at zero.)

Differing use of language by gender since 1980.

{   "database": "hansard",
    "plotType": "pointchart",
    "search_limits": {"word":["we"],
    "date_year":{"$gte":1980,"$lte":2020},
    "gender":["Male","Female"]
    },"words_collation":"Case_Sensitive",
    "aesthetic": {
        "y": "house",
        "x" : "WordsPerMillion","color":"gender"
    }}