Posts Tagged ‘statistics’

Graphing Shakespeare

Monday, August 17th, 2015

Today I came across a lovely project from JSTOR & the Folger Library – a set of Shakespeare’s plays, each line annotated by the number of times it is cited/discussed by articles within JSTOR.

“This is awesome”, I thought, “I wonder what happens if you graph it?”

So, without further ado, here’s the “JSTOR citation intensity” for three arbitrarily selected plays:

Blue is the number of citations per line; red marks lines with no citations at all. In no particular order, a few things immediately jumped out at me –

  • basically no-one seems to care about the late middle – the end of Act 2 and the start of Act 3 – of A Midsummer Night’s Dream;
  • “… a tale / told by an idiot, full of sound and fury, / signifying nothing” (Macbeth, 5.5) is apparently more popular than anything else in these three plays;
  • Othello has far fewer “very popular” lines than the other two.

Macbeth has the most popular bits, and is also the most densely cited – only 25.1% of its lines were never cited, against 30.3% in Othello and 36.9% in A Midsummer Night’s Dream.

I have no idea if these are actually interesting thoughts – my academic engagement with Shakespeare more or less reached its high-water mark sixteen years ago! – but I liked them…


How to generate these numbers? Copy-paste the page into a blank text file (here called text), then use the following bash command to clean it all up –

grep "FTLN " text |       # keep lines carrying a Folger through-line number
sed 's/^.*FTLN/FTLN/g' |  # strip everything before the FTLN marker
cut -b 10- |              # drop the "FTLN nnnn" marker itself
sed 's/[A-Z]/ /g' |       # blank out capital letters (speaker names and the like)
cut -f 1 -d " " |         # keep the first space-delimited field – the citation count
sed 's/text//g' > numberedextracts   # drop stray "text" strings and save

Paste into a spreadsheet against a column numbered 1-4000 or so, and graph away…
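
If you’d rather skip the spreadsheet step, here’s a rough Python equivalent – a sketch only, assuming numberedextracts has ended up holding one citation count per play line, with blank or junk lines where a line was never cited:

import matplotlib.pyplot as plt

counts = []
with open("numberedextracts") as f:
    for line in f:
        s = line.strip()
        counts.append(int(s) if s.isdigit() else 0)  # blank/junk lines count as uncited

# the never-cited percentages quoted above
print(f"{sum(1 for c in counts if c == 0) / len(counts):.1%} of lines never cited")

plt.bar(range(1, len(counts) + 1), counts, width=1.0)
plt.xlabel("Play line number")
plt.ylabel("JSTOR citations")
plt.show()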

Canadian self-reported birthday data

Sunday, February 22nd, 2015

In the last post, we saw strong evidence for a “memorable date” bias in self-reported birthday information among British men born in the late 19th century. In short, they were disproportionately likely to think they were born on an “important day” such as Christmas.

It would be great to compare this with other sources. However, finding a suitable dataset is challenging. We need a sample covering a large number of men, over several years, which is unlikely to be cross-checked or drawn from official documentation such as birth certificates or parish registers, and which explicitly lists full birthdates (not just month or year).

WWI enlistment datasets are quite promising in this regard – lots of men, born about the same time, turning up and stating their details without much reason to bias individual dates. The main British records have (famously) long since burned, but the Australian and Canadian records survive. Unfortunately, the Australian index does not include dates of birth, but the Canadian index does (at least, when known). So, does it tell us anything?

The index is available as a 770 MB+ XML blob (oh, dear). Running this through xmllint produces a nicely formatted file with approximately 575,000 birthdays for 622,000 entries. It’s formatted in a way that implies there may be multiple birthdates listed for a single individual (presumably where sources contradict?), but I couldn’t spot any cases. There are also about ten thousand entries without a nicely formatted dd/mm/yyyy date; let’s omit those for now. Quick and dirty, but probably representative.
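
(If you’d rather avoid xmllint, a streaming parse in Python does the same job. A sketch only – the <DATE_OF_BIRTH> tag name here is a placeholder, so check it against the real schema:)

import re
import xml.etree.ElementTree as ET

DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")  # keep only well-formed dd/mm/yyyy

birthdates = []
# iterparse streams the file, so the 770 MB blob never has to fit in memory at once
for _, elem in ET.iterparse("cef_index.xml"):
    if elem.tag == "DATE_OF_BIRTH":  # placeholder tag name – check the actual schema
        text = (elem.text or "").strip()
        if DATE_RE.match(text):
            birthdates.append(text)
        elem.clear()  # free each element once processed

print(len(birthdates), "well-formed birthdates")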

And so…

There’s clearly a bit more seasonality here than in the British data (up in spring, down in winter), but also the same sort of unexpected one-day spikes and troughs. As this is quite rough, I haven’t corrected for seasonality, but we still see something interesting.

The highest days are: 25 December (1.96), 1 January (1.77), 17 March (1.56), 24 May (1.52), 1 May (1.38), 15 August (1.38), 12 July (1.36), 15 September (1.34), 15 March (1.3).

The lowest ten days are: 30 December (0.64), 30 January (0.74), 30 October (0.74), 30 July (0.75), 30 May (0.78), 13 November (0.78), 30 August (0.79), 26 November (0.80), 30 March (0.81), 12 December (0.81).
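
(The ranking itself is nothing clever – tallies per dd/mm divided by the daily mean, with no seasonality or leap-day correction, in keeping with the rough-and-ready approach. In Python terms, something like:)

from collections import Counter

# birthdates: the list of dd/mm/yyyy strings from the extraction sketch above
per_day = Counter(date[:5] for date in birthdates)  # tally by "dd/mm"
mean = sum(per_day.values()) / len(per_day)
ratios = {day: n / mean for day, n in per_day.items()}

ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
print("highest:", ranked[:10])
print("lowest:", ranked[-10:])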

The same strong pattern for “memorable days” that we saw with the UK is visible at the top of the list – Christmas, New Year, St. Patrick’s, Victoria Day, May Day, [nothing], 12 July, [nothing], [nothing].

Two of these are distinctively “Canadian” – both 24 May (the Queen’s birthday/Victoria Day) and 12 July (the Orange Order marches) are above average in the British data, but not as dramatically as they are here. Both appear to have been relatively more prominent in late-19th/early-20th century Canada than in the UK. Canada Day/Dominion Day (1 July) is above average but does not show up as sharply, possibly because it does not appear to have been widely celebrated until after WWI.

One new pattern is the appearance of the 15th of the month among the highest days. This was suggested as likely in the US life-insurance analysis, and I’m interested to see it showing up here. Another oddity is leap years – in the British data, 29 February was dramatically undercounted. In the Canadian data, it’s strongly overcounted – just not quite enough to make the list above. 28 February (1.28), 29 February (1.27) and 1 March (1.29) all look “memorable”. I don’t have an explanation for this, but it does suggest an interesting story.

Looking at the lowest days, we see the same pattern of 30/xx dates being very badly represented – seven of the ten lowest dates are the 30th of the month, and every one of them falls in a month with 31 days. This is exactly the same pattern we observed in the UK data, and I just don’t have any convincing explanation for it. The other three dates all fall in low-birthrate months.

So, in conclusion:

  • Both UK and Canadian data from WWI show a strong bias for people to self-report their birthday as a “memorable day”;
  • “Memorable” days are commonly a known and fixed festival, such as Christmas;
  • Overreporting of arbitrary numbers like the 15th of the month is more common in Canada (& possibly the US?) than in the UK;
  • The UK and Canadian samples seem to treat 29 February very differently – Canadians overreport, British people underreport;
  • There is a strong bias against reporting the 30th of the month, particularly in months with 31 days.

Thoughts (or additional data sources) welcome.

When do you think you were born?

Monday, February 16th, 2015

Back in the last post, we were looking at a sample of dates-of-birth in post-WWI Army records.

(To recap – this is a dataset covering every man who served in the British Army after 1921 and who had a date of birth in or before 1900. 371,716 records in total, from 1864 to 1900, strongly skewed towards the recent end.)

I’d suggested that there was an “echo” of 1914/15 false enlistment in there, but after a bit of work I’ve not been able to see it. However, it did throw up some other very interesting things. Here’s the graph of birthdays.

Two things immediately jump out. The first is that the graph, very gently, slopes upwards. The second is that there are some wild outliers.

The first one is quite simple to explain; this data is not a sample of men born in a given year, but rather those in the army a few decades later. The graph in the previous post shows a very strong skew towards younger ages, so for any given year we’d expect to find marginally more December births than January ones. I’ve normalised the data to reflect this – calculated what the expected value for any given day would be assuming a linear increase, then calculated the ratio of reported to expected births. [For 29 February, I quartered its expected value]
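
(In code, that normalisation looks something like the sketch below, assuming the tallies sit in a 366-long array with 1 January first and 29 February at index 59:)

import numpy as np

def normalise(counts):
    counts = np.asarray(counts, dtype=float)
    days = np.arange(len(counts))
    slope, intercept = np.polyfit(days, counts, 1)  # linear increase across the year
    expected = intercept + slope * days
    expected[59] /= 4  # 29 February only exists in one year in four
    return counts / expected  # ratio of reported to expected births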

There are hints at a seasonal pattern here, but not a very obvious one. January, February, October and November are below average, March and September above average, and the rest of the spring-summer is hard to pin down. (For quite an interesting discussion on “European” and “American” birth seasonality, see this Canadian paper)

The interesting bit is the outliers, which are apparent in both graphs.

The most overrepresented days are, in order of frequency, 1 January (1.8), 25 December (1.43), 17 March (1.33), 28 February (1.27), 14 February (1.22), 1 May (1.22), 11 November (1.19), 12 August (1.17), 2 February (1.15), and 10 October (1.15). Conversely, the most underrepresented days are 29 February (0.67 after adjustment), 30 July (0.75), 30 August (0.78), 30 January (0.81), 30 March (0.82), and 30 May (0.84).

Of the ten most common days, seven are significant festivals. In order: New Year’s Day, Christmas Day, St. Patrick’s Day, [nothing], Valentine’s Day, May Day, Martinmas, [nothing], Candlemas, [nothing].

Remember, the underlying bias of most data is that it tells you what people put into the system, not what really happened. So, what we have is a dataset of what a large sample of men born in late nineteenth century Britain thought their birthdays were, or of the way they pinned them down when asked by an official. “Born about Christmastime” easily becomes “born 25 December” when it has to go down on a form. (Another frequent artefact is overrepresentation of 1-xx or 15-xx dates, but I haven’t yet looked for this.) People were substantially more likely to remember a birthday as associated with a particular festival or event than they were to remember a random date.

It’s not all down to being memorable, of course; 1 January is probably in part a data recording artefact. I strongly suspect that at some point in the life of these records, someone’s said “record an unknown date as 1/1/xx”.

The lowest days are strange, though. 29 February is easily explained – even correcting for it being one quarter as common as other days, many people would probably put 28 February or 1 March on forms for simplicity. (This also explains some of the 28 February popularity above). But all of the other five are 30th of the month – and all are 30th of a 31-day month. I have no idea what might explain this. I would really, really love to hear suggestions.

One last, and possibly related, point – each month appears to have its own pattern. The first days of the month are overrepresented; the last days underrepresented. (The exceptions are December and possibly September.) This is visible in both normalised and raw data, and I’m completely lost as to what might cause it…
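
(To put numbers on that day-of-month effect, average the normalised ratios by day of the month – a quick sketch reusing the normalise() output from above:)

import calendar
import numpy as np

def by_day_of_month(ratios):
    # ratios: the 366-long output of normalise() above
    sums, tallies = np.zeros(32), np.zeros(32)
    i = 0
    for month in range(1, 13):
        for dom in range(1, calendar.monthrange(2000, month)[1] + 1):  # 2000: a leap year
            sums[dom] += ratios[i]
            tallies[dom] += 1
            i += 1
    return sums[1:] / tallies[1:]  # mean ratio for days 1 to 31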

Back to the Army again

Saturday, January 24th, 2015

In the winter of 1918-19, the British government found itself in something of a quandary. On the one hand, hurrah, the war was over! Everyone who had signed up to serve for “three years or the duration” could go home. And, goodness, did they want to go home.

On the other hand, the war… well it wasn’t really over. There were British troops fighting deep inside Russia; there were large garrisons sitting in western Germany (and other, less probable, places) in case the peace talks collapsed; there was unrest around the Empire and fears about Bolsheviks at home.

So they raised another army. Anyone in the army who volunteered to re-enlist got a cash payment of £20 to £50 (no small sum in 1919); two months’ leave on full pay; pay comparable to wartime rates; and a separation allowance if he was married. Demobilisation continued for everyone else (albeit slowly), and by 1921 this meant that everyone in the Army was either a very long-serving veteran, a new volunteer who had not been conscripted during wartime (so born 1901 onwards) or – I suspect the majority – a re-enlisted man working through his few years’ service.

For administrative convenience, all records of men who left up to 1921 were set aside and stored by a specific department; the “live” records, including more or less everyone who reenlisted, continued with the War Office. They were never transferred – and, unlike the pre-1921 records, they were not lost in a bombing raid in 1940.

The MoD has just released an interesting dataset following an FOI request – an index of these “live” service records, covering all men in the post-1921 records with a DoB prior to 1901. Almost everyone in the index would therefore have either remained in service or re-enlisted. There would be a small proportion of men born in 1900 who escaped conscription (roughly 13% of them would have turned 18 only after 11/11/18), and a small handful of men will have re-enlisted or transferred in much later; but otherwise, they all served in WWI and chose to remain, or to return very soon after being released.

So, what does this tell us? Well, for one thing, there’s almost 317,000 of them. 4,864 were called Smith, 3,328 Jones, 2,104 Brown, 1,172 Black, etc.; 12,085 were some form of Mac or Mc. And there are eight Singhs, which looks like an interesting early-immigration story to trace.

But, you know, data cries out to be graphed. So here’s the dates of birth.

Since the 1900 births probably overcount re-enlistments (see above), I’ve left them off.

It’s more or less what you’d expect, but on close examination a little story emerges. Look at 1889/90; there’s a real discontinuity here. Why would this be?

Pre-war army enlistments were not for ‘the duration’ (there was nothing to have a duration of!) but for seven years’ service and five in the reserves. There was a rider on this – if war broke out, you wouldn’t be discharged until the crisis was over. The men born in 1890 would have enlisted in 1908 and been due for release to the reserves in 1915. Of course, that never happened… and so, in 1919, many of these men would have been 29, knowing no other career than soldiering. Many would have been thrilled to get out – and quite a few more would have considered it, and realised they had no trade, and no great chance of good employment. As Kipling had it in 1894:

A man o’ four-an’-twenty what ‘asn’t learned of a trade—
Except “Reserve” agin’ him—’e’d better be never made.

It probably wasn’t much better for him in 1919.

Moving right a bit, 1896-97 also looks odd – this is the only point in the data where it goes backwards, with marginally more men born in 1896 than 1897. What happened here?

Anyone born before August 1896 was able to rush off and enlist at the start of the war; anyone born after that date would either have to wait, or lie. Does this reflect a distant echo of people giving false ages in 1914/15 and still having them on the paperwork at reenlistment? More research no doubt needed, but it’s an interesting thought.

Quality versus age of Wikipedia’s Featured Articles

Friday, April 16th, 2010

There’s been a brief flurry of interest on Wikipedia in this article, published last week:

Evaluating quality control of Wikipedia’s feature articles – David Lindsey.

…Out of the Wikipedia articles assessed, only 12 of 22 were found to pass Wikipedia’s own featured article criteria, indicating that Wikipedia’s process is ineffective. This finding suggests both that Wikipedia must take steps to improve its featured article process and that scholars interested in studying Wikipedia should be careful not to naively believe its assertions of quality.

A recurrent objection to this has been that Lindsey didn’t take account of the age of articles – partly because article quality can degrade over time (if an article began at a high level, the average later contribution is likely to be below the quality of what is already there), and partly because the relative stringency of what constitutes “featured” has changed over time.

The interesting thing is, this partly holds and partly doesn’t. The article helpfully “scored” the 22 articles reviewed on a reasonably arbitrary ten-point scale; the average was seven, which I’ve taken as the cut-off point for acceptability. If we graph quality against time – time being defined as the last time an article passed through the “featuring” process, either for the first time or as a review – then we get an interesting graph:

Here, I’ve divided them into two groups; blue dots are those with a rating greater than 7, and thus acceptable; red dots are those with a rating lower than 7, and so insufficient. It’s very apparent that these two cluster separately; if an article is good enough, then there is no relation between the current status and the time since it was featured. If, however, it is not good enough, then there is a very clear linear relationship between quality and time. The trendlines aren’t really needed to point this out, but I’ve included them anyway; note that they share a fairly similar origin point.
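
(If you want to reproduce the trendlines: split the articles at the 7-point cut-off and fit each group separately. A sketch only – the ages and ratings are whatever you read off Lindsey’s scoring:)

import numpy as np

def group_trends(days_since_featured, scores, cutoff=7.0):
    # days_since_featured: time since each article last passed the featuring process
    # scores: Lindsey's ten-point ratings
    days = np.asarray(days_since_featured, dtype=float)
    scores = np.asarray(scores, dtype=float)
    passing = scores >= cutoff
    # returns (slope, intercept) for the blue (passing) and red (failing) groups
    return (np.polyfit(days[passing], scores[passing], 1),
            np.polyfit(days[~passing], scores[~passing], 1))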

Two hypotheses could explain this clustering. Firstly, the quality when first featured varies sharply over time, but most older articles have since been brought up to “modern standards”. Secondly, the quality when first featured is broadly consistent over time, and most articles remain at that level, but some decay, and that decay is time-linked.

I am inclined towards the second. If it was the first, we would expect to see some older articles which were “partially saved” – say, one passed when the average scoring was three, and then “caught up” when the average scoring was five. This would skew the linearity of the red group, and make it more erratic – but, no, no sign of that. We also see that the low-quality group has no members older than about three years (1100 days); this is consistent with a sweeper review process which steadily goes through old articles looking for bad ones, and weeding out or improving the worst.

(The moral of the story? Always graph things. It is amazing what you spot by putting things on a graph.)

So what would this hypothesis tell us? Assuming our 22 are a reasonable sample – which can be disputed, but let’s grant it – the data is entirely consistent with all of them being of approximately the same quality when they first become featured; so we can forget about it being a flaw in the review process, it’s likely to be a flaw in the maintenance process.

Taking our dataset, the population of featured articles falls into two classes.

  • Type A – quality is consistent over time, even up to four years (!), and they comply with the standards we aim for when they’re first passed.
  • Type B – quality decays steadily with time, leaving the article well below FA status before even a year has passed.

For some reason, we are doing a bad job of maintaining the quality of about a third of our featured articles; why, and what distinguishes Type B from Type A? My first guess was user activity, but no – of those seven, in only one case has the user who nominated it effectively retired from the project.

Could it be contentiousness? Perhaps. I can see why Belarus and Alzheimer’s Disease may be contentious and fought-over articles – but why Tōru Takemitsu, a well-regarded Japanese composer? We have a decent-quality article on global warming, and you don’t get more contentious than that.

It could be timeliness – an article on a changing topic can be up-to-date in 2006 and horribly dated in 2009 – which would explain the problem with Alzheimer’s, but it doesn’t explain why some low-quality articles are on relatively timeless topics – Takemitsu or the California Gold Rush – and some high-quality ones are on up-to-date material such as climate change or the Indian economy.

There must be something linking this set, but I have to admit I don’t know what it is.

We would be well-served, I think, to take this article as having pointed up a serious problem of decay, and start looking at how we can address that, and how we can help maintain the quality of all these articles. Whilst the process for actually identifying a featured article at a specific point in time seems vindicated – I am actually surprised we’re not seeing more evidence of lower standards in the past – we’re definitely doing our readers a disservice if the articles rapidly drop below the standards we advertise them as holding.

Crime statistics

Wednesday, February 3rd, 2010

A couple of interesting blog posts on the BBC – part 1, part 2 – about a recent set of crime statistics publicised by the Conservatives.

The basic gist of the Conservative claim is that violent crime has vastly increased over the past decade; the basic problem is that the method of recording violent crime changed in the middle of the period, to a much more “permissive” approach, where police were obliged to record a complaint rather than dismissing it. Which, unsurprisingly, tends to lead to a lot more recorded crime, without actually saying anything about the underlying crime rates.

I suppose in an ideal world Labour would be running a campaign of “Do you really want to be governed by people who can’t read printed warnings on graphs?”, but sadly all we’ll get is a bit of he-said-she-said over the next two weeks, and a few more people will be left believing that the country is a far scarier place now than it ever was.

Demographics in Wikipedia

Friday, January 29th, 2010

There’s a lengthy internal debate going on in Wikipedia at the moment (see here, if you really want to look inside the sausage factory) about how best to deal with the perennial issue of biographies of living people, of which there are about 400,000.

As an incidental detail to this, people have been examining the issue from all sorts of angles. One particularly striking graph that’s been floating around shows the number of articles marked as being born or died in any given year from the past century:


[Graph by User:Carcharoth]

As the notes point out, we can see some interesting effects here. Firstly – and most obviously – is the “recentism”; people who are alive and active in the present era tend to be more likely to have articles written about them, so you get more very recent deaths than (say) people who died forty years ago. Likewise, you have a spike around the late 1970s / early 1980s of births of people who’re just coming to public attention – in other words, people in their early thirties or late twenties are more likely to have articles written about them.

If we look back with a longer-term perspective, we can see that the effects of what Wikipedia editors have chosen to write about diminish, and the effects of demographics become more obvious. There are, for example, suggestions of prominent blips in the deathrate during the First and Second World Wars, and what may be the post-war baby boom showing up in the late 1940s.

So, we can distinguish two effects; underlying demographics, and what people choose to write about.

(In case anyone is wondering: people younger than 25 drop off dramatically. The very youngest are less than a year old, and are invariably articles about a) heirs to a throne; b) notorious child-murder cases; c) particularly well-reported conjoined twins or other multiple births. By about the age of five you start getting a fair leavening of child actors and the odd prodigy.)

Someone then came up with this graph, which is the same dataset drawn from the French Wikipedia:


[Graph by User:Pymouss]

At a glance, they look quite similar, which tells us that the overall dynamic guiding article-writing is broadly the same in both cases. That may not sound like much of a finding, but different language editions can vary quite dramatically in things like standards for what constitutes a reasonable topic, so it is useful to note. French has a more pronounced set of spikes in WWI, WWII, and the post-war baby boom, though, as well as a very distinctive lowering of the birthrate during WWI. These are really quite interesting, especially the latter, because they suggest we’re seeing a different underlying dynamic. And the most likely underlying dynamic is, of course, that Francophones tend to prefer writing about Francophones, and Anglophones tend to prefer writing about Anglophones…

So, how does this compare in other languages? I took these two datasets, and then added Czech (which someone helpfully collected), German and Spanish. (The latter two mean we have four of the five biggest languages represented. I’d have liked to include Polish, but the data was not so easily accessible.) I then normalised it, so each year was a percentage of the average for that language for that century, and graphed them against each other:
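
(The normalisation step is trivial, for what it’s worth – each year’s tally as a percentage of that language’s average year:)

def normalise_century(counts_by_year):
    # counts_by_year: {year: number of biographies} for one language edition
    mean = sum(counts_by_year.values()) / len(counts_by_year)
    return {year: 100.0 * n / mean for year, n in counts_by_year.items()}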

What can we see from these? Overall, every project has basically the same approach to inclusion; ramping up steadily over time, a noticeable spike in people who died during WWII or in the past two decades, and a particular interest in people who are about thirty and in the public eye. There is one important exception to this last case – German, which has a flat birthrate from about 1940 onwards, and apparently no significant recentism in this regard. The same is true of Czech to a limited degree. (Anecdotally I believe the same may be true of Japanese, but I haven’t managed to gather the data yet)

The WWII death spike is remarkably prominent in German and Czech, moderately prominent in French, and apparent but less obvious in English and Spanish. This could be differential interest in military history, where biographies tend to have deaths clustered in wartime, but it also seems rational to assume this reflects something of the underlying language-biased data. More Central Europeans died in WWII than Western Europeans; proportionally fewer died in the Anglosphere because English-speaking civilian populations escaped the worst of it, and the Spanish-speaking world was mostly uninvolved. The deaths in WWI are a lot more tightly clustered, and it’s hard to determine anything for sure here.

The other obvious spike in deaths is very easy to understand from either interpretation of the reason; it’s in 1936, in Spanish, which coincides with the outbreak of the Civil War. Lots of people to write articles about, there, and people less likely to be noted outside of Spain itself.

I mentioned above that (older) birthrates are more likely to represent an underlying demographic reality than deathrates are; localised death rates could be altered by a set of editors who choose to write on specific themes. You’d only get a birthdate spike, it seems, if someone was explicitly choosing to write about people born in a specific period; it’s hard to imagine it from a historical perspective. Historically linked people are grouped by when they’re prominent and active, and that happens at a variable time in their lives, so someone specifically writing about a group of people is likely to “smear” out their birthdates in a wide distribution.

So, let’s look at the historic births graph and see if anything shows up there. German and French show very clear drops in the birth rate between 1914 and about 1920, round U-shaped falls. German appears to have a systemic advantage over the other projects in birthrate through the 1930s and 1940s, though as the data is normalised against an average this may be misleadingly inflated – it doesn’t have the post-1970 bulge most languages do. The very sharp drop in births in 1945 is definitely not an artefact, though; you can see it to a lesser degree in the other languages, except English, where it’s hardly outside normal variance.

So, there does seem to be a real effect here; both these phenomena seem predictable as real demographic events, and the difference between the languages is interpretable as different populations suffering different effects in these periods and being represented to different degrees in the selection of people by various projects.

The next step would be, I suppose, to compare these figures to known birth and death rates, both globally and regionally, over the period; this would let us estimate the various degrees of “parochialism” in the various projects’ coverage of people, as well as the varying degrees of “recentness” we’ve seen already. Any predictions?