Back in the last post, we were looking at a sample of dates-of-birth in post-WWI Army records.
(To recap: this is a dataset covering every man who served in the British Army after 1921 and who had a date of birth in or before 1900. There are 371,716 records in total, with birth years from 1864 to 1900, strongly skewed towards the recent end.)
I’d suggested that there was an “echo” of 1914/15 false enlistment in there, but after a bit of work I’ve not been able to see it. However, it did throw up some other very interesting things. Here’s the graph of birthdays.
Two things immediately jump out. The first is that the graph, very gently, slopes upwards. The second is that there are some wild outliers.
The first one is quite simple to explain: this data is not a sample of men born in a given year, but rather of those in the army a few decades later. The graph in the previous post shows a very strong skew towards younger ages, and within any given birth year the men born in December are slightly younger than those born in January, so we’d expect to find marginally more December births than January ones. I’ve normalised the data to reflect this: I calculated what the expected value for any given day would be assuming a linear increase across the year, then calculated the ratio of reported to expected births. (For 29 February, I quartered its expected value, since it only comes round one year in four.)
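For anyone wanting to try something similar, here is a minimal sketch of that normalisation in Python. It is not the code used to produce the figures in this post; it assumes the counts are held in a dict keyed by (month, day), fits the straight line by ordinary least squares over the 365 ordinary days, and the name normalise is purely illustrative.

```python
from datetime import date, timedelta

def normalise(counts):
    """Return reported/expected ratios for a dict of (month, day) -> count.

    Assumption: "expected" is a least-squares straight line through the
    365 ordinary days; 29 February gets a quarter of the line's value.
    """
    # Use a leap year as a template so that all 366 calendar days appear.
    start = date(2000, 1, 1)
    keys = []
    for n in range(366):
        d = start + timedelta(days=n)
        keys.append((d.month, d.day))
    leap_day = (2, 29)

    # Fit the line on ordinary days only, so 29 February does not drag it down.
    xs = [i for i, k in enumerate(keys) if k != leap_day]
    ys = [counts.get(keys[i], 0) for i in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x

    ratios = {}
    for i, k in enumerate(keys):
        expected = intercept + slope * i
        if k == leap_day:
            expected /= 4  # the day only occurs one year in four
        ratios[k] = counts.get(k, 0) / expected
    return ratios
```

A ratio of 1.0 means a day is exactly as common as the trend predicts; anything well above or below that is a candidate outlier.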
There are hints at a seasonal pattern here, but not a very obvious one. January, February, October and November are below average, March and September above average, and the rest of the spring-summer is hard to pin down. (For quite an interesting discussion on “European” and “American” birth seasonality, see this Canadian paper)
The interesting bit is the outliers, which are apparent in both graphs.
The most overrepresented days are, in descending order (with the ratio of reported to expected births in brackets), 1 January (1.8), 25 December (1.43), 17 March (1.33), 28 February (1.27), 14 February (1.22), 1 May (1.22), 11 November (1.19), 12 August (1.17), 2 February (1.15), and 10 October (1.15). Conversely, the most underrepresented days are 29 February (0.67 after adjustment), 30 July (0.75), 30 August (0.78), 30 January (0.81), 30 March (0.82), and 30 May (0.84).
Of the ten most common days, seven are significant festivals. In order: New Year’s Day, Christmas Day, St. Patrick’s Day, [nothing], Valentine’s Day, May Day, Martinmas, [nothing], Candlemas, [nothing].
Remember, the underlying bias of most data is that it tells you what people put into the system, not what really happened. So, what we have is a dataset of what a large sample of men born in late nineteenth century Britain thought their birthdays were, or of the way they pinned them down when asked by an official. “Born about Christmastime” easily becomes “born 25 December” when it has to go down on a form. (Another frequent artefact is overrepresentation of 1-xx or 15-xx dates, but I haven’t yet looked for this.) People were substantially more likely to remember a birthday as associated with a particular festival or event than they were to remember a random date.
It’s not all down to being memorable, of course; 1 January is probably in part a data recording artefact. I strongly suspect that at some point in the life of these records, someone’s said “record an unknown date as 1/1/xx”.
The lowest days are strange, though. 29 February is easily explained: even correcting for it being one quarter as common as other days, many people would probably put 28 February or 1 March on forms for simplicity. (This also explains some of the 28 February popularity above.) But all of the other five are the 30th of the month, and all are the 30th of a 31-day month. I have no idea what might explain this. I would really, really love to hear suggestions.
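One way to probe this would be to compare the 30th in 31-day months against the 30th in 30-day months. The sketch below is only an illustration: it assumes the normalised ratios are in a dict keyed by (month, day), and the function name thirtieth_by_month_length is made up for the purpose.

```python
import calendar

def thirtieth_by_month_length(ratios):
    """Average the ratio for the 30th, grouped by length of month (30 or 31)."""
    groups = {30: [], 31: []}
    for month in range(1, 13):
        # Month lengths from a non-leap year; February has no 30th and is skipped.
        length = calendar.monthrange(2001, month)[1]
        if length in groups and (month, 30) in ratios:
            groups[length].append(ratios[(month, 30)])
    # Only report groups that actually contain data.
    return {length: sum(v) / len(v) for length, v in groups.items() if v}
```

Fed only the five ratios quoted above, this gives about 0.80 for the 31-day group; the interesting question is whether the 30-day group comes out noticeably closer to 1.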
One last, and possibly related, point: each month appears to have its own pattern. The first days of the month are overrepresented; the last days underrepresented. (The exceptions are December and possibly September.) This is visible in both normalised and raw data, and I’m completely lost as to what might cause it…
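A quick way to look at this within-month shape is to average the ratios by day of the month across all twelve months. This is a rough sketch rather than the analysis behind the graphs here; it assumes a dict of ratios keyed by (month, day), and day_of_month_profile is an illustrative name. 29 February is left out by default because its ratio depends on the quartering adjustment.

```python
from collections import defaultdict

def day_of_month_profile(ratios, skip=((2, 29),)):
    """Return the mean ratio for each day of the month (1 to 31)."""
    buckets = defaultdict(list)
    for (month, day), ratio in ratios.items():
        if (month, day) in skip:
            continue  # leave out days whose ratio is not directly comparable
        buckets[day].append(ratio)
    # Days 29 to 31 are averaged over fewer months than days 1 to 28.
    return {day: sum(v) / len(v) for day, v in sorted(buckets.items())}
```

If the pattern is real, the profile should fall from the start of the month to the end, and any heaping on the 1st or 15th should show up as spikes on those two days.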