Gender and BLPs on Wikipedia, redux

Back in 2019 I generated some data on how the English Wikipedia’s biographies of living people (BLPs) broke down by gender, and how that intersected with creation and deletion rates. The headline figures were that:

  • There was a significant difference between the gender split for all biographies (17.8% women) and for biographies of living people (BLPs) (22.7% women).
  • In 2009, around 20% of existing BLPs were on women. As time went on, the share of BLPs on women increased slowly, by perhaps a quarter of a percentage point per year.
  • In 2009, around 20% of newly created BLPs were on women. In about 2012, this kicked up a gear, rising above the long-term average – first to about 25%, then peaking around 33% before falling back a little.
  • BLP articles on women were more likely to be nominated for deletion until about 2017, when the effect disappeared.

One thing that was raised during the subsequent discussion was that a lot of the skew by gender was potentially linked to subject area – I was able to identify that for athletes (defined broadly: all sports players) the articles were much more likely to be on men. I didn’t investigate this much further at the time, though. Last week I was reminded about it, and I’ve been looking at the numbers again. This brought up two interesting divergences.

Please don’t make me read all this

Okay – but you’ll miss the graphs. In summary:

English Wikipedia has more women in recent cohorts (around 25% of living people born since the 1970s), and its athletes are overwhelmingly men. Since athletes make up a staggeringly high share of the articles on younger subjects, the gender split among non-athletes is much more balanced – a little under a third overall, but breaking 50% female among the younger cohorts.

Still with me? Let’s start. Sorry about the spoilers.

Time and tide

The first phenomenon is very straightforward: while the overall percentage across all people is around 25% women, how that is distributed over time varies. In general, there is a steady rise until about the 1970s; for those born from the 1970s onwards, the generation who are currently in their active working lives, the level is relatively stable at around 25% women.

The exception is those born in the 1920s (where it sits at 26%) – presumably a survivorship effect: at this point female life expectancy is significantly higher than male life expectancy, and so the proportion of women begins to rise as a result.

One surprising outcome, however, is that the share of living people with no recorded age (green) is much more female than the average. This is a large cohort – there are in fact slightly more women in it than in any individual decade. I believe that it skews young – in other words, were this information known, it would increase the share of women in recent decades – but it is hard to find a way to confirm this. This issue is discussed in more detail below.

(Those born in the 2010s/20s and in the 1900s/10s are omitted – the four groups have a total of 175 articles, while the cohorts shown range from 5,000 to 170,000 – but the levels are around 50%. This is likely due to life expectancy in the oldest cohorts, and the fact that the people in the youngest cohorts are mostly notable at this point as being “the child of someone famous” – which you would broadly expect to be independent of gender.)

The percentages shown here are of the total male + female articles, but it is also possible to calculate the share of people whose recorded gender is neither male nor female. That share shows a very striking rise over time, though it should be cautioned that the absolute numbers are small – the largest single cohort is the 1980s, with 345 people out of 170,000.

Sports by the numbers

The original question was what effect athlete articles have on the overall totals. The answer turns out to be… very striking.

A few things are immediately apparent. The first is that the share of athletes is very substantial – they account for only around a quarter of people born in the 1950s, but 85-90% of people born in the 1990s/2000s.

The second is that those athletes are overwhelmingly men – among the 1950s cohort, only about 10% of athletes are female, and even in the most recent cohorts the figure is only around 20%. This means that if we look purely at the non-athlete articles, the gender split becomes a lot more balanced.

Across all articles, it is around 32% female. But among living non-athletes, born since 1990, the gender balance is over 50% female.

This is a really amazing figure. I don’t think I ever particularly expected to see a gender analysis on Wikipedia that would break 50%. Granted, the absolute numbers involved are low – as is apparent from the previous graph, “non-athletes born in the 1990s” is around 22,000 people, and “born in the 2000s” is as low as 2,500 – but it’s a pretty solid trend and the total numbers for the earlier decades are definitely large enough for it to be no anomaly.

(Eagle-eyed readers will note that these do not quite align with the numbers in the original linked discussion – those were a couple of points lower in recent decades. I have not quite worked out why, but I think this was an error in the earlier queries; possibly it was counting redirects?)

One last detail to note: the “date missing” cohort comes out over 90% non-athletes. Presumably this is because an athlete’s exact age is usually significant – tied to, eg, when they started playing professionally – and so is easily publicly available.

Methodology: the thousand-word footnote

Feel free to let your eyes glaze over now.

These numbers were constructed mostly using the petscan tool, and leveraging data from both English Wikipedia and Wikidata. From Wikipedia, we have a robust categorisation system for year/decade of birth, and for whether someone is a living person. From Wikidata, we have fairly comprehensive gender data, which Wikipedia doesn’t know about. (It also has dates of birth, but it is more efficient to use WP categories here). So it is straightforward to produce intersection queries like “all living people marked as 1920s births and marked as female” (report). Note that this is crunching a lot of data – don’t be surprised if queries take a minute or two to run or occasionally time out.
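(If you want to script this rather than use the web interface, PetScan can re-run a saved query by its psid. A minimal sketch – the psid below is a made-up placeholder, not one of the real reports:)

# 1234567 is a placeholder psid standing in for a saved query such as
# "living people + 1920s births + female"; tail drops the TSV header row,
# so the output is a bare article count.
curl -s 'https://petscan.wmflabs.org/?psid=1234567&format=tsv&doit=' | tail -n +2 | wc -l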

To my surprise, the report for “living people known to be female” produced a reliable figure from the start, but the one for “living people known to be male” produced an undercount. (I could validate this by checking against some small categories where I could run a report listing the gender of every item.) The root cause seemed to be a timeout in the Wikidata query – I was originally looking for { ?item wdt:P31 wd:Q5 . ?item wdt:P21 wd:Q6581097 } – items known to be human with gender male. Tweaking this to be simply { ?item wdt:P21 wd:Q6581097 } – items with gender male – produced a reliable figure. The same issue arose when trying to get a total for all items with any reported gender – simply { ?item wdt:P21 ?val } works.

Percentages are calculated as percentage of the number of articles identified as (male + female), rather than of all BLPs with a recorded gender value or simply of all BLPs. There are good arguments for either of the first two, but the former is simpler (some of my “any recorded gender value” queries timed out) and also consistent with the 2019 analysis.
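(As a worked example of that denominator choice, using the May 2019 totals quoted in the older post further down this page:)

# Female share of male + female, with the May 2019 counts; prints 22.72,
# ie the 22.7% headline figure.
F=205117 ; M=697402
echo "scale=2; 100 * $F / ($M + $F)" | bc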

A thornier problem comes from the sports element. There are a number of potential ways we could determine “sportiness”. The easiest option would be to use Wikidata occupation and look for something that indicates their occupation is some form of athlete, or that indicates a sport being played. The problem is that this is too all-encompassing, and would give us people who played sports but for whom it is not their main claim to fame. An alternative is to use the Wikipedia article categorisation hierarchy, but this is very complex and deep, making the queries very difficult to work with. The category hierarchy includes a number of surprise crosslinks and loops, meaning that deep queries tend to get very confusing results, or just time out.

The approach I eventually went with was to use Wikipedia’s infoboxes – the little standardised box on the top right of a page. There are a wide range of distinct infobox templates tailored to specific fields; each article usually only displays one, but can embed elements of others to bring in secondary data. If we look for articles using one of the 77(!) distinct sports infoboxes (report), we can conclude they probably had a significant sporting career. An article that does not contain one can be inferred to not have a sporting background.
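(To make that inference concrete, here is a minimal sketch – the directory layout is assumed, and the three template names are purely illustrative; the real report matched the full set of 77:)

# Sketch: list articles whose wikitext transcludes a sports infobox,
# given one wikitext file per article in articles/ (assumed layout).
for f in articles/*.wikitext ; do
  if grep -Eiq '\{\{ *infobox (sportsperson|football biography|cricketer)' "$f" ; then
    echo "$f"
  fi
done > athletes.txt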

But then we need to consider people with significant sports and non-sports careers. For example, the biographies of both Seb Coe and Tanni Grey-Thompson use the “infobox officeholder” to reflect their more recent careers in Parliament, but it is set up to embed a sports infobox towards the end. This means they are counted as athletes by our infobox method. That is probably correct for those two, but there are no doubt people out there where we would draw the line differently. (To stay with the UK political theme: how about Henry McLeish? His athletic career on its own would probably just qualify for a Wikipedia biography, but it is, perhaps, a bit of a footnote compared to being First Minister…)

So, here is another complication. How reliable is our assumption that an athlete has a sports infobox, and that non-athletes don’t? If it’s broadly true, great, our numbers hold up. If it’s not – particularly if it fails in some systematic way – there might be a more complex skew. I believe that for modern athletes it’s reasonably safe to assume infoboxes are nearly ubiquitous; there are groups of articles where they’re less common, but this isn’t one of them. However, I can’t say for sure; it’s not an area I’ve worked intensively in.

Finally, we have the issue of dates. We’ve based the calculation on Wikipedia categories. Wikipedia birth/death categories are pretty reliably used where that data is known. However, about 150k (14%) of our BLP articles are marked “year of birth unknown”, and these are disproportionately female (35.4%).

What effect do these factors have?

Counting the stats as percentage of M+F rather than percentage of all people with recorded gender could be argued either way, but the numbers involved are quite low and do not change the overall pattern of the results.

The infobox question is more complicated. It may mean we are not picking up all athletes, because some do not have infoboxes. On the other hand, it may mean we are being more expansive in counting people as athletes, because they have a “secondary” infobox along the lines of Coe & Grey-Thompson above. The problem there is defining where we draw the line, and what level of “other significance” stops someone being counted. That feels like a very subjective threshold, and hard to test for automatically. It is certainly a more conservative test than a Wikidata-based one, at least.

And for dates, hmm. We know that the articles that do not report an age are disproportionately female (35% vs the BLP average of 25%), but also that they are even more disproportionately “not athletes” (7% athletes vs the BLP average of 43%). There are also a lot of articles that don’t report an age; around 14% of all BLPs.

This one probably introduces the biggest question mark here. Depending on how that 14% break down, it could change the totals for the year-by-year cohorts; but there’s not really much we can do at the moment to work that out.

Anecdotally, I suspect that they are more likely to skew younger rather than being evenly distributed over time, but there is very little to go on here. However, I feel it is unlikely they would be distributed in such a way as to counteract the overall conclusions – this would require, for example, the female ones being predominantly shifted into older groups and the male ones into younger groups. It’s possible, but I don’t see an obvious mechanism to cause that.

[Edit 4/8/23 – tweaked to confirm that these are English Wikipedia only figures, after a reminder from Hilda. I would be very interested in seeing similar data for other projects, but the methodology might be tricky to translate – eg French and German do not have an equivalent category for indexing living people, and different projects may have quite different approaches for applying infoboxes.]

Gender and deletion on Wikipedia

So, a really interesting question cropped up this weekend:

I’m trying to find out how many biographies of living persons exist on the English Wikipedia, and what kind of data we have on them. In particular, I’m looking for the gender breakdown. I’d also like to know when they were created; average length; and whether they’ve been nominated for deletion.

This is, of course, something that’s being discussed a lot right now; there is a lot of emerging push-back against the excellent work being done to try and add more notable women to Wikipedia, and one particular deletion debate got a lot of attention in the past few weeks, so it’s on everyone’s mind. And, instinctively, it seems plausible that there is a bias in the relative frequency of nomination for deletion – can we find if it’s there?

My initial assumption was, huh, I don’t think we can do that with Wikidata. Then I went off and thought about it for a bit more, and realised we could get most of the way there with some inferences. Here are the results, and how I got there. Thanks to Sarah for prompting the research!

(If you want to get the tl;dr summary – yes, there is some kind of difference in the way older male vs female articles have been involved with the deletion process, but exactly what that indicates is not obvious without data I can’t get at. The difference seems to have mostly disappeared for articles created in the last couple of years.)

Statistics on the gender breakdown of BLPs

As of a snapshot of yesterday morning, 5 May 2019, the English Wikipedia had 906,720 articles identified as biographies of living people (BLPs for short). Of those, 697,402 were identified as male by Wikidata, 205,117 as female, 2,464 had some other value for gender, 1,220 didn’t have any value for gender (usually articles on groups of people, plus some not yet updated), and 517 simply didn’t have a connected Wikidata item (yet). Of those with known gender, it breaks down as 77.06% male, 22.67% female, and 0.27% some other value. (Because of the limits of the query, I didn’t try to break those down in any more detail.)

This is, as noted, only articles about living people; across all 1,626,232 biographies in the English Wikipedia with a gender known to Wikidata, it’s about 17.83% female, 82.13% male, and 0.05% some other value. I’ll be sticking to data on living people throughout this post, but it’s interesting to compare the historic information.

So, how has that changed over time?

BLPs by gender and date of creation

This graph shows all existing BLPs, broken down by gender and (approximately) when they were created. As can be seen, and as might be expected, the gap has closed a bit over time.

Percentage of BLPs which are female over time

Looking at the ratio over time (expressed here as %age of total male+female), the relative share of female BLPs was ~20% in 2009. In late 2012, the rate of creation of female BLPs kicked up a gear, and from then on it’s been noticeably above the long-term average (almost hitting 33% in late 2017, but dropping back since then). This has driven the overall share steadily and continually upwards, now at 22.7% (as noted above).

Now for the second question: do article lengths differ by gender? Indeed they do, by a small amount.

BLPs by current article size and date of creation

Female BLPs created at any time since 2009 are slightly longer on average than male ones of similar age, with only a couple of brief exceptions; the gap may be widening over the past year, but it is perhaps too soon to say for sure. The average difference is about 500 bytes, or a little under 10% of mean article size – not dramatic, but probably not trivial either. (Pre-2009 articles, not shown here, are about even on average.)

Note that this is raw bytesize – actual prose size will be smaller, particularly if an article is well-referenced; a single well-structured reference can be a few hundred characters. It’s also the current article size, not size at creation, which is why older articles tend to be longer – they’ve had more time to grow. It’s interesting to note that once articles are more than about five years old, they seem to plateau in average length.

Finally, the third question – have they been nominated for deletion? This was really interesting.

Percentage of BLPs which have previously been to AFD, by date of creation and gender

So, first of all, some caveats. This only identifies articles which go through the structured “articles for deletion” (AFD) process – nomination, discussion, decision to keep or delete. (There are three deletion processes on Wikipedia; the other two are more lightweight and do not show up in an easily traceable form). It also cannot specifically identify if that exact page was nominated for deletion, only that “an article with exactly the same page name has been nominated in the past” – but the odds are good they’re the same if there’s a match. It will miss out any where the article was renamed after the deletion discussion, and, most critically, it will only see articles that survived deletion. If they were deleted, I won’t be able to see them in this analysis, so there’s an obvious survivorship bias limiting what conclusions we can draw.

Having said all that…

Female BLPs created 2009-16 appear noticeably more likely than male BLPs of equivalent age to have been through a deletion discussion at some point in their lives (and, presumably, all have been kept). Since 2016, this has changed and the two groups are about even.

Alongside this, there is a corresponding drop-off in the number of articles created since 2016 which have associated deletion discussions. My tentative hypothesis is that articles created in the last few years are generally less likely to be nominated for deletion, perhaps because the growing use of things like the draft namespace (and associated reviews) means that articles are more robust when first published. Conversely, though, it’s possible that nominations continue at the same rate, but the deletion process is just more rigorous now and a higher proportion of those which are nominated get deleted (and so disappear from our data). We can’t tell.

(One possible explanation that we can tentatively dismiss is age – an article can be nominated at any point in its lifespan, so you would tend to expect a slowly increasing share over time. But I would expect the majority of deletion nominations to come in an article’s first few weeks, with the remainder spread fairly evenly after that; the drop-off here seems far too rapid to be explained by just article age.)

What we don’t know is what the overall nomination for deletion rate, including deleted articles, looks like. From our data, it could be that pre-2016 male and female articles are nominated at equal rates but more male articles are deleted; or it could be that pre-2016 male and female articles are equally likely to get deleted, but the female articles are nominated more frequently than they should be. Either of these would cause the imbalance. I think this is very much the missing piece of data and I’d love to see any suggestions for how we can work it out – perhaps something like trying to estimate gender from the names of deleted articles?

Update: Magnus has run some numbers on deleted pages, doing exactly this – inferring gender from pagenames. Of those which were probably a person, ~2/3 had an inferred gender, and 23% of those were female. This is a remarkably similar figure to the analysis here (~23% of current BLPs female; ~26% of all BLPs which have survived a deletion debate female)

So in conclusion

  • We know the gender breakdown: skewed male, but growing slowly more balanced over time, and better for living people than historical ones.
  • We know the article lengths; slightly longer for women than men for recent articles, about equal for those created a long time ago.
  • We know that there is something different about the way male and female biographies created before ~2017 experience the deletion process, but we don’t have clear data to indicate exactly what is going on, and there are multiple potential explanations.
  • We also know that deletion activity seems to be more balanced for articles in both groups created from ~2017 onwards, and that these also have a lower frequency of involvement with the deletion process than might have been expected. It is not clear what the mechanism is here, or if the two factors are directly linked.

How can you extract this data? (Yes, this is very dull)

The first problem was generating the lists of articles and their metadata. The English Wikipedia category system lets us identify “living people”, but not gender; Wikidata lets us identify gender (property P21), but not reliably “living people”. However, we can creatively use the petscan tool to get the intersection of a SPARQL gender query + the category. Instructing it to explicitly use Wikipedia (“enwiki” in other sources > manual list) and give output as a TSV – then waiting for about fifteen minutes – leaves you with a nice clean data dump. Thanks, Magnus!

(It’s worth noting that you can get this data with any characteristic indexed by Wikidata, or any characteristic identifiable through the Wikipedia category schema, but you will need to run a new query for each aspect you want to analyse – the exported data just has article metadata, none of the Wikidata/category information)

The exported files contain three things that are very useful to us: article title, pageid, and length. I normalised the files like so:

grep '[0-9]' enwiki_blp_women_from_list.tsv | cut -f 2,3,5 > women-noheader.tsv

This drops the header line (it’s the only one with no numeric characters) and extracts only the three values we care about (and conveniently saves about 20MB).

This gives us two of the things we want (age and size) but not deletion data. For that, we fall back on inference. Any article that is put through the AFD process gets a new subpage created at “Wikipedia:Articles for deletion/PAGENAME”. It is reasonable to infer that if an article has a corresponding AFD subpage, it’s probably about that specific article. This is not always true, of course – names get recycled, pages get moved – but it’s a reasonable working hypothesis and hopefully the errors are evenly distributed over time. I’ve racked my brains to see if I could anticipate a noticeable difference here by gender, as this could really complicate the results, but provisionally I think we’re okay to go with it.

To find out if those subpages exist, we turn to the enwiki dumps. Specifically, we want “enwiki-latest-all-titles.gz” – which, as the name suggests, is a simple file listing all page titles on the wiki. Extracted, it comes to about 1GB. From this, we can extract all the AFD subpages, like so:

grep "Articles_for_deletion/" enwiki-latest-all-titles | cut -f 2 | sort | uniq | cut -f 2 -d / | sort | uniq > afds

This extracts all the AFD subpages, removes any duplicates (since eg talkpages are listed here as well), and sorts the list alphabetically. There are about 424,000 of them.

Going back to our original list of articles, we want to bin them by age. To a first approximation, pageid is sequential with age – it’s assigned when the page is first created. There are some big caveats here; for example, a page being created as a redirect and later expanded will have the ID of its initial creation. Pages being deleted and recreated may get a new ID, pages which are merged may end up with either of the original IDs, and some complicated page moves may end up with the original IDs being lost. But, for the majority of pages, it’ll work out okay.

To correlate pageid to age, I did a bit of speculative guessing to find an item created on 1 January and 1 July every year back to 2009 (eg pageid 43190000 was created at 11am on 1 July 2014). I could then use these to extract the articles corresponding to each period, like so:

...
awk -F '\t' '$2 >= 41516000 && $2 < 43190000' < men-noheader.tsv > bins/2014-1-M
awk -F '\t' '$2 >= 43190000 && $2 < 44909000' < men-noheader.tsv > bins/2014-2-M
...

This finds all items with a pageid (in column #2 of the file) between the specified values, and copies them into the relevant bin. Run once for men and once for women.
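(Incidentally, the speculative guessing can be sanity-checked against the live API – this asks for the timestamp of the earliest revision of a given pageid, which should come back as 1 July 2014 for the example above:)

curl -s 'https://en.wikipedia.org/w/api.php?action=query&pageids=43190000&prop=revisions&rvdir=newer&rvlimit=1&rvprop=timestamp&format=json'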

Then we can run a short report, along these lines (the original had loops in it):

  cut -f 1 bins/2014-1-M | sort > temp-M
  echo -e 2014-1-M"\tM\t"`cat bins/2014-1-M | wc -l`"\t"`awk '{ total += $3; count++ } END { print total/count }' bins/2014-1-M`"\t"`comm -1 -2 temp-M afds | wc -l` >> report.tsv

This adds a line to the file report.tsv with (in order) the name of the bin, the gender, the number of entries in it, the mean value of the length column, and a count of the number which also match names in the afds file. (The use of the temp-M file is to deal with the fact that the comm tool needs properly sorted input.)
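For completeness, a looped version along those lines might look like this – a sketch reconstruction rather than the original script, assuming the bin files are named as above with a trailing -M or -F:

# One report line per bin: bin name, gender, count, mean length, AFD matches.
for bin in bins/* ; do
  b=$(basename "$bin")
  g=${b##*-}                              # trailing gender marker, M or F
  cut -f 1 "$bin" | sort > "temp-$g"      # titles, sorted for comm
  n=$(wc -l < "$bin")
  mean=$(awk '{ total += $3; count++ } END { if (count) print total/count }' "$bin")
  afd=$(comm -12 "temp-$g" afds | wc -l)  # titles with a matching AFD subpage
  printf '%s\t%s\t%s\t%s\t%s\n' "$b" "$g" "$n" "$mean" "$afd"
done > report.tsv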

After that, generating the data is lovely and straightforward – drop the report into a spreadsheet and play around with it.

On Article 50

Since early last year I have lived in Islington North (pretty much bang in the middle of it, in fact). This means that my MP is one Jeremy Corbyn.

This is a bit odd. I’ve never been represented by someone who wasn’t a backbencher before (at least, not at Westminster; for a brief while years ago my MSP was the subduedly-titled Deputy Minister for Justice). It also means that there is very little reason for me to ever write to my MP – his position on any given issue is usually front-page news, and I can figure out pretty quickly that he has either already made a statement supporting it or disagrees with me entirely.

But, the Article 50 vote looms, and I felt I ought to do it for once. I know he disagrees with me; I know he’s whipped his party that way. The letter is a cry in the dark. But, well, you do what you must do.

Dear Mr. Corbyn,

I am writing in regard to the Article 50 second reading vote scheduled for Wednesday February 1st. As a constituent, I urge you to reconsider your position on this bill, and to vote against it at the second reading.

Firstly, I wish to remind you that around 75% of your constituents voted Remain, on a turnout of 70%. Not only was Islington one of the most strongly pro-EU areas of the country, this was a larger share of the electorate than you yourself have ever received from the constituency – and it has always been a solidly Labour seat. This is a remarkable result, and I feel it is only proper that you acknowledge your constituents’ clearly expressed position here.

Secondly, on pragmatic grounds, this bill is likely to pass without significant amendments, and thus without any controls on Brexit barring those imposed by a weak Prime Minister. As such, it is essentially handing a blank cheque to the hard right of the Conservative Party, giving them carte blanche to engineer a Brexit most suited to their desired outcomes – xenophobic, intolerant, and devastating to the British people. This is a horrendous prospect.

Rejecting this bill at second reading will not stop Brexit and will not invalidate the referendum. However, rejecting the bill will have a decent chance of forcing these discussions to be open, to take place on a cross-party basis, and ensure that what emerges has a chance of being positive for us all.

Thirdly, the wider context. Internationally, the world has changed dramatically since last summer. Europe, with all its flaws, is a beacon of light and sanity compared to the United States, our closest non-EU ally. As you yourself noted yesterday, the Prime Minister’s support for Donald Trump places her firmly on the wrong side of history.

And in this light, the referendum result has some resonance. You were one of 84 Labour members to defy a whip and vote against the invasion of Iraq. A poll conducted the same day found about 50-55% of the country in favour of the war – roughly the same share that voted to leave the EU.

A slim majority of the country – and the government – got it wrong a decade ago. We are not infallible. Sometimes, we all take the wrong steps and put ourselves on the wrong side of history. Now is a chance to put the brakes on and decide quite what we are doing, to move slowly and contemplatively, before continuing further.

I urge you to vote against this bill.

Taking pictures with flying government lasers

Well, sort of.

A few weeks ago, the Environment Agency released the first tranche of their LIDAR survey data. This covers (most of) England at resolutions varying from 2m down to 25cm, collected via airborne LIDAR survey.

It’s great fun. After a bit of back-and-forth (and hastily figuring out how to use QGIS), here are two rendered images I made of Durham, one with buildings and one without, now on Commons:

The first is shown with buildings, the second without. Both are at 1m resolution, the best currently available for the area. Note in particular the very striking embankment and cutting for the railway viaduct (top left). These look like they could be very useful things to produce for Commons, especially since it’s – effectively – very recent, openly licensed, aerial imagery…

1. Selecting a suitable area

Generating these was, on the whole, fairly easy. First, install QGIS (simplicity itself on a linux machine, probably not too much hassle elsewhere). Then, go to the main data page and find the area you’re interested in. It’s arranged on an Ordnance Survey grid – click anywhere on the map to select a grid square. Major grid squares (Durham is NZ24) are 10km by 10km, and all data will be downloaded in a zip file containing tiles for that particular region.

Let’s say we want to try Cambridge. The TL45 square neatly cuts off North Cambridge but most of the city is there. If we look at the bottom part of the screen, it offers “Digital Terrain Model” at 2m and 1m resolution, and “Digital Surface Model” likewise. The DTM is the version just showing the terrain (no buildings, trees, etc) while the DSM has all the surface features included. Let’s try the DSM, as Cambridge is not exactly mountainous. The “on/off” slider will show exactly what the DSM covers in this area, though in Cambridge it’s more or less “everything”.

While this is downloading, let’s pick our target area. Zooming in a little further will show thinner blue lines and occasional superimposed blue digits; these define the smaller squares, 1 km by 1 km. For those who don’t remember learning to read OS maps, the number on the left and the number on the bottom, taken together, define the square. So the sector containing all the colleges along the river (a dense clump of black-outlined buildings) is TL4458.

2. Rendering a single tile

Now your zip file has downloaded, drop all the files into a directory somewhere. Note that they’re all named something like tl4356_DSM_1m.asc. Unsurprisingly, this means the 1m DSM data for square TL4356.

Fire up QGIS, go to Layer > Add raster layer, and select your tile – in this case, TL4458. You’ll get a crude-looking monochrome image, immediately recognisable by a broken white line running down the middle. This is the Cam. If you’re seeing this, great, everything’s working so far. (This step is very helpful to check you are looking at the right area)

Now, let’s make the image. Project > New to blank everything (no need to save). Then Raster > Analysis > DEM (terrain models). In the first box, select your chosen input file. In the next box, the output filename – with a .tif suffix. (Caution, linux users: make sure to enter or select a path here, otherwise it seems to default to home). Leave everything else as default – all unticked and mode: hillshade. Click OK, and a few seconds later it’ll give a completed message; cancel out of the dialogue box at this point. It’ll be displaying something like this:

Congratulations! Your first LIDAR rendering. You can quit out of QGIS (you can close without saving, your converted file is saved already) and open this up as a normal TIFF file now; it’ll be about 1MB and cover an area 1km by 1km. If you look closely, you can see some surprisingly subtle details despite the low resolution – the low walls outside Kings College, for example, or cars on the Queen’s Road – Madingley Road roundabout by the top left.

3. Rendering several tiles

Rendering multiple squares is a little trickier. Let’s try doing Barton, which conveniently fits into two squares – TL4055 and TL4155. Open QGIS up, and render TL4055 as above, through Raster > Analysis > DEM (terrain models). Then, with the dialogue window still open, select TL4155 (and a new output filename) and run it again. Do this for as many files as you need.

After all the tiles are prepared, clear the screen by starting a new project (again, no need to save) and go to Raster > Miscellaneous > Merge. In “Input files”, select the two exports you’ve just done. In “Output file”, pick a suitable filename (again ending in .tif). Hit OK, let it process, then close the dialog. You can again close QGIS without saving, as the export’s complete.

The rendering system embeds coordinates in the files, which means that when they’re assembled and merged they’ll automatically slot together in the correct position and orientation – no need to manually tile them. The result should look like this:

The odd black bit in the top right is the edge of the flight track – there’s not quite comprehensive coverage. This is a mainly agricultural area, and you can see field markings – some quite detailed, and a few bits on the bottom of the right-hand tile that might be traces of old buildings.
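(Incidentally, you can inspect the embedded georeferencing that makes the automatic alignment work with gdalinfo, part of the same GDAL toolkit QGIS uses under the hood – the filename here is whatever you called your merged export:)

gdalinfo barton-merged.tif

This prints the coordinate system, origin and pixel size of the file; two tiles slot together cleanly when those all line up.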

So… go forth! Make LIDAR images! See what you can spot…

4. Command-line rendering in bulk

Richard Symonds (who started me down this rabbit-hole) points out this very useful post, which explains how to do the rendering and merging via the command line. Let’s try the entire Durham area; 88 files in NZ24, all dumped into a single directory –

for i in *.asc ; do gdaldem hillshade -compute_edges "$i" "$i.tif" ; done

gdal_merge.py -o NZ24-area.tif *.tif

rm *.asc.tif

In order, that a) runs the hillshade program on each individual source file ; b) assembles them into a single giant image file; c) removes the intermediate images (optional, but may as well tidy up). The -compute_edges flag helpfully removes the thin black lines between sectors – I should have turned it on in the earlier sections!

Canadian self-reported birthday data

In the last post, we saw strong evidence for a “memorable date” bias in self-reported birthday information among British men born in the late 19th century. In short, they were disproportionately likely to think they were born on an “important day” such as Christmas.

It would be great to compare it to other sources. However, finding a suitable dataset is challenging. We need a sample covering a large number of men, over several years, which is unlikely to be cross-checked or drawn from official documentation such as birth certificates or parish registers, and which explicitly lists full birthdates (not just month or year).

WWI enlistment datasets are quite promising in this regard – lots of men, born about the same time, turning up and stating their details without particularly much of a reason to bias individual dates. The main British records have (famously) long since burned, but the Australian and Canadian records survive. Unfortunately, the Australian index does not include dates of birth, but the Canadian index does (at least, when known). So, does it tell us anything?

The index is available as a 770MB+ XML blob (oh, dear). Running this through xmllint produces a nicely formatted file with approximately 575,000 birthdays for 622,000 entries. It’s formatted in such a way as to imply there may be multiple birthdates listed for a single individual (presumably where there’s contradictory data?), but I couldn’t spot any cases. There are also about ten thousand entries without nicely formatted dd/mm/yyyy dates; let’s omit those for now. Quick and dirty, but probably representative.
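(For the curious, the tally itself is a single pipeline once the XML is pretty-printed – the input filename here is assumed, and as above only well-formed dd/mm/yyyy dates are kept:)

# Pretty-print the raw blob, pull out anything shaped dd/mm/yyyy, and
# count how often each day/month combination appears.
xmllint --format cef-index.xml > cef-formatted.xml
grep -oE '[0-9]{2}/[0-9]{2}/[0-9]{4}' cef-formatted.xml | cut -d/ -f1,2 | sort | uniq -c | sort -rn | head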

And so…

There’s clearly a bit more seasonality here than in the British data (up in spring, down in winter), but also the same sort of unexpected one-day spikes and troughs. As this is quite rough, I haven’t corrected for seasonality, but we still see something interesting.

The highest days are: 25 December (1.96), 1 January (1.77), 17 March (1.56), 24 May (1.52), 1 May (1.38), 15 August (1.38), 12 July (1.36), 15 September (1.34), 15 March (1.3).

The lowest ten days are: 30 December (0.64), 30 January (0.74), 30 October (0.74), 30 July (0.75), 30 May (0.78), 13 November (0.78), 30 August (0.79), 26 November (0.80), 30 March (0.81), 12 December (0.81).

The same strong pattern for “memorable days” that we saw with the UK is visible at the top of the list – Christmas, New Year, St. Patrick’s, Victoria Day, May Day, [nothing], 12 July, [nothing], [nothing].

Two of these are distinctively “Canadian” – both 24 May (the Queen’s birthday/Victoria Day) and 12 July (the Orange Order marches) are above average in the British data, but not as dramatically as they are here. Both appear to have been relatively more prominent in late-19th/early-20th century Canada than in the UK. Canada Day/Dominion Day (1 July) is above average but does not show up as sharply, possibly because it does not appear to have been widely celebrated until after WWI.

One new pattern is the appearance of the 15th of the month in the top 10. This was suggested as likely in the US life insurance analysis and I’m interested to see it showing up here. Another oddity is leap years – in the British data, 29 February was dramatically undercounted. In the Canadian data, it’s strongly overcounted – just not quite enough to get into the top ten. 28 February (1.28), 29 February (1.27) and 1 March (1.29) are all “memorable”. I don’t have an explanation for this but it does suggest an interesting story.

Looking at the lowest days, we see the same pattern of 30/xx dates being very badly represented – seven of the ten lowest dates are the 30th of the month… and all in months with 31 days. This is exactly the same pattern we observed in the UK data, and I just don’t have any convincing reason to guess why. The other three dates all fall in low-birthrate months.

So, in conclusion:

  • Both UK and Canadian data from WWI show a strong bias for people to self-report their birthday as a “memorable day”;
  • “Memorable” days are commonly a known and fixed festival, such as Christmas;
  • Overreporting of arbitrary dates like the 15th of the month is more common in Canada (& possibly the US?) than the UK;
  • The UK and Canadian samples seem to treat 29 February very differently – Canadians overreport, British people underreport;
  • There is a strong bias against reporting the 30th of the month, particularly in months with 31 days.

Thoughts (or additional data sources) welcome.

When do you think you were born?

Back in the last post, we were looking at a sample of dates-of-birth in post-WWI Army records.

(To recap – this is a dataset covering every man who served in the British Army after 1921 and who had a date of birth in or before 1900. 371,716 records in total, from 1864 to 1900, strongly skewed towards the recent end.)

I’d suggested that there was an “echo” of 1914/15 false enlistment in there, but after a bit of work I’ve not been able to see it. However, it did throw up some other very interesting things. Here’s the graph of birthdays.

Two things immediately jump out. The first is that the graph, very gently, slopes upwards. The second is that there are some wild outliers.

The first one is quite simple to explain; this data is not a sample of men born in a given year, but rather those in the army a few decades later. The graph in the previous post shows a very strong skew towards younger ages, so for any given year we’d expect to find marginally more December births than January ones. I’ve normalised the data to reflect this – calculated what the expected value for any given day would be assuming a linear increase, then calculated the ratio of reported to expected births. [For 29 February, I quartered its expected value]
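(For anyone wanting to reproduce that step, a minimal sketch – assuming a tab-separated input of day-of-year and count, with an ordinary least-squares line standing in for the trend fit:)

# Fit a straight line to counts by day-of-year, then print observed/expected.
# (29 February would additionally have its expected value quartered.)
awk -F '\t' '
  { day[NR] = $1; obs[NR] = $2; n++
    sx += $1; sy += $2; sxx += $1*$1; sxy += $1*$2 }
  END {
    b = (n*sxy - sx*sy) / (n*sxx - sx*sx)   # slope
    a = (sy - b*sx) / n                     # intercept
    for (i = 1; i <= n; i++)
      printf "%s\t%.2f\n", day[i], obs[i] / (a + b*day[i])
  }' birthday-counts.tsv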

There are hints at a seasonal pattern here, but not a very obvious one. January, February, October and November are below average, March and September above average, and the rest of the spring-summer is hard to pin down. (For quite an interesting discussion on “European” and “American” birth seasonality, see this Canadian paper)

The interesting bit is the outliers, which are apparent in both graphs.

The most overrepresented days are, in order of frequency, 1 January (1.8), 25 December (1.43), 17 March (1.33), 28 February (1.27), 14 February (1.22), 1 May (1.22), 11 November (1.19), 12 August (1.17), 2 February (1.15), and 10 October (1.15). Conversely, the most underrepresented days are 29 February (0.67 after adjustment), 30 July (0.75), 30 August (0.78), 30 January (0.81), 30 March (0.82), and 30 May (0.84).

Of the ten most common days, seven are significant festivals. In order: New Year’s Day, Christmas Day, St. Patrick’s Day, [nothing], Valentine’s Day, May Day, Martinmas, [nothing], Candlemas, [nothing].

Remember, the underlying bias of most data is that it tells you what people put into the system, not what really happened. So, what we have is a dataset of what a large sample of men born in late nineteenth century Britain thought their birthdays were, or of the way they pinned them down when asked by an official. “Born about Christmastime” easily becomes “born 25 December” when it has to go down on a form. (Another frequent artefact is overrepresentation of 1-xx or 15-xx dates, but I haven’t yet looked for this.) People were substantially more likely to remember a birthday as associated with a particular festival or event than they were to remember a random date.

It’s not all down to being memorable, of course; 1 January is probably in part a data recording artefact. I strongly suspect that at some point in the life of these records, someone’s said “record an unknown date as 1/1/xx”.

The lowest days are strange, though. 29 February is easily explained – even correcting for it being one quarter as common as other days, many people would probably put 28 February or 1 March on forms for simplicity. (This also explains some of the 28 February popularity above). But all of the other five are 30th of the month – and all are 30th of a 31-day month. I have no idea what might explain this. I would really, really love to hear suggestions.

One last, and possibly related, point – each month appears to have its own pattern. The first days of the month are overrepresented; the last days underrepresented. (The exceptions are December and possibly September.) This is visible in both normalised and raw data, and I’m completely lost as to what might cause it…

Back to the Army again

In the winter of 1918-19, the British government found itself in something of a quandary. On the one hand, hurrah, the war was over! Everyone who had signed up to serve for “three years or the duration” could go home. And, goodness, did they want to go home.

On the other hand, the war… well it wasn’t really over. There were British troops fighting deep inside Russia; there were large garrisons sitting in western Germany (and other, less probable, places) in case the peace talks collapsed; there was unrest around the Empire and fears about Bolsheviks at home.

So they raised another army. Anyone in the army who volunteered to re-enlist got a cash payment of £20 to £50 (no small sum in 1919); two months’ leave with full pay; plus pay comparable to wartime rates and a separation allowance if he was married. Demobilisation continued for everyone else (albeit slowly), and by 1921 this meant that everyone in the Army was either a very long-serving veteran, a new volunteer who’d not been conscripted during wartime (so born 1901 onwards) or – I suspect the majority – re-enlisted men working through their few years of service.

For administrative convenience, all records of men who left up to 1921 were set aside and stored by a specific department; the “live” records, including more or less everyone who reenlisted, continued with the War Office. They were never transferred – and, unlike the pre-1921 records, they were not lost in a bombing raid in 1940.

The MoD has just released an interesting dataset following an FOI request – it’s an index of these “live” service records. The records cover all men in the post-1921 records with a DoB prior to 1901, and thus almost everyone in it would have either remained in service or re-enlisted – there would be a small proportion of men born in 1900 who escaped conscription (roughly 13% of them would have turned 18 just after 11/11/18), and a small handful of men will have re-enlisted or transferred in much later, but otherwise – they all would have served in WWI and chosen to remain or to return very soon after being released.

So, what does this tell us? Well, for one thing, there’s almost 317,000 of them. 4,864 were called Smith, 3,328 Jones, 2,104 Brown, 1,172 Black, etc. 12,085 were some form of Mac or Mc. And there are eight Singhs, which looks like an interesting story to trace about early immigrants.

But, you know, data cries out to be graphed. So here’s the dates of birth.

Since the 1900 births are probably an overcount for reenlistments, I’ve left these off.

It’s more or less what you’d expect, but on close examination a little story emerges. Look at 1889/90; there’s a real discontinuity here. Why would this be?

Pre-war army enlistments were not for ‘the duration’ (there was nothing to have a duration of!) but for seven years’ service and five in the reserves. There was a rider on this – if war broke out, you wouldn’t be discharged until the crisis was over. The men born in 1889/90 would have enlisted around 1908 and been due for release to the reserves in 1915. Of course, that never happened… and so, in 1919, many of these men would have been 29 or 30, knowing no other career than soldiering. Many would have been thrilled to get out – and quite a few more would have considered it, and realised they had no trade, and no great chance of good employment. As Kipling had it in 1894:

A man o’ four-an’-twenty what ‘asn’t learned of a trade—
Except “Reserve” agin’ him—’e’d better be never made.

It probably wasn’t much better for him in 1919.

Moving right a bit, 1896-97 also looks odd – this is the only point in the data where it goes backwards, with marginally more men born in 1896 than 1897. What happened here?

Anyone born before August 1896 was able to rush off and enlist at the start of the war; anyone born after that date would either have to wait, or lie. Does this reflect a distant echo of people giving false ages in 1914/15 and still having them on the paperwork at reenlistment? More research no doubt needed, but it’s an interesting thought.