Demographics in Wikipedia

There’s a lengthy internal debate going on in Wikipedia at the moment (see here, if you really want to look inside the sausage factory) about how best to deal with the perennial issue of biographies of living people, of which there are about 400,000.

As an incidental detail to this, people have been examining the issue from all sorts of angles. One particularly striking graph that’s been floating around shows the number of articles marked as being born or died in any given year from the past century:


As the notes point out, we can see some interesting effects here. The first, and most obvious, is “recentism”: people who are alive and active in the present era are more likely to have articles written about them, so you get more very recent deaths than (say) deaths from forty years ago. Likewise, there is a spike of births around the late 1970s / early 1980s, made up of people who’re just coming to public attention – in other words, people in their late twenties or early thirties are more likely to have articles written about them.

If we look back with a longer-term perspective, we can see that the effects of what Wikipedia editors have chosen to write about diminish, and the effects of demographics become more obvious. There are, for example, suggestions of prominent blips in the deathrate during the First and Second World Wars, and what may be the post-war baby boom showing up in the late 1940s.

So, we can distinguish two effects: the underlying demographics, and what people choose to write about.

(In case anyone is wondering: people younger than 25 drop off dramatically. The very youngest are less than a year old, and are invariably articles about a) heirs to a throne; b) notorious child-murder cases; c) particularly well-reported conjoined twins or other multiple births. By about the age of five you start getting a fair leavening of child actors and the odd prodigy.)

Someone then came up with this graph, which is the same dataset drawn from the French Wikipedia:


At a glance, they look quite similar, which tells us that the overall dynamic guiding article-writing is broadly the same in both cases. This doesn’t sound like much of a finding, but different language editions can vary quite dramatically in things like standards for what constitutes a reasonable topic, so it is useful to note. French has a more pronounced set of spikes in WWI, WWII, and the post-war baby boom, though, as well as a very distinctive lowering of the birthrate during WWI. These are really quite interesting, especially the last one, because it suggests we’re seeing a different underlying dynamic. And the most likely underlying dynamic is, of course, that Francophones tend to prefer writing about Francophones, and Anglophones tend to prefer writing about Anglophones…

So, how does this compare in other languages? I took these two datasets, and then added Czech (which someone helpfully collected), German and Spanish. (The latter two mean we have four of the five biggest languages represented. I’d have liked to include Polish, but the data was not so easily accessible.) I then normalised it, so each year was a percentage of the average for that language for that century, and graphed them against each other:
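For anyone wanting to repeat the normalisation step, here is a minimal sketch in Python of what I mean by "a percentage of the average". The function name and the sample counts are made up for illustration; they are not the real figures, and this is not the exact method used to produce the graphs.

```python
def normalise_to_average(counts_by_year):
    """Express each year's count as a percentage of the mean yearly count.

    counts_by_year maps a year to the number of articles for that year
    (births or deaths) in one language edition, over the period of interest.
    A value of 100.0 means "exactly average for this language".
    """
    if not counts_by_year:
        return {}
    mean = sum(counts_by_year.values()) / len(counts_by_year)
    if mean == 0:
        raise ValueError("cannot normalise a series whose average is zero")
    return {year: 100.0 * count / mean for year, count in counts_by_year.items()}


# Hypothetical counts for three years in one language (illustrative only).
sample = {1900: 50, 1901: 100, 1902: 150}
normalised = normalise_to_average(sample)
```

Because every language is scaled against its own average, a big edition and a small one can be drawn on the same axes. The trade-off is that a language without a bulge in one period will look inflated in the others.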

What can we see from these? Overall, every project has basically the same approach to inclusion: ramping up steadily over time, a noticeable spike in people who died during WWII or in the past two decades, and a particular interest in people who are about thirty and in the public eye. There is one important exception to this last case – German, which has a flat birthrate from about 1940 onwards, and apparently no significant recentism in this regard. The same is true of Czech to a limited degree. (Anecdotally I believe the same may be true of Japanese, but I haven’t managed to gather the data yet.)

The WWII death spike is remarkably prominent in German and Czech, moderately prominent in French, and apparent but less obvious in English and Spanish. This could be differential interest in military history, where biographies tend to have deaths clustered in wartime, but it also seems rational to assume this reflects something of the underlying language-biased data. More Central Europeans died in WWII than Western Europeans; proportionally fewer died in the Anglosphere because English-speaking civilian populations escaped the worst of it, and the Spanish-speaking world was mostly uninvolved. The deaths in WWI are a lot more tightly clustered, and it’s hard to determine anything for sure here.

The other obvious spike in deaths is very easy to understand from either interpretation of the reason; it’s in 1936, in Spanish, which coincides with the outbreak of the Civil War. Lots of people to write articles about, there, and people less likely to be noted outside of Spain itself.

I mentioned above that (older) birthrates are more likely to represent an underlying demographic reality than deathrates are; localised death rates could be altered by a set of editors who choose to write on specific themes. You’d only get a birthdate spike, it seems, if someone was explicitly choosing to write about people born in a specific period; it’s hard to imagine it from a historical perspective. Historically linked people are grouped by when they’re prominent and active, and that happens at a variable time in their lives, so someone specifically writing about a group of people is likely to “smear” out their birthdates in a wide distribution.

So, let’s look at the historic births graph and see if anything shows up there. German and French show very clear drops in the birth rate between 1914 and about 1920, round U-shaped falls. German appears to have a systemic advantage over the other projects in birthrate through the 1930s and 1940s, though as the data is normalised against an average this may be misleadingly inflated – it doesn’t have the post-1970 bulge most languages do. The very sharp drop in births in 1945 is definitely not an artefact, though; you can see it to a lesser degree in the other languages, except English, where it’s hardly outside normal variance.

So, there does seem to be a real effect here; both these phenomena seem predictable as real demographic events, and the difference between the languages is interpretable as different populations suffering different effects in these periods and being represented to different degrees in the selection of people by various projects.

The next step would be, I suppose, to compare those figures to known birth and death rates both globally and regionally over the period; this would let us estimate the various degrees of “parochialism” involved in the various projects’ coverage of people, as well as the varying degrees of “recentness” which we’ve seen already. Any predictions?

6 thoughts on “Demographics in Wikipedia”

  1. Marvellous, thank you!

    I was actually plotting out how to write a script to do this for me – doing it by hand was pretty tedious, as you can imagine – but discovering someone’s done it for me is even better :-)

    I’ll rerun the data this weekend, hopefully, and see if any new patterns emerge. Thanks again…
