Gender and BLPs on Wikipedia, redux

Back in 2019 I generated some data on how the English Wikipedia’s biographies of living people (BLPs) broke down by gender, and how that intersected with creation and deletion rates. The headline figures were that:

  • The gender split differed significantly between all biographies (17.8% women) and biographies of living people (BLPs), at 22.7% women.
  • In 2009, around 20% of existing BLPs were on women. As time went on, the share of BLPs on women increased slowly, by perhaps a quarter of a percentage point per year.
  • In 2009, around 20% of newly created BLPs were on women. In about 2012, this kicked up a gear, rising above the long term average – first to about 25%, peaking around 33% before falling back a little.
  • BLP articles on women were more likely to be nominated for deletion until about 2017, when the effect disappeared.

One thing that was raised during the subsequent discussion was that a lot of the skew by gender was potentially linked to subject area – I was able to identify that for athletes (defined broadly as all sports players) the articles were much more likely to be on men. I didn’t investigate this too much, though. Last week, I was reminded about this, and I’ve been looking at the numbers again. It brought up two interesting divergences.

Please don’t make me read all this

Okay – but you’ll miss the graphs. In summary:

English Wikipedia has more women in recent cohorts (about 25% of living people born since the seventies), and there are far more men among athletes. Since athletes make up a staggeringly high share of articles on younger subjects, the gender split among non-athletes is much more balanced – a little under a third female overall, but breaking 50% female among the younger cohorts.

Still with me? Let’s start. Sorry about the spoilers.

Time and tide

The first phenomenon is very straightforward: while the overall percentage across all people is around 25% women, how that is distributed over time varies. In general, there is a steady rise until about the 1970s; for those born from the 1970s onwards, the generation who are currently in their active working lives, the level is relatively stable at around 25% women.

The exception is those born in the 1920s (where it sits at 26%) – this is presumably affected by the fact that at this point, female life expectancy is significantly higher than male, and so the proportion of women begins to rise as a result.

One surprising outcome, however, is that the share of living people with no recorded age (green) is much more female than the average. This is a large cohort – there are in fact slightly more women in it than in any individual decade. I believe that it skews young – in other words, were this information known, it would increase the share of women in recent decades – but it is hard to find a way to confirm this. This issue is discussed in more detail below.

(Those born in the 2010s/20s and in the 1900s/10s are omitted – the four groups have a total of 175 articles, while the cohorts shown range from 5,000 to 170,000 – but the levels are around 50%. This is likely due to life expectancy in the oldest cohorts, and the fact that the people in the youngest cohorts are mostly notable at this point as being “the child of someone famous” – which you would broadly expect to be independent of gender.)

The percentages shown here are of the total male + female articles, but it is also possible to calculate the share of people who have a recorded gender that is not male/female. These show a very striking rise over time, though it should be cautioned that the absolute numbers are small – the largest single cohort is the 1980s with 345 people out of 170,000.

Sports by the numbers

The original question was to look at what the effect of athlete articles is on the overall totals. It turns out the effect is… very striking.

A few things are immediately apparent. The first is that the share of athletes is very substantial – they make up only around a quarter of people born in the 1950s, but 85-90% of people born in the 1990s/2000s.

The second is that those athletes are overwhelmingly men – among the 1950s cohort, only about 10% of those athletes are female, and even by recent years it is only around 20%. This means that if we look purely at the non-athlete articles, the gender split becomes a lot more balanced.

Across all living non-athletes, it is around 32% female. But among living non-athletes born since 1990, the gender balance is over 50% female.

This is a really amazing figure. I don’t think I ever particularly expected to see a gender analysis on Wikipedia that would break 50%. Granted, the absolute numbers involved are low – as is apparent from the previous graph, “non-athletes born in the 1990s” is around 22,000 people, and “born in the 2000s” is as low as 2,500 – but it’s a pretty solid trend and the total numbers for the earlier decades are definitely large enough for it to be no anomaly.
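To see how a cohort that is only a quarter women overall can still come out majority-female once athletes are removed, here is a minimal Python sketch. The counts are illustrative round numbers chosen to match the approximate percentages above (25% women overall, 87% athletes, 20% of athletes female); they are not the real query results, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """Article counts for one birth-decade cohort (illustrative, not real data)."""
    women: int
    men: int
    women_athletes: int
    men_athletes: int


def pct_women(women: int, men: int) -> float:
    """Women as a percentage of (women + men)."""
    return 100.0 * women / (women + men)


def non_athlete_pct_women(c: Cohort) -> float:
    """Share of women once everyone with a sports infobox is removed."""
    return pct_women(c.women - c.women_athletes, c.men - c.men_athletes)


# Hypothetical cohort of 100,000 articles, 87,000 of them athletes.
example = Cohort(women=25_000, men=75_000, women_athletes=17_400, men_athletes=69_600)

overall = pct_women(example.women, example.men)                      # 25.0
athletes = pct_women(example.women_athletes, example.men_athletes)   # 20.0
non_athletes = non_athlete_pct_women(example)                        # roughly 58.5
```

With these made-up inputs, the 13,000 non-athletes are 7,600 women and 5,400 men, which is how a small non-athlete remainder can sit well above 50% while the cohort as a whole stays at 25%.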

(Eagle-eyed readers will note that these do not quite align with the numbers in the original linked discussion – those were a couple of points lower in recent decades. I have not quite worked out why, but I think this was an error in the earlier queries; possibly it was counting redirects?)

One last detail to note: the “date missing” cohort comes out over 90% non-athletes. Presumably this is because an athlete’s exact age is often significant and linked to, eg, when they start professional sports, so it’s easily publicly available.

Methodology: the thousand word footnote

Feel free to let your eyes glaze over now.

These numbers were constructed mostly using the petscan tool, and leveraging data from both English Wikipedia and Wikidata. From Wikipedia, we have a robust categorisation system for year/decade of birth, and for whether someone is a living person. From Wikidata, we have fairly comprehensive gender data, which Wikipedia doesn’t know about. (It also has dates of birth, but it is more efficient to use WP categories here). So it is straightforward to produce intersection queries like “all living people marked as 1920s births and marked as female” (report). Note that this is crunching a lot of data – don’t be surprised if queries take a minute or two to run or occasionally time out.

To my surprise, the report for “living people known to be female” produced a reliable figure from the start, but the one for “living people known to be male” produced an undercount. (I could validate this by checking against some small categories where I could run a report listing the gender of every item). The root cause seemed to be a timeout in the Wikidata query – I was originally looking for { ?item wdt:P31 wd:Q5 ; wdt:P21 wd:Q6581097 } – items known to be human with gender male. Tweaking this to be simply { ?item wdt:P21 wd:Q6581097 } – items with gender male – produced a reliable figure. I hit the same issue when trying to get a total for all items with reported gender, where simply { ?item wdt:P21 ?val } works.
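For reference, the two query shapes can be written out in full. This is a minimal Python sketch that only builds the SPARQL text and does not run it; the function name is hypothetical, and the full form of the original queries is an assumption based on the fragments quoted above.

```python
MALE = "Q6581097"    # Wikidata item for "male"
FEMALE = "Q6581072"  # Wikidata item for "female"


def gender_query(gender_qid: str, require_human: bool = False) -> str:
    """Build a SPARQL query selecting items with the given P21 (sex or gender) value."""
    patterns = []
    if require_human:
        # Extra "instance of human" check; this seemed to be what pushed
        # the male query over the timeout.
        patterns.append("?item wdt:P31 wd:Q5 .")
    patterns.append(f"?item wdt:P21 wd:{gender_qid} .")
    return "SELECT ?item WHERE { " + " ".join(patterns) + " }"


slow_query = gender_query(MALE, require_human=True)
fast_query = gender_query(MALE)
```

Dropping the P31 check is presumably safe here because the results are intersected with the living-people category, which should already restrict them to articles about humans.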

Percentages are calculated as percentage of the number of articles identified as (male + female), rather than of all BLPs with a recorded gender value or simply of all BLPs. There are good arguments for either of the first two, but the former is simpler (some of my “any recorded gender value” queries timed out) and also consistent with the 2019 analysis.
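As a concrete illustration of the two kinds of percentage used in this post, here is a small Python sketch. The function names are hypothetical; the only real figures are the 345 and 170,000 quoted earlier for the 1980s cohort, and 170,000 is itself a rounded cohort size.

```python
def pct_women_of_binary(female: int, male: int) -> float:
    """Women as a percentage of (male + female) articles, as used for the main figures."""
    total = female + male
    if total == 0:
        raise ValueError("cohort has no male or female articles")
    return 100.0 * female / total


def pct_other_gender(other: int, cohort_size: int) -> float:
    """Articles with a recorded gender other than male/female, as a percentage of the cohort."""
    if cohort_size == 0:
        raise ValueError("cohort is empty")
    return 100.0 * other / cohort_size


# 1980s cohort: 345 people out of roughly 170,000.
other_1980s = pct_other_gender(345, 170_000)
```

That works out at about 0.2% of the cohort, which is why leaving this group out of the denominator makes very little difference to the headline percentages.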

A thornier problem comes from the sports element. There are a number of potential ways we could determine “sportiness”. The easiest option would be to use Wikidata occupation and look for something that indicates their occupation is some form of athlete, or that indicates a sport being played. The problem is that this is too all-encompassing, and would give us people who played sports but for whom it is not their main claim to fame. An alternative is to use the Wikipedia article categorisation hierarchy, but this is very complex and deep, making the queries very difficult to work with. The category hierarchy includes a number of surprise crosslinks and loops, meaning that deep queries tend to get very confusing results, or just time out.

The approach I eventually went with was to use Wikipedia’s infoboxes – the little standardised box on the top right of a page. There is a wide range of distinct infobox templates tailored to specific fields; each article usually only displays one, but can embed elements of others to bring in secondary data. If we look for articles using one of the 77(!) distinct sports infoboxes (report), we can conclude their subjects probably had a significant sporting career. If an article does not contain one, we can infer that its subject does not have a sporting background.

But then we need to consider people with significant sports and non-sports careers. For example, the biographies of both Seb Coe and Tanni Grey-Thompson use the “infobox officeholder” to reflect their careers in Parliament being more recent, but it is set up to embed a sports infobox towards the end. This would entail them being counted as athletes by our infobox method. This is probably correct for those two, but there are no doubt people out there where we would draw the line differently. (To stay in the UK political theme: how about Henry McLeish? His athletic career on its own would probably just qualify for a Wikipedia biography, but it is, perhaps, a bit of a footnote compared to being First Minister…)
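The infobox test, including the embedded case, can be sketched in Python as follows. This is not how the figures here were produced (those came from petscan template queries); it is a simplified illustration that scans raw wikitext. The two template names are a small illustrative subset, not the full list of 77, and the sample wikitext and its parameter names are invented.

```python
import re

# Illustrative subset only; the real list has 77 sports infobox templates.
SPORTS_INFOBOXES = {"infobox sportsperson", "infobox football biography"}

# Captures the template name after "{{", up to the first "|", brace or newline.
INFOBOX_NAME = re.compile(r"\{\{\s*(infobox[^|{}\n]*)", re.IGNORECASE)


def infobox_names(wikitext: str) -> set[str]:
    """Return the normalised names of all infobox templates, embedded ones included."""
    names = set()
    for match in INFOBOX_NAME.finditer(wikitext):
        # Normalise underscores, case and stray whitespace.
        name = " ".join(match.group(1).replace("_", " ").split()).lower()
        names.add(name)
    return names


def counts_as_athlete(wikitext: str) -> bool:
    """True if any infobox on the page, top-level or embedded, is a sports one."""
    return bool(infobox_names(wikitext) & SPORTS_INFOBOXES)


# Invented example: an officeholder infobox with a sports infobox embedded in it.
sample = (
    "{{Infobox officeholder\n"
    "| name = Example Person\n"
    "| module = {{Infobox sportsperson\n"
    "| embed = yes\n"
    "}}\n"
    "}}"
)
```

Because the check looks at every infobox on the page, a politician with an embedded sports infobox is counted as an athlete, which is the behaviour described for Coe and Grey-Thompson.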

So, here is another complication. How reliable is our assumption that an athlete has a sports infobox, and that non-athletes don’t? If it’s broadly true, great, our numbers hold up. If it’s not, and if it fails in some kind of systematic way, there might be a more complex skew. I believe that for modern athletes, it’s reasonably safe to assume that infoboxes are nearly ubiquitous; there are groups of articles where they’re less common, but this isn’t one of them. However, I can’t say for sure; it’s not an area I’ve worked intensively in.

Finally, we have the issue of dates. We’ve based the calculation on Wikipedia categories. Wikipedia birth/death categories are pretty reliably used where that data is known. However, about 150k (14%) of our BLP articles are marked “year of birth unknown”, and these are disproportionately female (35.4%).

What effect do these factors have?

Counting the stats as percentage of M+F rather than percentage of all people with recorded gender could be argued either way, but the numbers involved are quite low and do not change the overall pattern of the results.

The infobox question is more complicated. It is possible that we are not picking up all athletes, because some do not have infoboxes. On the other hand, it is possible that we are being more expansive in counting people as athletes because they have a “secondary” infobox along the lines of Coe & Grey-Thompson above. The problem there is defining where we draw the line, and what level of “other significance” stops someone being counted. That feels like a very subjective threshold and hard to test for automatically. The infobox test is certainly more conservative than a Wikidata-based one, at least.

And for dates, hmm. We know that the articles that do not report an age are disproportionately female (35% vs the BLP average of 25%), but also that they are even more disproportionately “not athletes” (7% athletes vs the BLP average of 43%). There are also a lot of articles that don’t report an age; around 14% of all BLPs.

This one probably introduces the biggest question mark here. Depending on how that 14% breaks down, it could change the totals for the year-by-year cohorts; but there’s not really much we can do at the moment to work that out.

Anecdotally, I suspect that they are more likely to skew younger rather than being evenly distributed over time, but there is very little to go on here. However, I feel it is unlikely they would be distributed in such a way as to counteract the overall conclusions – this would require, for example, the female ones being predominantly shifted into older groups and the male ones into younger groups. It’s possible, but I don’t see an obvious mechanism to cause that.
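To make that reasoning a little more concrete, here is a rough Python sketch of what happens to one cohort when undated articles are added to it. All the counts are hypothetical; the only thing taken from the real data is that the undated pool is about 35% female, against roughly 25% for a typical dated cohort.

```python
def pct_women_after_adding(women: int, men: int, extra_women: int, extra_men: int) -> float:
    """Share of women in a cohort after adding previously undated articles to it."""
    total_women = women + extra_women
    total = women + men + extra_women + extra_men
    return 100.0 * total_women / total


# Hypothetical dated cohort: 25,000 women and 75,000 men (25% women).
base = pct_women_after_adding(25_000, 75_000, 0, 0)

# Scenario 1: 10,000 undated articles land here in the pool's own 35/65 mix.
proportional = pct_women_after_adding(25_000, 75_000, 3_500, 6_500)

# Scenario 2: only the men from that batch land here (the women went elsewhere).
men_only = pct_women_after_adding(25_000, 75_000, 0, 6_500)
```

In the first scenario the cohort’s share of women rises slightly, to just under 26%; it only falls (to about 23.5%) in the second, where the undated men and women end up in different decades, which is the kind of sorting I can’t see an obvious mechanism for.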

[Edit 4/8/23 – tweaked to confirm that these are English Wikipedia only figures, after a reminder from Hilda. I would be very interested in seeing similar data for other projects, but the methodology might be tricky to translate – eg French and German do not have an equivalent category for indexing living people, and different projects may have quite different approaches for applying infoboxes.]
