Gender and deletion on Wikipedia

So, a really interesting question cropped up this weekend:

I’m trying to find out how many biographies of living persons exist on the English Wikipedia, and what kind of data we have on them. In particular, I’m looking for the gender breakdown. I’d also like to know when they were created; average length; and whether they’ve been nominated for deletion.

This is, of course, something that’s being discussed a lot right now; there is a lot of emerging push-back against the excellent work being done to try and add more notable women to Wikipedia, and one particular deletion debate got a lot of attention in the past few weeks, so it’s on everyone’s mind. And, instinctively, it seems plausible that there is a bias in the relative frequency of nomination for deletion – can we find if it’s there?

My initial assumption was, huh, I don’t think we can do that with Wikidata. Then I went off and thought about it for a bit more, and realised we could get most of the way there of it with some inferences. Here’s the results, and how I got there. Thanks to Sarah for prompting the research!

(If you want to get the tl;dr summary – yes, there is some kind of difference in the way older male vs female articles have been involved with the deletion process, but exactly what that indicates is not obvious without data I can’t get at. The difference seems to have mostly disappeared for articles created in the last couple of years.)

Statistics on the gender breakdown of BLPs

As of a snapshot of yesterday morning, 5 May 2019, the English Wikipedia had 906,720 articles identified as biographies of living people (BLPs for short). Of those, 697,402 were identified as male by Wikidata, 205,117 as female, 2464 had some other value for gender, 1220 didn’t have any value for gender (usually articles on groups of people, plus some not yet updated), and 517 simply didn’t have a connected Wikidata item (yet). Of those with known gender, it breaks down as 77.06% male, 22.67% female, and 0.27% some other value. (Because of the limits of the query, I didn’t try and break down those in any more detail.)

This is, as noted, only articles about living people; across all 1,626,232 biographies in the English Wikipedia with a gender known to Wikidata, it’s about 17.83% female, 82.13% male, and 0.05% some other value. I’ll be sticking to data on living people throughout this post, but it’s interesting to compare the historic information.

So, how has that changed over time?

BLPs by gender and date of creation

This graph shows all existing BLPs, broken down by gender and (approximately) when they were created. As can be seen, and as might be expected, the gap has closed a bit over time.

Percentage of BLPs which are female over time

Looking at the ratio over time (expressed here as %age of total male+female), the relative share of female BLPs was ~20% in 2009. In late 2012, the rate of creation of female BLPs kicked up a gear, and from then on it’s been noticeably above the long-term average (almost hitting 33% in late 2017, but dropping back since then). This has driven the overall share steadily and continually upwards, now at 22.7% (as noted above).

Now the second question, do the article lengths differ by gender? Indeed they do, by a small amount.

BLPs by current article size and date of creation

Female BLPs created at any time since 2009 are slightly longer on average than male ones of similar age, with only a couple of brief exceptions; the gap may be widening over the past year but it’s maybe too soon to say for sure. Average difference is about 500 bytes or a little under 10% of mean article size – not dramatic but probably not trivial either. (Pre-2009 articles, not shown here, are about even on average)

Note that this is raw bytesize – actual prose size will be smaller, particularly if an article is well-referenced; a single well-structured reference can be a few hundred characters. It’s also the current article size, not size at creation, hence why older articles tend to be longer – they’ve had more time to grow. It’s interesting to note that once they’re more than about five years old they seem to plateau in average length.

Finally, the third question – have they been nominated for deletion? This was really interesting.

Percentage of BLPs which have previously been to AFD, by date of creation and gender

So, first of all, some caveats. This only identifies articles which go through the structured “articles for deletion” (AFD) process – nomination, discussion, decision to keep or delete. (There are three deletion processes on Wikipedia; the other two are more lightweight and do not show up in an easily traceable form). It also cannot specifically identify if that exact page was nominated for deletion, only that “an article with exactly the same page name has been nominated in the past” – but the odds are good they’re the same if there’s a match. It will miss out any where the article was renamed after the deletion discussion, and, most critically, it will only see articles that survived deletion. If they were deleted, I won’t be able to see them in this analysis, so there’s an obvious survivorship bias limiting what conclusions we can draw.

Having said all that…

Female BLPs created 2009-16 appear noticeably more likely than male BLPs of equivalent age to have been through a deletion discussion at some point in their lives (and, presumably, all have been kept). Since 2016, this has changed and the two groups are about even.

Alongisde this, there is a corresponding drop-off in the number of articles created since 2016 which have associated deletion discussions. My tentative hypothesis is that articles created in the last few years are generally less likely to be nominated for deletion, perhaps because the growing use of things like the draft namespace (and associated reviews) means that articles are more robust when first published. Conversely, though, it’s possible that nominations continue at the same rate, but the deletion process is just more rigorous now and a higher proportion of those which are nominated get deleted (and so disappear from our data). We can’t tell.

(One possible explanation that we can tentatively dismiss is age – an article can be nominated at any point in its lifespan so you would tend to expect a slowly increasing share over time, but I would expect the majority of deletion nominations come in the first weeks and then it’s pretty much evenly distributed after that. As such, the drop-off seems far too rapid to be explained by just article age.)

What we don’t know is what the overall nomination for deletion rate, including deleted articles, looks like. From our data, it could be that pre-2016 male and female articles are nominated at equal rates but more male articles are deleted; or it could be that pre-2016 male and female articles are equally likely to get deleted, but the female articles are nominated more frequently than they should be. Either of these would cause the imbalance. I think this is very much the missing piece of data and I’d love to see any suggestions for how we can work it out – perhaps something like trying to estimate gender from the names of deleted articles?

Update: Magnus has run some numbers on deleted pages, doing exactly this – inferring gender from pagenames. Of those which were probably a person, ~2/3 had an inferred gender, and 23% of those were female. This is a remarkably similar figure to the analysis here (~23% of current BLPs female; ~26% of all BLPs which have survived a deletion debate female)

So in conclusion

  • We know the gender breakdown: skewed male, but growing slowly more balanced over time, and better for living people than historical ones.
  • We know the article lengths; slightly longer for women than men for recent articles, about equal for those created a long time ago.
  • We know that there is something different about the way male and female biographies created before ~2017 experience the deletion process, but we don’t have clear data to indicate exactly what is going on, and there are multiple potential explanations.
  • We also know that deletion activity seems to be more balanced for articles in both groups created from ~2017 onwards, and that these also have a lower frequency of involvement with the deletion process than might have been expected. It is not clear what the mechanism is here, or if the two factors are directly linked.

How can you extract this data? (Yes, this is very dull)

The first problem was generating the lists of articles and their metadata. The English Wikipedia category system lets us identify “living people”, but not gender; Wikidata lets us identify gender (property P21), but not reliably “living people”. However, we can creatively use the petscan tool to get the intersection of a SPARQL gender query + the category. Instructing it to explicitly use Wikipedia (“enwiki” in other sources > manual list) and give output as a TSV – then waiting for about fifteen minutes – leaves you with a nice clean data dump. Thanks, Magnus!

(It’s worth noting that you can get this data with any characteristic indexed by Wikidata, or any characteristic identifiable through the Wikipedia category schema, but you will need to run a new query for each aspect you want to analyse – the exported data just has article metadata, none of the Wikidata/category information)

The exported files contain three things that are very useful to us: article title, pageid, and length. I normalised the files like so:

grep [0-9] enwiki_blp_women_from_list.tsv | cut -f 2,3,5 > women-noheader.tsv

This drops the header line (it’s the only one with no numeric characters) and extracts only the three values we care about (and conveniently saves about 20MB).

This gives us two of the things we want (age and size) but not deletion data. For that, we fall back on inference. Any article that is put through the AFD process gets a new subpage created at “Wikipedia:Articles for deletion/PAGENAME”. It is reasonable to infer that if an article has a corresponding AFD subpage, it’s probably about that specific article. This is not always true, of course – names get recycled, pages get moved – but it’s a reasonable working hypothesis and hopefully the errors are evenly distributed over time. I’ve racked my brains to see if I could anticipate a noticeable difference here by gender, as this could really complicate the results, but provisionally I think we’re okay to go with it.

To find out if those subpages exist, we turn to the enwiki dumps. Specifically, we want “enwiki-latest-all-titles.gz” – which, as it suggests, is a simple file listing all page titles on the wiki. Extracted, it comes to about 1GB. From this, we can extract all the AFD subpages, as so:

grep "Articles_for_deletion/" enwiki-latest-all-titles | cut -f 2 | sort | uniq | cut -f 2 -d / | sort | uniq > afds

This extracts all the AFD subpages, removes any duplicates (since eg talkpages are listed here as well), and sorts the list alphabetically. There are about 424,000 of them.

Going back to our original list of articles, we want to bin them by age. To a first approximation, pageid is sequential with age – it’s assigned when the page is first created. There are some big caveats here; for example, a page being created as a redirect and later expanded will have the ID of its initial creation. Pages being deleted and recreated may get a new ID, pages which are merged may end up with either of the original IDs, and some complicated page moves may end up with the original IDs being lost. But, for the majority of pages, it’ll work out okay.

To correlate pageID to age, I did a bit of speculative guessing to find an item created on 1 January and 1 July every year back to 2009 (eg pageid 43190000 was created at 11am on 1 July 2014). I could then use these to extract the articles corresponding to each period as so:

...
awk -F '\t' '$2 >= 41516000 && $2 < 43190000' < men-noheader.tsv > bins/2014-1-M
awk -F '\t' '$2 >= 43190000 && $2 < 44909000' < men-noheader.tsv > bins/2014-2-M
...

This finds all items with a pageid (in column #2 of the file) between the specified values, and copies them into the relevant bin. Run once for men and once for women.

Then we can run a short report, along these lines (the original had loops in it):

  cut -f 1 bins/2014-1-M | sort > temp-M
  echo -e 2014-1-M"\tM\t"`cat bins/2014-1-M | wc -l`"\t"`awk '{ total += $3; count++ } END { print total/count }' bins/2014-1-M`"\t"`comm -1 -2 temp-M afds | wc -l` >> report.tsv

This adds a line to the file report.tsv with (in order) the name of the bin, the number of entries in it, the mean value of the length column, and a count of the number which also match names in the afds file. (The use of the temp-M file is to deal with the fact that the comm tool needs properly sorted input).

After that, generating the data is lovely and straightforward – drop the report into a spreadsheet and play around with it.

One thought on “Gender and deletion on Wikipedia”

  1. This is brilliant. It supplies facts in an area where many are speculating conspiracies. This makes me optimistic for the Wikipedia and Women in Red editors.

Leave a Reply

Your email address will not be published. Required fields are marked *