Gender and deletion on Wikipedia

So, a really interesting question cropped up this weekend:

I’m trying to find out how many biographies of living persons exist on the English Wikipedia, and what kind of data we have on them. In particular, I’m looking for the gender breakdown. I’d also like to know when they were created; average length; and whether they’ve been nominated for deletion.

This is, of course, something that’s being discussed a lot right now; there is a lot of emerging push-back against the excellent work being done to try and add more notable women to Wikipedia, and one particular deletion debate got a lot of attention in the past few weeks, so it’s on everyone’s mind. And, instinctively, it seems plausible that there is a bias in the relative frequency of nomination for deletion – can we find if it’s there?

My initial assumption was, huh, I don’t think we can do that with Wikidata. Then I went off and thought about it for a bit more, and realised we could get most of the way there with some inferences. Here are the results, and how I got there. Thanks to Sarah for prompting the research!

(If you want to get the tl;dr summary – yes, there is some kind of difference in the way older male vs female articles have been involved with the deletion process, but exactly what that indicates is not obvious without data I can’t get at. The difference seems to have mostly disappeared for articles created in the last couple of years.)

Statistics on the gender breakdown of BLPs

As of a snapshot of yesterday morning, 5 May 2019, the English Wikipedia had 906,720 articles identified as biographies of living people (BLPs for short). Of those, 697,402 were identified as male by Wikidata, 205,117 as female, 2464 had some other value for gender, 1220 didn’t have any value for gender (usually articles on groups of people, plus some not yet updated), and 517 simply didn’t have a connected Wikidata item (yet). Of those with known gender, it breaks down as 77.06% male, 22.67% female, and 0.27% some other value. (Because of the limits of the query, I didn’t try and break down those in any more detail.)
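For anyone wanting to check the arithmetic, here is a minimal sketch that recomputes the percentages from the three counts quoted above. The `share` helper is just an illustrative name, not part of the original analysis.

```shell
# Counts quoted in the paragraph above (snapshot of 5 May 2019).
male=697402
female=205117
other=2464

# share NUMERATOR DENOMINATOR: print a percentage to two decimal places.
# (Hypothetical helper, named here for illustration only.)
share() {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * n / d }'
}

# BLPs with some gender value recorded on Wikidata.
known=$((male + female + other))

echo "known:  $known"
echo "male:   $(share "$male" "$known")%"
echo "female: $(share "$female" "$known")%"
echo "other:  $(share "$other" "$known")%"
```

The denominator deliberately leaves out the 1220 items with no gender value and the 517 articles with no Wikidata item, which is what "of those with known gender" means here.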

This is, as noted, only articles about living people; across all 1,626,232 biographies in the English Wikipedia with a gender known to Wikidata, it’s about 17.83% female, 82.13% male, and 0.05% some other value. I’ll be sticking to data on living people throughout this post, but it’s interesting to compare the historic information.

So, how has that changed over time?

BLPs by gender and date of creation

This graph shows all existing BLPs, broken down by gender and (approximately) when they were created. As can be seen, and as might be expected, the gap has closed a bit over time.

Percentage of BLPs which are female over time

Looking at the ratio over time (expressed here as %age of total male+female), the relative share of female BLPs was ~20% in 2009. In late 2012, the rate of creation of female BLPs kicked up a gear, and from then on it’s been noticeably above the long-term average (almost hitting 33% in late 2017, but dropping back since then). This has driven the overall share steadily and continually upwards, now at 22.7% (as noted above).

Now the second question, do the article lengths differ by gender? Indeed they do, by a small amount.

BLPs by current article size and date of creation

Female BLPs created at any time since 2009 are slightly longer on average than male ones of similar age, with only a couple of brief exceptions; the gap may be widening over the past year but it’s maybe too soon to say for sure. Average difference is about 500 bytes or a little under 10% of mean article size – not dramatic but probably not trivial either. (Pre-2009 articles, not shown here, are about even on average.)

Note that this is raw bytesize – actual prose size will be smaller, particularly if an article is well-referenced; a single well-structured reference can be a few hundred characters. It’s also the current article size, not size at creation, hence why older articles tend to be longer – they’ve had more time to grow. It’s interesting to note that once they’re more than about five years old they seem to plateau in average length.

Finally, the third question – have they been nominated for deletion? This was really interesting.

Percentage of BLPs which have previously been to AFD, by date of creation and gender

So, first of all, some caveats. This only identifies articles which go through the structured “articles for deletion” (AFD) process – nomination, discussion, decision to keep or delete. (There are three deletion processes on Wikipedia; the other two are more lightweight and do not show up in an easily traceable form). It also cannot specifically identify if that exact page was nominated for deletion, only that “an article with exactly the same page name has been nominated in the past” – but the odds are good they’re the same if there’s a match. It will miss out any where the article was renamed after the deletion discussion, and, most critically, it will only see articles that survived deletion. If they were deleted, I won’t be able to see them in this analysis, so there’s an obvious survivorship bias limiting what conclusions we can draw.

Having said all that…

Female BLPs created 2009-16 appear noticeably more likely than male BLPs of equivalent age to have been through a deletion discussion at some point in their lives (and, presumably, all have been kept). Since 2016, this has changed and the two groups are about even.

Alongside this, there is a corresponding drop-off in the number of articles created since 2016 which have associated deletion discussions. My tentative hypothesis is that articles created in the last few years are generally less likely to be nominated for deletion, perhaps because the growing use of things like the draft namespace (and associated reviews) means that articles are more robust when first published. Conversely, though, it’s possible that nominations continue at the same rate, but the deletion process is just more rigorous now and a higher proportion of those which are nominated get deleted (and so disappear from our data). We can’t tell.

(One possible explanation that we can tentatively dismiss is age – an article can be nominated at any point in its lifespan so you would tend to expect a slowly increasing share over time, but I would expect the majority of deletion nominations come in the first weeks and then it’s pretty much evenly distributed after that. As such, the drop-off seems far too rapid to be explained by just article age.)

What we don’t know is what the overall nomination for deletion rate, including deleted articles, looks like. From our data, it could be that pre-2016 male and female articles are nominated at equal rates but more male articles are deleted; or it could be that pre-2016 male and female articles are equally likely to get deleted, but the female articles are nominated more frequently than they should be. Either of these would cause the imbalance. I think this is very much the missing piece of data and I’d love to see any suggestions for how we can work it out – perhaps something like trying to estimate gender from the names of deleted articles?

Update: Magnus has run some numbers on deleted pages, doing exactly this – inferring gender from pagenames. Of those which were probably a person, ~2/3 had an inferred gender, and 23% of those were female. This is a remarkably similar figure to the analysis here (~23% of current BLPs female; ~26% of all BLPs which have survived a deletion debate female).

So in conclusion

  • We know the gender breakdown: skewed male, but growing slowly more balanced over time, and better for living people than historical ones.
  • We know the article lengths; slightly longer for women than men for recent articles, about equal for those created a long time ago.
  • We know that there is something different about the way male and female biographies created before ~2017 experience the deletion process, but we don’t have clear data to indicate exactly what is going on, and there are multiple potential explanations.
  • We also know that deletion activity seems to be more balanced for articles in both groups created from ~2017 onwards, and that these also have a lower frequency of involvement with the deletion process than might have been expected. It is not clear what the mechanism is here, or if the two factors are directly linked.

How can you extract this data? (Yes, this is very dull)

The first problem was generating the lists of articles and their metadata. The English Wikipedia category system lets us identify “living people”, but not gender; Wikidata lets us identify gender (property P21), but not reliably “living people”. However, we can creatively use the petscan tool to get the intersection of a SPARQL gender query + the category. Instructing it to explicitly use Wikipedia (“enwiki” in other sources > manual list) and give output as a TSV – then waiting for about fifteen minutes – leaves you with a nice clean data dump. Thanks, Magnus!

(It’s worth noting that you can get this data with any characteristic indexed by Wikidata, or any characteristic identifiable through the Wikipedia category schema, but you will need to run a new query for each aspect you want to analyse – the exported data just has article metadata, none of the Wikidata/category information)

The exported files contain three things that are very useful to us: article title, pageid, and length. I normalised the files like so:

grep '[0-9]' enwiki_blp_women_from_list.tsv | cut -f 2,3,5 > women-noheader.tsv

This drops the header line (it’s the only one with no numeric characters) and extracts only the three values we care about (and conveniently saves about 20MB).

This gives us two of the things we want (age and size) but not deletion data. For that, we fall back on inference. Any article that is put through the AFD process gets a new subpage created at “Wikipedia:Articles for deletion/PAGENAME”. It is reasonable to infer that if an article has a corresponding AFD subpage, it’s probably about that specific article. This is not always true, of course – names get recycled, pages get moved – but it’s a reasonable working hypothesis and hopefully the errors are evenly distributed over time. I’ve racked my brains to see if I could anticipate a noticeable difference here by gender, as this could really complicate the results, but provisionally I think we’re okay to go with it.

To find out if those subpages exist, we turn to the enwiki dumps. Specifically, we want “enwiki-latest-all-titles.gz” – which, as it suggests, is a simple file listing all page titles on the wiki. Extracted, it comes to about 1GB. From this, we can extract all the AFD subpages, as so:

grep "Articles_for_deletion/" enwiki-latest-all-titles | cut -f 2 | sort | uniq | cut -f 2 -d / | sort | uniq > afds

This extracts all the AFD subpages, removes any duplicates (since eg talkpages are listed here as well), and sorts the list alphabetically. There are about 424,000 of them.
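One thing to be aware of with that pipeline is that taking only the second slash-delimited field will truncate any page name which itself contains a slash. The sketch below is a variant I did not use for the numbers in this post, so treat it as an assumption about what a more careful version might look like. The function name is made up, and the forced byte-order sort is there because comm is fussy about sort order.

```shell
# extract_afd_titles FILE: list article names that have an AFD subpage.
# FILE is assumed to be the extracted all-titles dump: namespace <TAB> title.
# Differences from the one-liner above (assumptions, not the original method):
#   - 'cut -f 2-' keeps everything after the first slash, so a title that
#     itself contains a slash is not cut short;
#   - LC_ALL=C forces a byte-order sort, which is what comm will expect
#     if its other input is sorted the same way.
extract_afd_titles() {
  grep 'Articles_for_deletion/' "$1" \
    | cut -f 2 \
    | cut -d / -f 2- \
    | LC_ALL=C sort -u
}

# Tiny made-up sample, to show the effect.
sample=$(mktemp)
printf '4\tArticles_for_deletion/Jane_Doe\n'       > "$sample"
printf '5\tArticles_for_deletion/Jane_Doe\n'      >> "$sample"
printf '4\tArticles_for_deletion/AC/DC_tribute\n' >> "$sample"
printf '0\tJane_Doe\n'                            >> "$sample"
extract_afd_titles "$sample"
rm -f "$sample"
```

Note that this would also keep things like dated log subpages as full names rather than collapsing them, so the total count would not be directly comparable with the figure above.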

Going back to our original list of articles, we want to bin them by age. To a first approximation, pageid is sequential with age – it’s assigned when the page is first created. There are some big caveats here; for example, a page being created as a redirect and later expanded will have the ID of its initial creation. Pages being deleted and recreated may get a new ID, pages which are merged may end up with either of the original IDs, and some complicated page moves may end up with the original IDs being lost. But, for the majority of pages, it’ll work out okay.

To correlate pageID to age, I did a bit of speculative guessing to find an item created on 1 January and 1 July every year back to 2009 (eg pageid 43190000 was created at 11am on 1 July 2014). I could then use these to extract the articles corresponding to each period as so:
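If you would rather not guess by hand, one possible way to check a candidate pageid is to ask the MediaWiki API for the timestamp of the earliest revision of that page. This is a sketch rather than what I actually did. The function names and the User-Agent string are placeholders, the grep is a crude stand-in for a real JSON parser, and the earliest surviving revision is only a proxy for creation date (imports and history merges can confuse it).

```shell
# creation_query_url PAGEID: build an API query for the timestamp of the
# earliest revision of the page with that ID (rvdir=newer lists oldest first).
creation_query_url() {
  printf 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=timestamp&rvdir=newer&rvlimit=1&pageids=%s\n' "$1"
}

# extract_timestamp: read API JSON on stdin, print the first timestamp value.
# Assumes the compact JSON layout the API normally returns.
extract_timestamp() {
  grep -oE '"timestamp":"[^"]+"' | head -n 1 | cut -d '"' -f 4
}

# creation_timestamp PAGEID: needs curl and network access.
# The User-Agent is a placeholder; replace it with your own details.
creation_timestamp() {
  curl -s -A 'pageid-date-sketch/0.1 (placeholder contact)' \
    "$(creation_query_url "$1")" | extract_timestamp
}
```

Calling `creation_timestamp 43190000` should then print a single ISO timestamp, which you can compare against the date you were aiming for and adjust the pageid up or down.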

...
awk -F '\t' '$2 >= 41516000 && $2 < 43190000' < men-noheader.tsv > bins/2014-1-M
awk -F '\t' '$2 >= 43190000 && $2 < 44909000' < men-noheader.tsv > bins/2014-2-M
...

This finds all items with a pageid (in column #2 of the file) between the specified values, and copies them into the relevant bin. Run once for men and once for women.
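Rather than typing out one awk line per bin, the same thing can be driven from a small table of boundaries. This is a rough sketch along those lines: `bounds.tsv` and the function name are hypothetical, and only the two 2014 boundaries quoted above are real values.

```shell
# bin_by_pageid INPUT.tsv BOUNDS.tsv SUFFIX OUTDIR
# INPUT.tsv:  title <TAB> pageid <TAB> length (as produced earlier).
# BOUNDS.tsv: hypothetical helper file, one bin per line:
#             label <TAB> lowest pageid (inclusive) <TAB> upper limit (exclusive)
bin_by_pageid() {
  input=$1; bounds=$2; suffix=$3; outdir=$4
  tab=$(printf '\t')
  mkdir -p "$outdir"
  while IFS=$tab read -r label low high; do
    # Same test as the hand-written lines: low <= pageid < high.
    awk -F '\t' -v low="$low" -v high="$high" \
      '$2 >= low && $2 < high' "$input" > "$outdir/$label-$suffix"
  done < "$bounds"
}

# Example use (the two boundaries are the ones quoted above):
#   printf '2014-1\t41516000\t43190000\n2014-2\t43190000\t44909000\n' > bounds.tsv
#   bin_by_pageid men-noheader.tsv bounds.tsv M bins
```

Keeping the boundaries in one file means the male and female runs are guaranteed to use identical bins.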

Then we can run a short report, along these lines (the original had loops in it):

  cut -f 1 bins/2014-1-M | sort > temp-M
  echo -e 2014-1-M"\tM\t"`cat bins/2014-1-M | wc -l`"\t"`awk '{ total += $3; count++ } END { print total/count }' bins/2014-1-M`"\t"`comm -1 -2 temp-M afds | wc -l` >> report.tsv

This adds a line to the file report.tsv with (in order) the name of the bin, the gender, the number of entries in it, the mean value of the length column, and a count of the number which also match names in the afds file. (The use of the temp-M file is to deal with the fact that the comm tool needs properly sorted input.)
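For what it's worth, a looped version might look something like the sketch below. It is a reconstruction under assumptions rather than the script I actually ran. The function name and the F suffix for the female bins are invented, and it assumes the afds file was sorted in byte order (`LC_ALL=C sort`) so that comm sees both inputs ordered consistently.

```shell
# report_line BINFILE GENDER AFDS: print one TSV row for a bin:
# bin name, gender, number of articles, mean length, number with an AFD page.
# BINFILE columns are assumed to be title <TAB> pageid <TAB> length.
report_line() {
  bin=$1; gender=$2; afds=$3
  count=$(wc -l < "$bin" | tr -d ' ')
  # Mean of the length column; prints 0 for an empty bin instead of dividing by zero.
  mean=$(awk -F '\t' '{ total += $3 } END { if (NR) print total / NR; else print 0 }' "$bin")
  # Titles present in both the bin and the AFD list ("-" makes comm read stdin).
  matched=$(cut -f 1 "$bin" | LC_ALL=C sort | LC_ALL=C comm -12 - "$afds" | wc -l | tr -d ' ')
  printf '%s\t%s\t%s\t%s\t%s\n' "$(basename "$bin")" "$gender" "$count" "$mean" "$matched"
}

# Example loop over every bin (paths and the F suffix are illustrative):
#   for g in M F; do
#     for f in bins/*-"$g"; do report_line "$f" "$g" afds; done
#   done > report.tsv
```

Piping the sorted titles straight into comm avoids the temporary file, at the cost of being slightly harder to debug when something does not match.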

After that, generating the data is lovely and straightforward – drop the report into a spreadsheet and play around with it.

George Ernest Spero, the vanishing MP

As part of the ongoing Wikidata MPs project, I’ve come across a number of oddities – MPs who may or may not have been the same person, people who essentially disappear after they leave office, and so on. Tracking these down can turn into quite a complex investigation.

One such was George Ernest Spero, Liberal MP for Stoke Newington 1923-24, then Labour MP for Fulham West 1929-30. His career was cut short by his resignation in April 1930; shortly afterwards, he was declared bankrupt. Spero had already left the country for America, and nothing more was heard of him. The main ambiguity was when he died – various sources claimed either 1960 or 1976, but without it being clear which was more reliable, or any real details on what happened to him after 1930. In correspondence with Stephen Lees, who has been working on an incredibly useful comprehensive record of MPs’ death-dates, I did some work on it last year and eventually confirmed the 1960 date; I’ve just rediscovered the notes from this and since it was an interesting little mystery, thought I’d post them.

George Spero, MP and businessman

So, let’s begin with what we know about him up to the point at which he vanished.

George Ernest Spero was born in 1894. He began training at the Royal Dental Hospital in 1912, and served in the RNVR as a surgeon during the First World War. He had two brothers who also went into medicine; Samuel was a dentist in London (and apparently also went bankrupt, in 1933), while Leopold was a surgeon or physician (trained at St. Mary’s, RNVR towards the end of WWI, still in practice in the 1940s). All of this was reasonably straightforward to trace, although oddly George’s RNVR service records seem to be missing from the National Archives.

After the war, he married Rina Ansley (nee Rina Ansbacher, born 14 March 1902) in 1922; her father was a wealthy German-born stockbroker, resident in Park Lane, who had naturalised in 1918. They had two daughters, Rachel Anne (b. 1923) and Betty Sheila (b. 1928). After his marriage, Spero went into politics in Leicester, where he seems to have been living, and stood for Parliament in the 1922 general election. The Nottingham Journal described him as for “the cause of free, unfettered Liberalism … Democratic in conviction, he stands for the abolition of class differences and for the co-operation of capital and labour.” However, while this was well-tailored to appeal to the generally left-wing voters of Leicester West, and his war record was well-regarded, the moderate vote was split between the Liberal and National Liberal candidates, with Labour taking the seat.

The Conservative government held another election in 1923, aiming to strengthen a small majority (does this sound familiar?), and Spero – now back in London – contested Stoke Newington, then a safe Conservative seat, again as a left Liberal. With support from Labour, who did not contest the seat, Spero ran a successful campaign and unseated the sitting MP. He voted in support of the minority Labour government on a number of occasions, and was one of the small number of Liberal rebels who supported them in the final no-confidence vote. However, this was not enough to prevent Labour fielding a candidate against him in 1924; the Conservative candidate took 57% of the vote, with the rest split evenly between Labour and Liberal.

Spero drifted from the Liberals into the Labour Party, probably a more natural home for his politics, joining it in 1925. By the time of the next general election, in May 1929, he had become the party’s candidate for Fulham West, winning it from the Conservatives with 45% of the vote.

He was a moderately active Government backbencher for the next few months, including being sent as a visitor to Canada during the recess in September 1929, travelling with his wife. While overseas, she caused some minor amusement to the British papers after reporting the loss of a £6,000 pearl necklace – they were delighted to report this alongside “socialist MP”. He was last recorded voting in Hansard in December, and did not appear in 1930. In February and March he was paired for votes, with a newspaper report in early March stating that, around the start of the year, he had been advised to take a rest to avoid a complete nervous breakdown, and had gone to the South of France, but “hopes to return to Parliament before the month is out”. However, on 9th April he formally took the Chiltern Hundreds (it is interesting that a newspaper report suggested his local party would choose whether to accept the resignation).

However, things were moving quickly elsewhere. A case was brought against him in the High Court for £10,000, arising from his sale of a radio company in 1928-29. During the court hearing, at the end of May, it was discovered that a personal cheque for £4000 given by Spero to guarantee the company’s debts had been presented to his bank in October 1929, but was not honoured. He had at this point claimed to be suing the company for £20,000, buying six months’ legal delay, sold his furniture, and – apparently – left the country for America. Bankruptcy proceedings followed later that year (where he was again stated to be in America) and, unsurprisingly, his creditors seem to have received very little.

At this point, the British trail and the historic record draw to a gentle close. But what happened to him?

The National Portrait Gallery gave his death as 1960, while an entry in The Palgrave Dictionary of Anglo-Jewish History reported that they had traced his death to 1976 in Belgrade, Yugoslavia (where, as a citizen, it was registered with the US embassy). Unfortunately, it did not go into any detail about how they worked this out, and this just heightened the mystery – if it was true, how had a disgraced ex-MP ended up in Yugoslavia on a US passport three decades later? And, conversely, who was it that had died in 1960?

George Spears, immigrant and doctor

We know that Spero went to America in 1929-30; that much seemed to be a matter of common agreement. Conveniently, the American census was carried out in April 1930, and the papers are available. On 18 April, he was living with his family in Riverside Drive, upper Manhattan; all the names and ages line up, and Spero is given as a medical doctor, actively working. Clearly they were reasonably well off, as they had a live-in maid, and it seems to be quite a nice area.

In 1937, he petitioned for American citizenship in California, noting that he had lived there since March 1933. As part of the process, he formally notified that he intended to change his name to George Ernest Spears. (He also gave his birthdate as 2 March 1894, of which more later).

While we can be reasonably confident these are the same man due to the names and dates of the family, the match is very neatly confirmed by the fact that the citizenship papers have a photograph, which can be compared to an older newspaper one. There is fifteen years’ difference, but we can see the similarities between the prospective MP of 27 and the older man of 43.

George Spears, with the same family, then reappears in the 1940 census, back in Riverside Drive. He is now apparently practicing as an optician, and doing well – income upwards of $6000. Finally, we find a draft record for him living in Huntington, Long Island at some point in 1942. Note his signature here, which is visibly the same hand as in 1937, except “E. Spears” not “Ernest Spero”.

It is possible he reverted to his old name for a while – there are occasional appearances of a Dr. George Spero, optometrist, in the New York phone books between the 1940s and late 1950s. Not enough detail to be sure either way, though.

So at this point, we can trace Spero/Spears continually from 1930 to 1942. And then nothing, until on 7 January 1960, George E. Spears, born 2 March 1894, died in California. Some time later, in June 1976, George Spero, born 11 April 1894, died in Belgrade, Yugoslavia, apparently a US citizen. Which one was our man?

The former seemed more likely, but can we prove it? The death details come from an index, which gives a mother’s maiden name of “Robinson” – unfortunately the full certificate isn’t there and I did not feel up to trying to track down a paper Californian record to see what else it said.

If we return to the UK, we can find George Spero in the 1901 census in Dover, with his parents Isidore Sol [Solomon], a ‘dental mechanic’, and Rachel, maiden name unknown. The family later moved to London, the parents naturalised, Isidore died in 1925 – and probate goes to “George Ernest Spero, physician”, which seems to confirm that this is definitely the right family and not a different George Spero. The 1901 census notes that two of the older children were born in Dublin, so we can trace them in the Irish records. Here we have an “Israel S Spero” marrying Rachel Robinson in 1884, and a subsequent child born to Solomon Israel Spero and Rachel Spero nee Robinson. There are a few other Speros or Spiros appearing in Dublin, but none married around the right time, and none with such similar names. If Israel Solomon Spero is the same as Isidore Solomon Spero, this all ties up very neatly.

It leaves open the mystery, however, of who died in Yugoslavia. It seems likely this was a completely different man (who had not changed his name), but I have completely failed to trace anything about him. A pity – it would have been nice to definitively close off that line of enquiry.

What’s in a name? MPs and their preferred titles

A quick skim of the list of members in Hansard shows that there is no consistency in how it refers to politicians – some are Ms Jane Smith, others are merely John Brown.

My understanding – I welcome corrections! – is that this is ultimately personal choice. MPs are asked to choose how they are described in Hansard, with the option for a title. (I am not sure quite how this process works, but I assume there is a form; there always is.) This decision eventually percolates through to all of the data produced by Parliament. Of course, “personal choice” might just be “whatever they [or their assistant] happened to think was expected when filling out the form”, rather than a conscious and deliberate choice.

So, what do people do? A 2010 Commons factsheet says vaguely that “A few Members of both sexes have requested that no title be used (e.g. Jennifer Jones MP)” but a cursory glance down the list shows it’s more than “a few”. It turns out the full data is available from data.parliament.uk (as a big blob of XML) and so we can actually do some stats on this.

For current MPs, 145 of 650 have a preferred title (based on current data not past preferences). 33 Sir, 6 Dame, 17 Dr, 7 Ms, 11 Mrs, 71 Mr. Overall, 78% of MPs do not have a preferred title.

Of those, 44 are Labour, 93 Conservative, 3 LD, 2 SNP, 2 DUP, 1 Independent. So 17% of Labour MPs have a preferred title and 29% of Conservatives. Split by gender, 15% of women (32 of 209) list a title, versus 25% of men (113 of 441).

Of course, in some circumstances you don’t really have a choice – it would be a bit odd to say “I’d rather not be Sir X” once you’ve accepted a knighthood. Omitting anyone who’s a knight or dame, it becomes 20% of Conservatives & 15% of Labour having preferred titles, & overall 20% of men and 13% of women. The general proportions are broadly the same but the Labour-Conservative gap has narrowed a bit.

Doctors are an interesting question. Some PhD’ed MPs make a point of using their doctorate, but many others don’t. (They are in good company if so – the world’s most prominent doctorate-holding politician doesn’t, either). A couple of years ago, Chris Brooke tried to track down every current MP with a PhD. I took his list (with post-2017 updates), and a paper in the BMJ listing new medical MPs after the last election, and pulled together a total of 31 MPs who could be “Dr” – 21 have PhDs (or similar), 10 are medical doctors of various forms. (I have counted one with a D.Clin.Psy as “medical” rather than “like a PhD”). We’ve seen that only 17 people list themselves as Dr – who are they?

It turns out that every medical doctor uses “Dr” as their title, but only a third of PhDs (7 of 21). Two of the PhDs are “Sir”, but didn’t appear to use the title before getting knighthoods, and one sticks firmly to “Mr”; the rest are blank.

Across the parties, the Conservatives have six medics (all Dr) and 11 PhDs (three Dr); Labour have two medics (all Dr) and eight PhDs (four Dr). Not really enough to say anything confident about the difference between the parties.

Lastly, there’s the question of the change over time. Interestingly Paul Seaward noted that in the 1990s, the trend was for new doctoral MPs to use “Dr” for a few months and then quietly drop it.

The raw XML includes a note of the change of style by date since c. 2010 (presumably so that you can check you’re using a time-appropriate form if needed). It’s a bit noisy because it seems to have a lot of back-and-forth changes around election dates, which probably hints at changes not purely initiated by the Members themselves. Given how much this complicates the data, I’d be cautious about drawing any conclusions from it without much more careful examination, but perhaps in a few years’ time we can start saying things about whether Members’ titles are indeed becoming gradually less common, or if it turns out that mostly not using them is a fashion that comes and goes…

At-risk content on Flickr

Flickr has recently announced that it will be cutting back storage for its free accounts; as of early 2019, they will be limited to 1000 images, and any files beyond that limit will be progressively deleted.

Personally speaking, this surprised me a little bit, because I’d forgotten they’d removed the 200-image limit a few years ago. I am generally quite comfortable with the idea of them imposing a capacity limit and charging to go beyond that; it’s a fair way to price your service, and ultimately, it has to be paid for. But retroactive deletion is a bit unfortunate (especially if handled as an abrupt guillotine).

A few people raised the reasonable question – how much material is now at risk? A huge chunk of Wikimedia Commons material is sourced from Flickr (imported under free licenses) and, in addition, there is the reasonably successful Flickr Commons program for image hosting from cultural institutions.

Looking at the 115 Flickr Commons accounts shows that there are ~480,000 images from the 54 Pro accounts, and ~6,450,000 from the 61 non-Pro accounts. This seems a very dramatic difference, but on closer examination the British Library and Internet Archive (both non-Pro accounts) make up the vast majority of this, with ~6,350,000 images, mostly extracts from digitized book images. Flickr have since stated that Flickr Commons accounts will not be affected (it will be interesting to see if they now expand the program to include many of the other institutional accounts).

For “normal” users, it’s a bit harder to be sure. Flickr state that “the overwhelming majority of Pros have more than 1,000 photos on Flickr, and more than 97% of Free members have fewer than 1,000”. But from the Commons perspective, what we really want to know is “what proportion of the kind of thing we want to import is at risk?” Looking at this type of material is potentially quite interesting – it goes beyond the simple “Flickr as a personal photostore” and into “Flickr as a source of the cultural commons”.

So, analysis time! I pulled a list of all outbound links from Commons. For simplicity, I didn’t try to work out which of these were links from file pages as opposed to navigational/maintenance/user pages, but a quick sanity-check suggests that the vast majority of pages with outbound Flickr links are file descriptions – something like 99.7% – so it seems reasonable to just take the whole lot. I then extracted any Flickr userIDs I could find (eg 12403504@N02), either in links to author profiles or in image URLs themselves, and deduplicated the results so we ended up with a pile of userID-page pairs. The deduplication was necessary because a raw count of links can get quite confusing – some of the Internet Archive imports can have 20-30 links per file description page, and one of the British Library map maintenance pages has 9500…
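To make the extraction step a bit more concrete, here is a rough sketch of how the userID-page pairs could be pulled out. It assumes the link list has been flattened to "page title, tab, URL" rows, which is my simplification for the example rather than the real dump layout, and the function name is invented. It does not try to filter out group IDs.

```shell
# userid_page_pairs FILE: print deduplicated "userID <TAB> page" rows.
# FILE is assumed to hold "page title <TAB> outbound URL" rows (illustrative layout).
userid_page_pairs() {
  awk -F '\t' '{
    url = $2
    # Pull out every raw ID of the 12403504@N02 form, not just the first.
    while (match(url, /[0-9]+@N[0-9]+/)) {
      print substr(url, RSTART, RLENGTH) "\t" $1
      url = substr(url, RSTART + RLENGTH)
    }
  }' "$1" | LC_ALL=C sort -u   # one row per userID-page pair, however many links
}
```

URLs which only carry a human-readable name fall straight through without producing a row, which is the omission discussed in the next paragraph.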

One critical omission here is that I only took “raw” userIDs, not pretty human-readable ones (like “britishlibrary”); this was for practical reasons because I couldn’t easily link the two together. Many items are only linked with human-readable labels in the URLs, but ~96% of pages with an outbound Flickr link have at least one identifiable userID on them, so hopefully the remaining omissions won’t skew the results too much. (I also threw out any group IDs at this point to avoid confusion.)

I used this to run two analyses. One was the most frequently used userIDs – this was the top 5021 userIDs in our records, any ID that had links from approximately 80 pages or more. The other was a random sample of userIDs – 5000 randomly selected from the full set of ~79000. With each sample, I used the number of links on Commons as a proxy for the number of images (which seems fair enough).
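Both samples are easy enough to draw from a file of deduplicated "userID, tab, page" pairs. The sketch below shows one way to do it. The function names are made up, and the random draw relies on shuf from GNU coreutils, which may not be installed everywhere.

```shell
# pages_per_userid PAIRS: "userID <TAB> number of pages", busiest first.
# PAIRS is assumed to be deduplicated "userID <TAB> page" rows.
pages_per_userid() {
  cut -f 1 "$1" | LC_ALL=C sort | uniq -c \
    | sort -k1,1nr -k2,2 \
    | awk '{ print $2 "\t" $1 }'
}

# frequent_userids PAIRS MIN: userIDs linked from at least MIN pages.
frequent_userids() {
  pages_per_userid "$1" | awk -F '\t' -v min="$2" '$2 >= min { print $1 }'
}

# random_userids PAIRS N: N distinct userIDs drawn at random.
random_userids() {
  cut -f 1 "$1" | LC_ALL=C sort -u | shuf -n "$2"
}

# Example use (pairs.tsv is a hypothetical filename):
#   frequent_userids pairs.tsv 80  > top-ids
#   random_userids pairs.tsv 5000  > random-ids
```

Each userID gets equal weight in the random draw regardless of how many pages link to it, so the random sample is dominated by small accounts in a way the top list is not.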

Among the most frequently used source accounts, I found that 50% of images came from Pro accounts, 35% from “at risk” free accounts (more than 1000 images), 3% from “safe” free accounts (under 1000 images), 11% from Flickr Commons (both pro & non-Pro), and 1% were from accounts that are now deactivated or have no images.

In the random sample, I found a somewhat different spread – 60% of images were from Pro accounts, 32% from “at risk” free accounts, 6% from “safe” free accounts, 2% Flickr Commons, and 0.25% missing.

Update: an extended sample of all accounts with ten or more links (19374 in total) broadly resembles the top 5000 – 49% Pro accounts, 35% “at risk” free accounts, 4.5% “safe” free accounts, 10% Flickr Commons accounts, and 1.5% missing.

So, some quick conclusions –

  • Openly-licensed material gathered from Flickr is a significant source for Commons – something like 7.5m file description pages link to Flickr, almost certainly as a source, about 15% of all files
  • A substantial amount of material sourced from Flickr comes from a relatively small number of accounts, some institutional and some personal (this was the most common one in my random sample – 58k images)
  • A substantial portion of our heavily used Flickr source accounts are potentially at risk (note that it is not possible to tell how many were once Pro, have lapsed because why bother when it’s free, and may resume paying)
  • It is not as catastrophic as it might at first appear – the samples all suggest that only about a third of potential source images are at risk, once the Flickr Commons accounts are exempted from the limits – which seems to be the plan.
  • Having said that, the figure of 97% of individual free accounts having under a thousand images is no doubt accurate, but probably masks the sheer number of images in many of the larger accounts.

Some things that would potentially still be very interesting to know –

  • What proportion of freely-licensed images are from at-risk accounts?
  • What proportion of images in at-risk accounts are actually freely-licensed?
  • What proportion of freely-licensed images on Flickr have been (or could be) transferred over to Commons?
  • Are Flickr Commons accounts exempt from the size restriction? (As there are only ~150 of them, this seems plausible as a special case…)

First authors in Scopus

I’ve been doing a lot of bibliometric work recently. One task that I bounced off a few times before figuring out a practical approach was statistics on first authors; since I’ve finally figured it out, it seemed worth making a note of it. This uses Scopus and some very basic bash shell commands.

Let’s say we want to find out what proportion of papers from the University of York in 2007 had York-affiliated first authors. At first glance, this is a simple problem – Web of Science or Scopus will give you a list of affiliations for each paper, and as far as I know they’re listed in order of appearance; so download that, sort it, count all the ones that start with York, you’re done.

Unfortunately, you get people with double affiliations. Are there enough of them to be significant? For a small institution, quite possibly. It means we can’t use Web of Science, as their data – while wonderfully sorted and deduplicated against institutions – just says “this paper has these affiliations”.

Scopus, however, associates affiliations to authors. This means that you can reliably pick any given author for a paper and report what their affiliations are. (It also means that you can do some weighting – five authors from X and one from Y may not be the same as one from X and five from Y in your particular scenario).

Log into Scopus, run your search. Export the results, making sure to select “Affiliations” from the menu, and filetype CSV. It does not work well with sets of more than 2000 papers, so you may want to do some careful subdivision of your query. Thankfully, our example has 1848 results…

The result is a bit messy, because CSVs… well, they’re messy. Let’s convert it into a nice TSV. Create a file to contain this very short python script:

#!/usr/bin/env python

import csv, sys
csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))

Save it as csv2tsv, make it executable with chmod +x csv2tsv, then run cat scopus.csv | ./csv2tsv > scopus.tsv

Occasionally you can get papers with ludicrous numbers of authors, all of whom have their affiliations in a single field, and trying to import this into a spreadsheet gets messy – I think the record I had was something like 44k of text in a single name/affiliation field. So we’ll do this all from the command line.

First off, let’s check the file is the right length. wc -l scopus.tsv should give 1849 – one greater than the expected total because we still have a header row.

Now then, let’s look at the author/affiliation field. cut -f 15 scopus.tsv will extract this. The thing to note here is that the individual authors are separated by semicolons, while multiple affiliations for the same author are separated only by commas. So if we want to extract the first author, all we need to do is extract everything before the first semicolon – cut -f 15 scopus.tsv | cut -f 1 -d \;

Now, we want to find out how many of those match York. cut -f 15 scopus.tsv | cut -f 1 -d \; | grep "University of York" will find all those with the university name in the extracted affiliation; we can count the lines with cut -f 15 scopus.tsv | cut -f 1 -d \; | grep "University of York" | wc -l

Hey presto – 1200 exactly. Of our 1848 papers in 2007, 1200 (65%) had a York-based first author.
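The same count can be sketched in Python for anyone who would rather not chain cut and grep. This is an illustration under assumptions, not the method used above: the function name, the make_row helper and the sample rows are all made up, and it assumes the author/affiliation text sits in the fifteenth tab-separated field, as in the export described here.

```python
import csv
import io


def count_first_author_matches(tsv_text, needle="University of York", field=14):
    """Count rows whose first author's affiliation text contains `needle`.

    `field` is zero-based, so 14 corresponds to `cut -f 15`.
    """
    reader = csv.reader(io.StringIO(tsv_text), dialect="excel-tab")
    next(reader, None)  # skip the header row
    matches = 0
    for row in reader:
        if len(row) <= field:
            continue  # short or malformed row: nothing to inspect
        # Authors are separated by semicolons; keep only the first one,
        # which is what `cut -f 1 -d \;` does.
        first_author = row[field].split(";", 1)[0]
        if needle in first_author:
            matches += 1
    return matches


def make_row(authors_with_affiliations):
    """Hypothetical helper: pad a row so the text lands in field 15."""
    return "\t".join([""] * 14 + [authors_with_affiliations])


# Invented sample data, purely to exercise the function.
sample = "\n".join([
    "\t".join(f"col{i}" for i in range(1, 16)),  # header row
    make_row("Smith, J., University of York, UK; Jones, A., University of Leeds, UK"),
    make_row("Jones, A., University of Leeds, UK; Smith, J., University of York, UK"),
    make_row("Brown, B., Department of Biology, University of York, UK"),
])

print(count_first_author_matches(sample))
```

Only the first and third sample rows count, because in the second the York affiliation belongs to the second author.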

Wait, you cry, that sounds a pretty impressive number – but how many of those were single-authored papers? We can answer that, too. The first field simply separates all authors with commas, so any author field with commas must have multiple authors. cut -f 1 scopus.tsv | grep \, | wc -l – and we get 1511.

So of the 1848 papers York published that year, 337 were single-authored, and their sole author is by definition the first author. Taking those out of the 1200 leaves 863 of the remaining 1511 (57%) led by York authors.

And while we’re on that topic – how many authors were there, on average? Again, our friend the comma steps in. cut -f 1 scopus.tsv | sed 's/\,/\n/g' | wc -l switches every comma for a linebreak, so each author on each paper gets a new line, then counts the results. 8384 – except as you’ll remember we still have a header row, and it will be included in the results because there’s no grep to filter it out, so 8383. Across 1848 papers, that’s an average of 4.5 authors/paper.
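The comma-counting steps can also be sketched in Python, which makes the arithmetic easy to recheck. The function name and the three sample author strings are invented for illustration, and the sketch leans on the same assumption as the shell version: authors in the first field are separated by commas, so a field with no comma is a single-authored paper.

```python
def author_stats(author_fields):
    """Return (papers, multi_authored, total_authors) for a list of author strings."""
    papers = len(author_fields)
    # Any comma at all means more than one author.
    multi = sum(1 for field in author_fields if "," in field)
    # N commas separate N + 1 authors.
    total = sum(field.count(",") + 1 for field in author_fields)
    return papers, multi, total


# Invented sample data.
sample_authors = ["Smith J.", "Smith J., Jones A.", "Brown B., Green C., White D."]
papers, multi, total = author_stats(sample_authors)
print(papers - multi, "single-authored;", total / papers, "authors per paper")

# Rechecking the figures quoted in the post:
print(1848 - 1511)              # single-authored papers
print(1200 - (1848 - 1511))     # York-led papers with more than one author
print(round(8383 / 1848, 1))    # average authors per paper
```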

Now, the big caveat. Affiliations are free-text addresses. They are entered more or less as on the original paper, so if someone makes a really silly mistake and gets entire bits of their institution name wrong, this may end up perpetuated in the database. There is some standardisation, but it’s not perfect – five 2007 papers turn out to match “univ. of york” but not “university of york”, and so did not make it into our search data. Five of the “University of York” affiliations, on close examination, turn out to match the Canadian York not the British one. So you need to be cautious. But the broad results are certainly good enough to be going on with!
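One way to soften both problems is a looser pattern plus a crude country check. The sketch below is a heuristic and an assumption on my part, not something Scopus provides: the pattern, the function name and the idea of rejecting addresses that mention Canada are all illustrative, and real affiliation strings will throw up cases it misses.

```python
import re

# Hypothetical pattern: accepts "University of York", "Univ. of York" and
# "Univ of York" in any letter case, but not "University of Yorkshire".
YORK_PATTERN = re.compile(r"\buniv(?:ersity|\.)?\s+of\s+york\b", re.IGNORECASE)


def looks_like_york_uk(affiliation):
    """Rough check: the name matches and the address does not mention Canada."""
    name_matches = bool(YORK_PATTERN.search(affiliation))
    return name_matches and "canada" not in affiliation.lower()


# Invented examples.
for text in [
    "Dept. of Chemistry, Univ. of York, Heslington, UK",
    "University of York, Toronto, Canada",
]:
    print(text, "->", looks_like_york_uk(text))
```

Anything this flags or rejects still deserves a look by eye; it only narrows down which records to check.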

On Article 50

Since early last year, I’ve lived in Islington North (pretty much bang in the middle of it, in fact). This means that my MP is one Jeremy Corbyn.

This is a bit odd. I’ve never been represented by someone who wasn’t a backbencher before (at least, not at Westminster; for a brief while years ago my MSP was the subduedly-titled Deputy Minister for Justice). It also means that there is very little reason for me to ever write to my MP – his positions on something are usually front-page news and for any given topic I can figure out pretty quickly that either he’s already made a statement supporting it or disagrees with me entirely.

But, the Article 50 vote looms, and I felt I ought to do it for once. I know he disagrees with me; I know he’s whipped his party that way. The letter is a cry in the dark. But, well, you do what you must do.

Dear Mr. Corbyn,

I am writing in regard to the Article 50 second reading vote scheduled for Wednesday February 1st. As a constituent, I urge you to reconsider your position on this bill, and to vote against it at the second reading.

Firstly, I wish to remind you that around 75% of your constituents voted Remain, on a turnout of 70%. Not only was Islington one of the most strongly pro-EU areas of the country, this was a larger share of the electorate than you yourself have ever received from the constituency – and it has always been a solidly Labour seat. This is a remarkable result, and I feel it is only proper that you acknowledge your constituents’ clearly expressed position here.

Secondly, on pragmatic grounds, this bill is likely to pass without significant amendments, and thus without any controls on Brexit barring those imposed by a weak Prime Minister. As such, it is essentially handing a blank cheque to the hard right of the Conservative Party, giving them carte blanche to engineer a Brexit most suited to their desired outcomes – xenophobic, intolerant, and devastating to the British people. This is a horrendous prospect.

Rejecting this bill at second reading will not stop Brexit and will not invalidate the referendum. However, rejecting the bill will have a decent chance of forcing these discussions to be open, to take place on a cross-party basis, and ensure that what emerges has a chance of being positive for us all.

Thirdly, the wider context. Internationally, the world has changed dramatically since last summer. Europe, with all its flaws, is a beacon of light and sanity compared to the United States, our closest non-EU ally. As you yourself noted yesterday, the Prime Minister’s support for Donald Trump places her firmly on the wrong side of history.

And in this light, the referendum result has some resonance. You were one of 84 Labour members to defy a whip and vote against the invasion of Iraq. A poll conducted the same day found about 50-55% of the country in favour of the war – the same number that voted to leave the EU.

A slim majority of the country – and the government – got it wrong a decade ago. We are not infallible. Sometimes, we all take the wrong steps and put ourselves on the wrong side of history. Now is a chance to put the brakes on and decide quite what we are doing, to move slowly and contemplatively, before continuing further.

I urge you to vote against this bill.

Open access and the Internet Archive

Late last year, I wanted to find out when the first article was published by F1000 Research. I idly thought, oh, rather than try and decipher their URLs or click “back” through their list of articles fifty times, I’ll go and look at the Internet Archive. To my utter astonishment, they’re not on it. From their robots.txt, buried among a list of (apparently) SEO-related crawler blocks –

User-agent: archive.org_bot
Disallow: /

The Internet Archive is well-behaved, and honours this restriction. Good for them. But putting the restriction there in the first place is baffling – surely a core goal of making articles open-access is to enable distribution, to ensure content is widely spread. And before we say “but of course F1000 won’t go away”, it is worth remembering that of 250 independently-run OA journals in existence in 2002, 40% had ceased publishing by 2013, and almost 10% had disappeared from the web entirely (see Björk et al 2016, table 1). Permanence is not always predictable, and backups are cheap.
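To see how a compliant crawler reads those two lines, here is a minimal sketch using Python's standard urllib.robotparser. It parses the quoted rules directly rather than fetching anything; the example.org URL and the second bot name are placeholders, not anything taken from the F1000 site.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above, parsed locally (no network access).
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL and bot name, for illustration only.
print(allowed("archive.org_bot", "https://example.org/articles/1"))
print(allowed("SomeOtherBot", "https://example.org/articles/1"))
```

The rule applies only to the named user agent, so every page is off limits to the Archive's crawler while a crawler with no rule of its own is unaffected.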

Their stated backup policy is that articles (and presumably reviews?) are stored at PMC, Portico, and in the British Library. That’s great. But that’s just the articles. Allowing the IA to index the site content costs nothing, it provides an extra backup, and it ensures that the “context” of the journal – authorial instructions, for example, or fees – remains available. This can be very important for other purposes – I couldn’t have done my work on Elsevier embargoes without IA copies of odd documents from their website, for example.

And… well, it’s a bit symbolic. If you’re making a great thing of being open, you should take that to its logical conclusion and allow people to make copies of your stuff. Don’t lock it away from indexing and crawling. PLOS One have Internet Archive copies. So do Nature Communications, Scientific Reports, BMJ Open, Open Library of the Humanities, PeerJ. In fact, every prominent all-OA title I’ve checked happily allows this. Why not F1000? Is it an oversight? A misunderstanding? I find it hard to imagine it would be a deliberate move on their part…

History of Parliament and Wikidata – the first round complete

Back in January, I wrote up some things I was aiming to do this year, including:

Firstly, I’d like to clear off the History of Parliament work on Wikidata. I haven’t really written this up yet (maybe that’s step 1.1) but, in short, I’m trying to get every MP in the History of Parliament database listed and crossreferenced in Wikidata. At the moment, we have around 5200 of them listed, out of a total of 22200 – so we’re getting there. (Raw data here.) Finding the next couple of thousand who’re listed, and mass-creating the others, is definitely an achievable task.

Well, seven months later, here’s where it stands:

  • 9,372 of a total 21,400 (43.7%) of History of Parliament entries have been matched to records for people in Wikidata.
  • These 9,372 entries represent 7,257 people – 80 have entries in three HoP volumes, and 1,964 in two volumes. (This suggests that, when complete, we will have about 16,500 people for those initial 21,400 entries – so maybe we’re actually over half-way there).
  • These are crossreferenced to a lot of other identifiers. 1,937 of our 7,257 people (26.7%) are in the Oxford Dictionary of National Biography, 1,088 (15%) are in the National Portrait Gallery database, and 2,256 (31.1%) are linked to their speeches in the digital edition of Hansard. There is a report generated each night crosslinking various interesting identifiers.
  • Every MP in the 1820-32 volume (1,367 of them) is now linked and identified, and the 1790-1820 volume is now around 85% complete. (This explains the high showing for Hansard, which covers 1805 onwards)
  • The metadata for these is still limited – a lot more importing work to do – but in some cases pretty decent; 94% of the 1820-32 entries have a date of death, for example.
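
The projection of roughly 16,500 people in the list above can be reproduced with a couple of lines of arithmetic. The sketch assumes that the entries not yet matched repeat across volumes at the same rate as the matched ones, which is a simplification; the variable names are just for illustration.

```python
matched_entries = 9372   # HoP entries matched to Wikidata so far
matched_people = 7257    # distinct people those entries represent
total_entries = 21400    # all HoP entries

# Assumption: the same people-per-entry ratio holds for the unmatched entries.
people_per_entry = matched_people / matched_entries
estimated_people = people_per_entry * total_entries

print(round(people_per_entry, 3))
print(round(estimated_people))
```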

Of course, there’s a lot more still to do – more metadata to add, more linkages to make, and so on. It still does not have any reasonable data linking MPs to constituencies, which is a major gap (but perhaps one that can be filled semi-automatically using the HoP/Hansard links and a clever script).

But as a proof of concept, I’m very happy with it. Here’s some queries playing with the (1820-32) data:

  • There are 990 MPs with an article about them in at least one language/WM project. Strikingly, ten of these don’t have an English Wikipedia article (yet). The most heavily written-about MP is – to my surprise – David Ricardo, with articles in 67 Wikipedias. (The next three are Peel, Palmerston, and Edward Bulwer-Lytton).
  • 303 of the 1,367 MPs (22.1%) have a recorded link to at least one other person in Wikidata by a close family relationship (parent, child, spouse, sibling) – there are 803 links, to 547 unique people – 108 of whom are also in the 1820-32 MPs list, and 439 of whom are from elsewhere in Wikidata. (I expect this number to rise dramatically as more metadata goes in).
  • The longest-surviving pre-Reform MP (of the 94% indexed by deathdate, anyway) was John Savile, later Earl of Mexborough, who made it to August 1899…
  • Of the 360 with a place of education listed, the most common is Eton (104), closely followed by Christ Church, Oxford (97) – there is, of course, substantial overlap between them. It’s impressive to see just how far we’ve come. No-one would ever expect to see anything like that for Parliament today, would we.
  • Of the 1,185 who’ve had their first name indexed by Wikidata so far, the most popular is John (14.4%), then William (11.5%), Charles (7.5%), George (7.4%), and Henry (7.2%):

  • A map of the (currently) 154 MPs whose place of death has been imported:

All these are of course provisional, but it makes me feel I’m definitely on the right track!


So, you may be asking, what can I do to help? Why, thankyou, that’s very kind…

  • First of all, this is the master list, updated every night, of as-yet-unmatched HoP entries. Grab one, load it up, search Wikidata for a match, and add it (property P1614). Bang, one more down, and we’re 0.01% closer to completion…
  • It’s not there? (About half to two thirds probably won’t be). You can create an item manually, or you can set it aside to create a batch of them later. I wrote a fairly basic bash script to take a spreadsheet of HoP identifiers and basic metadata and prepare it for bulk-item-creation on Wikidata.
  • Or you could help sanitise some of the metadata – here’s some interesting edge cases:
    • This list is ~680 items who probably have a death date (the HoP slug ends in a number), but who don’t currently have one in Wikidata.
    • This list is ~540 people who are titled “Honourable” – and so are almost certainly the sons of noblemen, themselves likely to be in Wikidata – but who don’t have a link to their father. This list is the same, but for “Lord”, and this list has all the apparently fatherless men who were the 2nd through 9th holders of a title…

Open questions about the costs of the scholarly publishing system

Stuart Lawson (et al)’s new paper on “Opening the Black Box of Scholarly Communication Funding” is now out – it’s an excellent contribution to the discussion and worth a read.

From their conclusion:

The current lack of publicly available information concerning financial flows around scholarly communication systems is an obstacle to evidence-based policy-making – leaving researchers, decision-makers and institutions in the dark about the implications of current models and the resources available for experimenting with new ones.

It prompts me to put together a list I’ve been thinking about for a while – what do we still need to know about the scholarly publishing market?

  • What are the actual totals of author/institutional payments to publishers outside of subscriptions and APCs – page charges, colour charges, submission fees, and so on? I have recently estimated that for the UK this is on the order of a few million pounds per year, but that’s very provisional, and doesn’t include things like reprint payments or delve into the different local practices. All we can say for sure at this stage is “yes, it’s still non-trivial, more work needed”.
  • What are the overall amounts paid by readers to publishers and aggregators for pay-per-view articles? In 2011 I found that (for JSTOR at least) the numbers are vanishingly small. I’ve not seen much other investigation of this, surprisingly – or have I just missed it?
  • Can an overall value be put on the collective “journal support” costs – for example, subsidies from a scholarly society or institution to keep their journal afloat, or grants from funding bodies directly for operating journals? This money fills a gap between subscriptions and publication costs, and is essential to keep many journals operating, but is often skimmed over.
  • How closely do quoted APC prices reflect actual costs paid? After currency fluctuation, VAT, and sometimes offset membership discounting, these can vary widely, which can make it very difficult to anticipate the actual amount which will be invoiced. (A special prize for demonstrating the point here goes to the unnamed publisher who invoices in Euro for a list price in USD, and includes annotations showing a GBP tax calculation). Reporting tends to be based on actual price paid, which helps, but a lot of policy and theory is based on list-price estimates.
  • How are double-dipping/hybrid offsetting systems working out, now they’ve had a couple of years to bed in? There has been quite a bit of discussion looking at the top-level figures (total subscriptions paid plus total APCs paid) which suggests that the answer is “total amounts paid are still rising”, which is probably correct. However, there’s very little looking in detail at per-journal costs, how the offsets (if any) are calculated, and whether or not the mechanisms used make sense given the relatively low number of hybrid articles in any given journal. Work here could help come up with a standard way of calculating offsets, which could be used in future negotiations. Hybrids won’t be going away any time soon…
  • What contribution to the subscription/publishing charges market comes from outside academia? We tend to focus on university payments (as these are both substantial and reasonably well-documented) but there are very large markets for subscription academic material in, for example, medicine, scientific industry, and law. These are not well understood.

And, finally, the big one:

  • How much does it cost (indirectly/implicitly) to maintain the current subscription-based system? We have a decent idea of how much the indirect costs of gold/green open access are, thanks to recent work on the ‘total cost of publication’, but no idea of the indirect costs of the status quo. And we really, really need to figure it out.

To illustrate that last point, and why I think it’s important…

A large number of librarians (and others) spend much of their time maintaining access systems, handling subscription payments, negotiating usage agreements, fixing user access problems, and so on. Then the publishers themselves have to pay staff to develop and maintain these systems, handle negotiations, deal with payments, etc. Centralised services like JISC’s collective negotiation mean more labour, and some centralised services like ATHENS can be surprisingly expensive to use.

Let’s make a wild guess that it comes down to one FTE staff member per university (it probably isn’t that much work for Chester, but it’s a lot more for Cambridge, so it might balance out); that’s about 130 in the UK. Ten more for all the non-university institutions. Five more for the central services. Five each at the five biggest publishers and another ten for all the others. Total – for our wild estimate – 180 FTE staff. (While the publisher staff aren’t paid by the universities, they’re ultimately paid out of the cost of subscriptions, and so it’s reasonable to consider them part of the overall system cost.)

This number compares interestingly with the 192 FTE that it was estimated would be needed to deal with the administration of making all 140,000 UK research papers gold OA – they’re certainly in the same ballpark, given the wide margins of error. It has substantial implications for any “just switch everything”-type proposals, for obvious reasons, but would also be a very interesting result in and of itself.
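
Written out as a tally, the wild guess is easy to adjust if any one of the assumptions looks off. Every figure below is one of the guesses from the text rather than measured data, and the dictionary keys are just labels.

```python
# Guessed FTE staff needed to keep the subscription system running (UK).
fte_estimate = {
    "universities (one each)": 130,
    "non-university institutions": 10,
    "central services": 5,
    "five biggest publishers (five each)": 5 * 5,
    "all other publishers": 10,
}

total_fte = sum(fte_estimate.values())
gold_oa_fte = 192  # the estimate quoted above for administering all-gold OA

print(total_fte)
print(round(total_fte / gold_oa_fte, 2))  # subscription guess as a share of the OA figure
```

Doubling the per-university figure alone would add another 130 to the total, which is a reminder of how wide the margin of error is.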

Lee of Portrush: an introduction

One of the projects I’ve been meaning to get around to for a while is scanning and dating a boxful of old cabinet photographs and postcards produced by Lee of Portrush in the late nineteenth and early twentieth century.

At least five members and three generations of the Lee family worked as professional photographers in this small Northern Irish town – the last of them was my grandfather, William Lee, who carried the business on into the 1970s. Their later output doesn’t turn up much – I don’t think I’ve run across anything post-1920s – but a steady trickle of their older photographs appears on eBay and on family history sites. They produced a range of monochrome and coloured postcards of Portrush and the surrounding area, did a good trade in portrait photographs, and at one point ended up proprietors of (both temperance and non-temperance) hotels. Briefly, one brother decamped to South Africa (before deciding to come home again) and they proudly announced “Portrush, Coleraine, and Cape Town” – a combination rarely encountered. A more unusual line of work, however, was that they had a studio at the Giant’s Causeway.

The Causeway is the only World Heritage Site in Northern Ireland, and was as popular a tourist attraction then as now. A narrow-gauge electric tramline was built out from Portrush to Bushmills and then the Causeway in the 1880s, bringing in a sharp increase in visitors. And – because the Victorians were more or less the same people as we are now – they decided there was no better way to respond to a wonder of the natural world than to have your photograph taken while standing on it, so that you can show it to all your friends. Granted, you had to pay someone to take the photo, sit still with a rictus grin, then wait for them to faff around with wet plates and developer; not quite an iPhone selfie, but the spirit is the same even if the subjects were wearing crinolines. There is nothing new in this world.

The Lees responded cheerfully to this, and in addition to the profitable postcard trade, made a great deal of money by taking photographs of tourists up from Belfast or Dublin, or even further afield. (They then lost it again over the years; Portrush was not a great place for long-term investment once holidays to the Mediterranean became popular.)

Many of these are sat in shoeboxes; some turn up occasionally on eBay, where I buy them if they’re a few pounds. It’s a nice thing to have, since so little else survives of the business. One problem is that very few are clearly dated, and as all parts of the family seem to have used “Lees Studio”, or a variant, it’s not easy to put them in order, or to give a historical context. For the people who have these as genealogical artefacts, this is something of a problem – ideally, we’d be able to say that this particular card style was early, 1880-1890, that address was later, etc., to help give some clues as to when it was taken.

Fast forward a few years. Last November, I had an email from John Kavanaugh, who’d found a Lee photograph of his great-great-grandfather (John Kavanagh, 1822-1904), and managed to recreate the scene on a visit to the Causeway:

Family resemblance, 1895-2015
Courtesy John Kavanaugh/Efren Gonzalez

It’s quite striking how similar the two are. The stone the elder John was sat on has now crumbled, fallen, or been moved, but the rock formations behind him are unchanged. The original photo is dated c. 1895, so this covers a hundred and twenty years and five generations.

So, taking this as a good impetus to get around to the problem, I borrowed a scanner yesterday and set to. Fifty-odd photographs later, I’ve updated the collection on Flickr, and over the next few posts I’ll try and draw together some notes on how to date them.