First authors in Scopus

I’ve been doing a lot of bibliometric work recently. One task that I bounced off a few times before figuring out a practical approach was statistics on first authors; since I’ve finally figured it out, it seemed worth making a note of it. This uses Scopus and some very basic bash shell commands.

Let’s say we want to find out what proportion of papers from the University of York in 2007 had York-affiliated first authors. At first glance, this is a simple problem – Web of Science or Scopus will give you a list of affiliations for each paper, and as far as I know they’re listed in order of appearance; so download that, sort it, count all the ones that start with York, you’re done.

Unfortunately, you get people with double affiliations. Are there enough of them to be significant? For a small institution, quite possibly. It means we can’t use Web of Science, as their data – while wonderfully sorted and deduplicated against institutions – just says “this paper has these affiliations”.

Scopus, however, associates affiliations to authors. This means that you can reliably pick any given author for a paper and report what their affiliations are. (It also means that you can do some weighting – five authors from X and one from Y may not be the same as one from X and five from Y in your particular scenario).

Log into Scopus, run your search. Export the results, making sure to select “Affiliations” from the menu, and filetype CSV. It does not work well with sets of more than 2000 papers, so you may want to do some careful subdivision of your query. Thankfully, our example has 1848 results…

The result is a bit messy, because CSVs… well, they’re messy. Let’s convert it into a nice TSV. Create a file to contain this very short python script:

#!/usr/bin/env python

# Read CSV on stdin, write the same rows tab-separated on stdout.
import csv, sys
csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))

Make it executable with chmod +x csv2tsv, then run cat scopus.csv | ./csv2tsv > scopus.tsv

Occasionally you can get papers with ludicrous numbers of authors, all of whom have their affiliations in a single field, and trying to import this into a spreadsheet gets messy – I think the record I had was something like 44k of text in a single name/affiliation field. So we’ll do this all from the command line.

First off, let’s check the file is the right length. wc -l scopus.tsv should give 1849 – one greater than the expected total because we still have a header row.

Now then, let’s look at the author/affiliation field. cut -f 15 scopus.tsv will extract this. The thing to note here is that the individual authors are separated by semicolons, while multiple affiliations for the same author are separated only by commas. So if we want to extract the first author, all we need to do is extract everything before the first semicolon – cut -f 15 scopus.tsv | cut -f 1 -d \;

Now, we want to find out how many of those match York. cut -f 15 scopus.tsv | cut -f 1 -d \; | grep "University of York" will find all those with the university name in the extracted affiliation; we can count the lines with cut -f 15 scopus.tsv | cut -f 1 -d \; | grep "University of York" | wc -l

Hey presto – 1200 exactly. Of our 1848 papers in 2007, 1200 (65%) had a York-based first author.

Wait, you cry, that sounds like a pretty impressive number – but how many of those were single-authored papers? We can answer that, too. The first field simply separates all authors with commas, so any author field with commas must have multiple authors. cut -f 1 scopus.tsv | grep \, | wc -l – and we get 1511.

So of the 1848 papers York published that year, 337 were single-authored – and since any single-authored paper returned by a York affiliation search necessarily has a York-affiliated author, all 337 of those were York-led. Of the remaining 1511, the other 863 (57%) were led by York authors.

And while we’re on that topic – how many authors were there, on average? Again, our friend the comma steps in. cut -f 1 scopus.tsv | sed 's/\,/\n/g' | wc -l switches every comma for a linebreak, so each author on each paper gets a new line, then counts the results. 8384 – except as you’ll remember we still have a header row, and it will be included in the results because there’s no grep to filter it out, so 8383. Across 1848 papers, that’s an average of 4.5 authors/paper.

Now, the big caveat. Affiliations are free-text addresses. They are entered more or less as on the original paper, so if someone makes a really silly mistake and gets entire bits of their institution name wrong, this may end up perpetuated in the database. There is some standardisation, but it’s not perfect – five 2007 papers turn out to match “univ. of york” but not “university of york”, and so did not make it into our search data. Five of the “University of York” affiliations, on close examination, turn out to match the Canadian York not the British one. So you need to be cautious. But the broad results are certainly good enough to be going on with!

On Article 50

Since early last year, I have lived in Islington North (pretty much bang in the middle of it, in fact). This means that my MP is one Jeremy Corbyn.

This is a bit odd. I’ve never been represented by someone who wasn’t a backbencher before (at least, not at Westminster; for a brief while years ago my MSP was the subduedly-titled Deputy Minister for Justice). It also means that there is very little reason for me to ever write to my MP – his positions are usually front-page news, and for any given topic I can figure out pretty quickly that he has either already made a statement supporting it or disagrees with me entirely.

But, the Article 50 vote looms, and I felt I ought to do it for once. I know he disagrees with me; I know he’s whipped his party that way. The letter is a cry in the dark. But, well, you do what you must do.

Dear Mr. Corbyn,

I am writing in regard to the Article 50 second reading vote scheduled for Wednesday February 1st. As a constituent, I urge you to reconsider your position on this bill, and to vote against it at the second reading.

Firstly, I wish to remind you that around 75% of your constituents voted Remain, on a turnout of 70%. Not only was Islington one of the most strongly pro-EU areas of the country, this was a larger share of the electorate than you yourself have ever received from the constituency – and it has always been a solidly Labour seat. This is a remarkable result, and I feel it is only proper that you acknowledge your constituents’ clearly expressed position here.

Secondly, on pragmatic grounds, this bill is likely to pass without significant amendments, and thus without any controls on Brexit barring those imposed by a weak Prime Minister. As such, it is essentially handing a blank cheque to the hard right of the Conservative Party, giving them carte blanche to engineer a Brexit most suited to their desired outcomes – xenophobic, intolerant, and devastating to the British people. This is a horrendous prospect.

Rejecting this bill at second reading will not stop Brexit and will not invalidate the referendum. However, rejecting the bill will have a decent chance of forcing these discussions to be open, to take place on a cross-party basis, and ensure that what emerges has a chance of being positive for us all.

Thirdly, the wider context. Internationally, the world has changed dramatically since last summer. Europe, with all its flaws, is a beacon of light and sanity compared to the United States, our closest non-EU ally. As you yourself noted yesterday, the Prime Minister’s support for Donald Trump places her firmly on the wrong side of history.

And in this light, the referendum result has some resonance. You were one of 84 Labour members to defy a whip and vote against the invasion of Iraq. A poll conducted the same day found about 50-55% of the country in favour of the war – the same number that voted to leave the EU.

A slim majority of the country – and the government – got it wrong a decade ago. We are not infallible. Sometimes, we all take the wrong steps and put ourselves on the wrong side of history. Now is a chance to put the brakes on and decide quite what we are doing, to move slowly and contemplatively, before continuing further.

I urge you to vote against this bill.

Open access and the Internet Archive

Late last year, I wanted to find out when the first article was published by F1000 Research. I idly thought, oh, rather than try and decipher their URLs or click “back” through their list of articles fifty times, I’ll go and look at the Internet Archive. To my utter astonishment, they’re not on it. From their robots.txt, buried among a list of (apparently) SEO-related crawler blocks –

User-agent: archive.org_bot
Disallow: /

The Internet Archive is well-behaved, and honours this restriction. Good for them. But putting the restriction there in the first place is baffling – surely a core goal of making articles open-access is to enable distribution, to ensure content is widely spread. And before we say “but of course F1000 won’t go away”, it is worth remembering that of 250 independently-run OA journals in existence in 2002, 40% had ceased publishing by 2013, and almost 10% had disappeared from the web entirely (see Björk et al 2016, table 1). Permanence is not always predictable, and backups are cheap.
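
This sort of block is easy to test locally with Python’s standard urllib.robotparser – a quick sketch, parsing just the two lines quoted above rather than fetching anything:

```python
from urllib.robotparser import RobotFileParser

# The two lines from the robots.txt quoted above.
RULES = """\
User-agent: archive.org_bot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# The Internet Archive's crawler is refused everything...
print(rp.can_fetch("archive.org_bot", "/articles/"))  # False
# ...while crawlers not named in the rules are unaffected.
print(rp.can_fetch("Googlebot", "/articles/"))        # True
```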

Their stated backup policy is that articles (and presumably reviews?) are stored at PMC, Portico, and in the British Library. That’s great. But that’s just the articles. Allowing the IA to index the site content costs nothing, it provides an extra backup, and it ensures that the “context” of the journal – authorial instructions, for example, or fees – remains available. This can be very important for other purposes – I couldn’t have done my work on Elsevier embargoes without IA copies of odd documents from their website, for example.

And… well, it’s a bit symbolic. If you’re making a great thing of being open, you should take that to its logical conclusion and allow people to make copies of your stuff. Don’t lock it away from indexing and crawling. PLOS One have Internet Archive copies. So do Nature Communications, Scientific Reports, BMJ Open, Open Library of the Humanities, PeerJ. In fact, every prominent all-OA title I’ve checked happily allows this. Why not F1000? Is it an oversight? A misunderstanding? I find it hard to imagine it would be a deliberate move on their part…

History of Parliament and Wikidata – the first round complete

Back in January, I wrote up some things I was aiming to do this year, including:

Firstly, I’d like to clear off the History of Parliament work on Wikidata. I haven’t really written this up yet (maybe that’s step 1.1) but, in short, I’m trying to get every MP in the History of Parliament database listed and crossreferenced in Wikidata. At the moment, we have around 5200 of them listed, out of a total of 22200 – so we’re getting there. (Raw data here.) Finding the next couple of thousand who’re listed, and mass-creating the others, is definitely an achievable task.

Well, seven months later, here’s where it stands:

  • 9,372 of a total 21,400 History of Parliament entries (43.7%) have been matched to records for people in Wikidata.
  • These 9,372 entries represent 7,257 people – 80 have entries in three HoP volumes, and 1,964 in two volumes. (This suggests that, when complete, we will have about 16,500 people for those initial 21,400 entries – so maybe we’re actually over half-way there).
  • These are crossreferenced to a lot of other identifiers. 1,937 of our 7,257 people (26.7%) are in the Oxford Dictionary of National Biography, 1,088 (15%) are in the National Portrait Gallery database, and 2,256 (31.1%) are linked to their speeches in the digital edition of Hansard. There is a report generated each night crosslinking various interesting identifiers.
  • Every MP in the 1820-32 volume (1,367 of them) is now linked and identified, and the 1790-1820 volume is now around 85% complete. (This explains the high showing for Hansard, which covers 1805 onwards)
  • The metadata for these is still limited – a lot more importing work to do – but in some cases pretty decent; 94% of the 1820-32 entries have a date of death, for example.

Of course, there’s a lot more still to do – more metadata to add, more linkages to make, and so on. It still does not have any reasonable data linking MPs to constituencies, which is a major gap (but perhaps one that can be filled semi-automatically using the HoP/Hansard links and a clever script).

But as a proof of concept, I’m very happy with it. Here’s some queries playing with the (1820-32) data:

  • There are 990 MPs with an article about them in at least one language/WM project. Strikingly, ten of these don’t have an English Wikipedia article (yet). The most heavily written-about MP is – to my surprise – David Ricardo, with articles in 67 Wikipedias. (The next three are Peel, Palmerston, and Edward Bulwer-Lytton).
  • 303 of the 1,367 MPs (22.1%) have a recorded link to at least one other person in Wikidata by a close family relationship (parent, child, spouse, sibling) – there are 803 links, to 547 unique people – 108 of whom are also in the 1820-32 MPs list, and 439 of whom are from elsewhere in Wikidata. (I expect this number to rise dramatically as more metadata goes in).
  • The longest-surviving pre-Reform MP (of the 94% indexed by deathdate, anyway) was John Savile, later Earl of Mexborough, who made it to August 1899…
  • Of the 360 with a place of education listed, the most common is Eton (104), closely followed by Christ Church, Oxford (97) – there is, of course, substantial overlap between them. It’s impressive to see just how far we’ve come. No-one would ever expect to see anything like that for Parliament today, would we?
  • Of the 1,185 who’ve had first name indexed by Wikidata so far, the most popular is John (14.4%), then William (11.5%), Charles (7.5%), George (7.4%), and Henry (7.2%):

  • A map of the (currently) 154 MPs whose place of death has been imported:

All these are of course provisional, but it makes me feel I’m definitely on the right track!

So, you may be asking, what can I do to help? Why, thank you, that’s very kind…

  • First of all, this is the master list, updated every night, of as-yet-unmatched HoP entries. Grab one, load it up, search Wikidata for a match, and add it (property P1614). Bang, one more down, and we’re 0.01% closer to completion…
  • It’s not there? (About half to two thirds probably won’t be). You can create an item manually, or you can set it aside to create a batch of them later. I wrote a fairly basic bash script to take a spreadsheet of HoP identifiers and basic metadata and prepare it for bulk-item-creation on Wikidata.
  • Or you could help sanitise some of the metadata – here’s some interesting edge cases:
    • This list is ~680 items who probably have a death date (the HoP slug ends in a number), but who don’t currently have one in Wikidata.
    • This list is ~540 people who are titled “Honourable” – and so are almost certainly the sons of noblemen, themselves likely to be in Wikidata – but who don’t have a link to their father. This list is the same, but for “Lord”, and this list has all the apparently fatherless men who were the 2nd through 9th holders of a title…
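
Incidentally, the entries/people numbers quoted above are the sort of thing you can pull straight from the Wikidata Query Service. A sketch – P1614 is the History of Parliament ID property; the helper name is mine, and I’ve left the actual HTTP fetch to whatever client you prefer:

```python
from urllib.parse import urlencode

# Count HoP-matched entries and distinct people on Wikidata.
# wdt:P1614 is the History of Parliament ID property.
QUERY = """\
SELECT (COUNT(?id) AS ?entries) (COUNT(DISTINCT ?person) AS ?people)
WHERE { ?person wdt:P1614 ?id . }
"""

def wdqs_url(query):
    """Build a Wikidata Query Service URL returning JSON results."""
    return ("https://query.wikidata.org/sparql?"
            + urlencode({"query": query, "format": "json"}))

# Fetching wdqs_url(QUERY) with any HTTP client returns the live counts.
```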

Open questions about the costs of the scholarly publishing system

Stuart Lawson and colleagues’ new paper on “Opening the Black Box of Scholarly Communication Funding” is now out – it’s an excellent contribution to the discussion and worth a read.

From their conclusion:

The current lack of publicly available information concerning financial flows around scholarly communication systems is an obstacle to evidence-based policy-making – leaving researchers, decision-makers and institutions in the dark about the implications of current models and the resources available for experimenting with new ones.

It prompts me to put together a list I’ve been thinking about for a while – what do we still need to know about the scholarly publishing market?

  • What are the actual totals of author/institutional payments to publishers outside of subscriptions and APCs – page charges, colour charges, submission fees, and so on? I have recently estimated that for the UK this is on the order of a few million pounds per year, but that’s very provisional, and doesn’t include things like reprint payments or delve into the different local practices. All we can say for sure at this stage is “yes, it’s still non-trivial, more work needed”.
  • What are the overall amounts paid by readers to publishers and aggregators for pay-per-view articles? In 2011 I found that (for JSTOR at least) the numbers are vanishingly small. I’ve not seen much other investigation of this, surprisingly – or have I just missed it?
  • Can an overall value be put on the collective “journal support” costs – for example, subsidies from a scholarly society or institution to keep their journal afloat, or grants from funding bodies directly for operating journals? This money fills a gap between subscriptions and publication costs, and is essential to keep many journals operating, but is often skimmed over.
  • How closely do quoted APC prices reflect actual costs paid? After currency fluctuation, VAT, and sometimes offset membership discounting, these can vary widely, which can make it very difficult to anticipate the actual amount which will be invoiced. (A special prize for demonstrating the point here goes to the unnamed publisher who invoices in Euro for a list price in USD, with annotations showing a GBP tax calculation). Reporting tends to be based on actual price paid, which helps, but a lot of policy and theory is based on list-price estimates.
  • How are double-dipping/hybrid offsetting systems working out, now they’ve had a couple of years to bed in? There has been quite a bit of discussion looking at the top-level figures (total subscriptions paid plus total APCs paid) which suggests that the answer is “total amounts paid are still rising”, which is probably correct. However, there’s very little looking in detail at per-journal costs, how the offsets (if any) are calculated, and whether or not the mechanisms used make sense given the relatively low number of hybrid articles in any given journal. Work here could help come up with a standard way of calculating offsets, which could be used in future negotiations. Hybrids won’t be going away any time soon…
  • What contribution to the subscription/publishing charges market comes from outside academia? We tend to focus on university payments (as these are both substantial and reasonably well-documented) but there are very large markets for subscription academic material in, for example, medicine, scientific industry, and law. These are not well understood.

And, finally, the big one:

  • How much does it cost (indirectly/implicitly) to maintain the current subscription-based system? We have a decent idea of how much the indirect costs of gold/green open access are, thanks to recent work on the ‘total cost of publication’, but no idea of the indirect costs of the status quo. And we really, really need to figure it out.

To illustrate that last point, and why I think it’s important…

A large number of librarians (and others) spend much of their time maintaining access systems, handling subscription payments, negotiating usage agreements, fixing user access problems, and so on. Then the publishers themselves have to pay staff to develop and maintain these systems, handle negotiations, deal with payments, etc. Centralised services like JISC’s collective negotiation mean more labour, and some centralised services like ATHENS can be surprisingly expensive to use.

Let’s make a wild guess that it comes down to one FTE staff member per university (it probably isn’t that much work for Chester, but it’s a lot more for Cambridge, so it might balance out); that’s about 130 in the UK. Ten more for all the non-university institutions. Five more for the central services. Five each at the five biggest publishers and another ten for all the others. Total – for our wild estimate – 180 FTE staff. (While the publisher staff aren’t paid by the universities, they’re ultimately paid out of the cost of subscriptions, and so it’s reasonable to consider them part of the overall system cost.)
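
To make the arithmetic of that wild guess explicit (every number here is the guess above, not data):

```python
# Back-of-envelope FTE estimate for running the subscription system.
fte_guess = {
    "university libraries (~1 each)":   130,
    "non-university institutions":       10,
    "central services":                   5,
    "five biggest publishers (5 each)":  25,
    "all other publishers":              10,
}
total_fte = sum(fte_guess.values())
print(total_fte)  # 180
```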

This number compares interestingly with the 192 FTE that it was estimated would be needed to deal with the administration of making all 140,000 UK research papers gold OA – they’re certainly in the same ballpark, given the wide margins of error. It has substantial implications for any “just switch everything”-type proposals, for obvious reasons, but would also be a very interesting result in and of itself.

Lee of Portrush: an introduction

One of the projects I’ve been meaning to get around to for a while is scanning and dating a boxful of old cabinet photographs and postcards produced by Lee of Portrush in the late nineteenth and early twentieth century.

At least five members and three generations of the Lee family worked as professional photographers in this small Northern Irish town – the last of them was my grandfather, William Lee, who carried the business on into the 1970s. Their later output doesn’t turn up much – I don’t think I’ve run across anything post-1920s – but a steady trickle of their older photographs appears on eBay and on family history sites. They produced a range of monochrome and coloured postcards of Portrush and the surrounding area, did a good trade in portrait photographs, and at one point ended up proprietors of (both temperance and non-temperance) hotels. Briefly, one brother decamped to South Africa (before deciding to come home again) and they proudly announced “Portrush, Coleraine, and Cape Town” – a combination rarely encountered. A more unusual line of work, however, was that they had a studio at the Giant’s Causeway.

The Causeway is the only World Heritage Site in Northern Ireland, and was as popular a tourist attraction then as now. A narrow-gauge electric tramline was built out from Portrush to Bushmills and then the Causeway in the 1880s, bringing in a sharp increase in visitors. And – because the Victorians were more or less the same people as we are now – they decided there was no better way to respond to a wonder of the natural world than to have your photograph taken while standing on it, so that you can show it to all your friends. Granted, you had to pay someone to take the photo, sit still with a rictus grin, then wait for them to faff around with wet plates and developer; not quite an iPhone selfie, but the spirit is the same even if the subjects were wearing crinolines. There is nothing new in this world.

The Lees responded cheerfully to this, and in addition to the profitable postcard trade, made a great deal of money by taking photographs of tourists up from Belfast or Dublin, or even further afield. (They then lost it again over the years; Portrush was not a great place for long-term investment once holidays to the Mediterranean became popular.)

Many of these are sat in shoeboxes; some turn up occasionally on eBay, where I buy them if they’re a few pounds. It’s a nice thing to have, since so little else survives of the business. One problem is that very few are clearly dated, and as all parts of the family seem to have used “Lees Studio”, or a variant, it’s not easy to put them in order, or to give a historical context. For the people who have these as genealogical artefacts, this is something of a problem – ideally, we’d be able to say that this particular card style was early, 1880-1890, that address was later, etc., to help give some clues as to when it was taken.

Fast forward a few years. Last November, I had an email from John Kavanaugh, who’d found a Lee photograph of his great-great-grandfather (John Kavanagh, 1822-1904), and managed to recreate the scene on a visit to the Causeway:

Family resemblance, 1895-2015
Courtesy John Kavanaugh/Efren Gonzalez

It’s quite striking how similar the two are. The stone the elder John was sat on has now crumbled, fallen, or been moved, but the rock formations behind him are unchanged. The original photo is dated c. 1895, so this covers a hundred and twenty years and five generations.

So, taking this as a good impetus to get around to the problem, I borrowed a scanner yesterday and set to. Fifty-odd photographs later, I’ve updated the collection on flickr, and over the next few posts I’ll try and draw together some notes on how to date them.

(Addendum: all comments below will be replied to by email if possible! I am always delighted to see new photographs from the Lees, and will see what I can do to help you date them. Many thanks for all the comments, and please do get in touch.)

Android preinstalls – a ticking timebomb

So, I got a push notification on my phone today from “Peel Smart Remote”. Never heard of it. This turns out to be one of those applications for people who really need to use their phone as a TV remote; a bit pointless, but hey, I’m sure someone thinks it’s a great idea.

I don’t own a TV, so unsurprisingly, I’m not one of those people. The app turned out to be pre-installed on my phone (originally under a different name), and is undeleteable – but I can “disable” it and delete any data it had recorded. (Data they should, of course, not have, but trying to tell American startups about privacy is like trying to explain delayed gratification to a piranha, so let’s not even go there.)

I then went through my phone’s app list looking for the other junk like this. Four of them, all with pre-approved push notifications, and all now disabled. (I’m leaving aside the pre-installed ones which I might actually want to use…)

But when I removed them, I happened to scroll down and look at permissions. The Peel app, which has been running quietly in the background for about two years, has had an astonishing range of permissions.

  • read contact data (giving the ability to know personal details of anyone stored as a contact – along with metadata about when and how I contact them)
  • create calendar events and email guests without my awareness
  • read and write anything stored on the SD card
  • full internet access

Let’s not even ask why a TV remote would need the ability to find out who all my contacts are.

The others were not much better. Blurb (a small print-on-demand publishing firm) could read my data and find out who was calling me. Flipboard (a social-media aggregator) could read my data. And “ChatON”, which seems to be some kind of now-defunct messaging service run by Samsung: its app could call people, record audio, take pictures, find my location, read all my data (and my contact data), create accounts, shut down other applications, force the phone to remain active – basically every permission in the book. Again, that’s been burbling away for two years. Always on, starting on launch, and… what?

Now, I’ll be fair here – it’s unlikely that a startup like Peel has a business plan that involves “gather a load of personal data and sell it”. But how could I know for sure? It’s hardly an unknown approach out there. And on reflection, maybe it’s not their business plan we need to worry about.

Let’s imagine a startup made something like ChatON. They get widespread ‘adoption’ (by paying for preinstalls), but ultimately it doesn’t take off. They fail – as ChatON did – but without the ability of a large corporation to write it off as a failure and file it away, the residue of the company and its assets are sold for some trivial sum to whoever turns up.

Those assets include a hundred million always-on apps on phones worldwide, with security permissions to record everything and transmit, and preapproved automatic updates.

If you’re not grimacing at that, you haven’t thought about it enough.

This is one thing that Apple have got right – very little preinstalled that isn’t from the manufacturer directly. Maybe I could switch to an iPhone, or maybe it’s time to finally think about Cyanogen.

But that’d fix it for me. The underlying systemic risk is still there… and one day we’re all going to get burned. Preinstalled third-party apps with broad permissions are a time-bomb, and the phone manufacturers should probably think hard about their (legal and reputational) liability.

Shifting of the megajournal market

One of the most striking developments in the last ten years of scholarly publishing, outside of course open access, was the rise of the “megajournal” – an online-only journal with a very broad remit, no arbitrary size limits, and a low threshold for inclusion.

For many years, the megajournal was more or less synonymous with PLOS One, which peaked in 2013-14 with around 32,000 papers per year, an unprecedented number. The journal began to falter a little in early 2014, and showed a substantial decline in 2015, dropping to a mere (!) 26,000 papers.

One commentator highlighted a point I found very interesting: while PLOS One was shrinking, other megajournals were taking up the slack. The two highlighted here were Scientific Reports (Nature) and RSC Advances (Royal Society of Chemistry) – no others have grown to quite the same extent.

We’re now a month into 2016, and it looks like this trend has continued – and much more dramatically than I expected. Here are the relative article numbers for the first five weeks of 2016, measured through three different sources – the journals’ own sites, Scopus, and Web of Science.


The journal sites are probably the most accurate measure of what’s been published as of today, and unsurprisingly have the largest number of papers (5965 total). Here we see three similar groups – PLOS One 38%, Scientific Reports 31%, and RSC Advances 31%. (The RSC Advances figure has been adjusted to remove about 450 “accepted manuscripts” nominally dated 2016 – while publicly available, these are simply posted earlier in the process than the other journals would do, and so including them would give an inflated estimate of the numbers actually being published)

Scopus and Web of Science return smaller numbers (2766 and 3499 papers respectively) and show quite divergent patterns – PLOS One is on 36% in Scopus and 52% in Web of Science, with Scientific Reports on 42% and 37%, and RSC Advances on 22% and 11%. It’s not much of a surprise that the major databases are relatively slow to update, though it’s interesting to see that they update different journals at different rates. Scopus is the only one of the three sources to suggest that PLOS One is no longer the largest journal – but for how long?
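
The per-source percentages are just each journal’s count as a share of that source’s total. A trivial helper – the example counts are illustrative only, chosen to be consistent with the 5965-paper journal-site total and its 38/31/31 split, since the exact per-journal counts aren’t quoted above:

```python
def shares(counts):
    """Each journal's rounded percentage of the source's total."""
    total = sum(counts.values())
    return {journal: round(100 * n / total) for journal, n in counts.items()}

# Illustrative counts only, not exact figures.
print(shares({"PLOS One": 2267, "Scientific Reports": 1849, "RSC Advances": 1849}))
```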

Whichever source we use, it seems clear that PLOS One is now no longer massively dominant. There’s nothing wrong with that, of course – in many ways, having two or three competing but comparable large megajournals will be a much better situation than simply having one. And I won’t try and speculate on the reasons (changing impact factor? APC cost? Turnaround time? Shifting fashions?)

It will be very interesting to look at these numbers again in two or three months…

Projects and plans

15th January. A bit late for New Year’s resolutions, and I’m never much of a one for them anyway.

Still, it’s a good time to take stock. What am I hoping to achieve this year? I have omitted the personal aims, as they’re not of great interest to anyone who’s not me, but otherwise, hopefully without overcommitting myself…


2015 was pretty good. I planned a rather complex library move (twice, after the first time was delayed, which is a good way to learn from your mistakes without having actually had to commit them). Two weeks into the new year and ~400 metres of books shifted, it’s looking like it’s actually working, so let’s call that one a conditional success. First order of business: finish it off. And write up some notes on it so that others may learn to not do as I have done.

Secondly, get something published again. I had my first ‘proper’ academic publication in late 2015, and though it’s on a topic that approximately three people care about, I’m still glad it’s done and out there. (I have something to point at next time I’m glibly assured “oh, that approach never happens any more”. This is a recurrent theme in discussions about scholarly publishing; but I digress.) I would recommend it to any academic librarian as an exercise in understanding what your researchers suffer.

(I have a couple of projects on the boil which I’d like to write up properly, of which more anon.)

Thirdly, finish putting together the papers from the 2014 Polar Libraries Colloquy. Call this a public admission that I have been dragging my heels.

Lastly, consider Chartership. I’ve avoided this for many years, seeing it as a rather daunting pile of paperwork, but it’s probably a sensible thing to think about.


Firstly, I’d like to clear off the History of Parliament work on Wikidata. I haven’t really written this up yet (maybe that’s step 1.1) but, in short, I’m trying to get every MP in the History of Parliament database listed and crossreferenced in Wikidata. At the moment, we have around 5200 of them listed, out of a total of 22200 – so we’re getting there. (Raw data here.) Finding the next couple of thousand who’re listed, and mass-creating the others, is definitely an achievable task.

Secondly, and building on this, I did some work in the autumn of 2015 on building a framework for linking EveryPolitician and Wikidata. I need to pick this back up and work out how we can best represent politicians in general – what are the best data structures for things like constituencies, parliamentary terms, parties?

This leads into the third project, which is the general use of Wikidata as a “biographical spine”. Charles Matthews, Magnus Manske, and I have been working on this for a couple of years, and it really is beginning to bear fruit. We’re working to pull together as many large biographical databases as possible, and have them talking to one another through Wikidata, so that we can start bringing data and links from one to the users of another. This certainly won’t be completed in 2016 – but it would be good to write some of it up in a single report so that it’s clear what we’re doing, and hopefully start advertising it to researchers who could benefit.

Fourthly (oh, goodness), the Oxford Dictionary of National Biography. This is a project I embarked on back in 2013; the goal is to get a reliable crossreference between Wikipedia/Wikidata and the ODNB – now complete, mainly thanks to Charles Matthews – and then to turn all the vague, unhelpful “see DNB” citations on Wikipedia into nicely formatted, linkable ones that readers can actually benefit from. This second part is going to take a long time, but I’ve made some rudimentary attempts at auto-predicting the required citations to be fixed by hand, and hopefully we’ll get there in time.

Moving away from Wikidata, early last year I started on what has turned into the Birthdays Project – an attempt to study the way in which people misremember their birthdays when they’re not well documented. This is generally known and the basic result is kind of obvious, but it has only been (very cursorily) discussed in the academic literature before, and I don’t think anyone’s properly attacked it with substantial data, multiple cultural contexts, etc. I wrote up a few notes on this in early 2015 (part 1, part 2), but since then I’ve nailed down some more data, figured out a useful way of visualising it, and so on. No idea if it’s publishable per se, but it would be good to have it written up.
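(For illustration only – the post above doesn’t describe the actual method – one simple way to quantify this kind of misremembering is to look for “heaping” on convenient dates: tabulate the claimed birthdays and compare the share falling on, say, the first of a month against the roughly 12/365 (~3.3%) that a uniform spread would give. A minimal sketch with made-up data:

```python
from collections import Counter

def first_of_month_share(birthdays):
    """Share of claimed birthdays falling on the 1st of any month.

    birthdays: iterable of (month, day) tuples. Under a uniform spread
    this should be about 12/365 (~3.3%); a much larger share suggests
    heaping on convenient dates.
    """
    days = Counter(day for _month, day in birthdays)
    total = sum(days.values())
    return days[1] / total if total else 0.0

# Made-up sample: half the claims land on the 1st of a month.
claimed = [(1, 1), (3, 1), (7, 14), (9, 1), (12, 25), (5, 17)]
share = first_of_month_share(claimed)
```

The same tabulation works for any other attractor date – the 15th, round years, 1 January – by swapping the day being counted.)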

That… looks like a busy year ahead.

Finally, going places and doing things. I have a couple of long-awaited holidays planned, and some people I’m looking forward to seeing on them. I will be going to the Polar Libraries Colloquy in Alaska, but I won’t be going to Wikimania in June – I’ll be elsewhere. It looks to be an excellent event, and I’m sorry to miss it.

Most popular videos on Wikipedia, 2015

One of the big outstanding questions for many years with Wikipedia was the usage data of images. We had reasonably good data for article pageviews, but not for the usage of images – we had to come up with proxies like the number of times a page containing that image was loaded. This was good enough as far as it went, but didn’t (for example) count the usage of any files hotlinked elsewhere.

In 2015, we finally got the media-pageviews database up and running, which means we now have a year’s worth of data to look at. In December, someone produced an aggregated dataset of the year to date, covering video & audio files.

This lists some 540,000 files, viewed a combined total of 2,869 million times over about 340 days – equivalent to roughly 3,080 million over a full year. This covers use on Wikipedia, on other Wikimedia projects, and hotlinked by the web at large. (Note that while we’re historically mostly concerned with Wikipedia pageviews, almost all of these videos will be hosted on Commons.) The top thirty:

14436640 President Obama on Death of Osama bin Laden.ogv
10882048 Bombers of WW1.ogg
10675610 20090124 WeeklyAddress.ogv
10214121 Tanks of WWI.ogg
9922971 Robert J Flaherty – 1922 – Nanook Of The North (Nanuk El Esquimal).ogv
9272975 President Obama Makes a Statement on Iraq – 080714.ogg
7889086 Eurofighter 9803.ogg
7445910 SFP 186 – Flug ueber Berlin.ogv
7127611 Ward Cunningham, Inventor of the Wiki.webm
6870839 A11v 1092338.ogg
6865024 Ich bin ein Berliner Speech (June 26, 1963) John Fitzgerald Kennedy trimmed.theora.ogv
6759350 Editing Hoxne Hoard at the British Museum.ogv
6248188 Dubai’s Rapid Growth.ogv
6212227 Wikipedia Edit 2014.webm
6131081 Newman Laugh-O-Gram (1921).webm
6100278 Kennedy inauguration footage.ogg
5951903 Hiroshima Aftermath 1946 USAF Film.ogg
5902851 Wikimania – the Wikimentary.webm
5692587 Salt March.ogg
5679203 CITIZENFOUR (2014) trailer.webm
5534983 Reagan Space Shuttle Challenger Speech.ogv
5446316 Medical aspect, Hiroshima, Japan, 1946-03-23, 342-USAF-11034.ogv
5434404 Physical damage, blast effect, Hiroshima, 1946-03-13 ~ 1946-04-08, 342-USAF-11071.ogv
5232118 A Day with Thomas Edison (1922).webm
5168431 1965-02-08 Showdown in Vietnam.ogv
5090636 Moon transit of sun large.ogg
4996850 President Kennedy speech on the space effort at Rice University, September 12, 1962.ogg
4983430 Burj Dubai Evolution.ogv
4981183 Message to Scientology.ogv

(Full data is here; note that it’s a 17 MB TSV file)
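Working with a file that size is easy enough from the command line or a short script. Here’s a hedged sketch in Python – it assumes the TSV is two columns, view count then filename, matching the excerpt above; the real file’s layout may differ, so adjust accordingly:

```python
import csv
import io

# Assumed layout: "views<TAB>filename" per line, as in the excerpt above.
# A small inline sample stands in for the real 17 MB file.
sample = (
    "14436640\tPresident Obama on Death of Osama bin Laden.ogv\n"
    "10882048\tBombers of WW1.ogg\n"
    "10675610\t20090124 WeeklyAddress.ogv\n"
)

rows = [(int(views), name)
        for views, name in csv.reader(io.StringIO(sample),
                                      dialect='excel-tab')]

# Total views, plus the same ~340-days-to-a-year scaling used above
# (2,869 million * 365/340 is roughly 3,080 million).
total = sum(views for views, _ in rows)
annualised = total * 365 / 340

# Top entries by view count.
top = sorted(rows, reverse=True)[:10]
```

For the real file, replace `io.StringIO(sample)` with `open('mediacounts.tsv')` (a hypothetical filename) – the rest is unchanged.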

It’s an interesting mix – and every one of the top 30 is a video, not an audio file. I’m not sure there’s a definite theme there – though “public domain history” does well – but it’d reward further investigation…