I’ve been doing a lot of bibliometric work recently. One task that I bounced off a few times before figuring out a practical approach was statistics on first authors; since I’ve finally figured it out, it seemed worth making a note of it. This uses Scopus and some very basic bash shell commands.
Let’s say we want to find out what proportion of papers from the University of York in 2007 had York-affiliated first authors. At first glance, this is a simple problem – Web of Science or Scopus will give you a list of affiliations for each paper, and as far as I know they’re listed in order of appearance; so download that, sort it, count all the ones that start with York, and you’re done.
Unfortunately, you get people with double affiliations. Are there enough of them to be significant? For a small institution, quite possibly. It means we can’t use Web of Science, as their data – while wonderfully sorted and deduplicated against institutions – just says “this paper has these affiliations”.
Scopus, however, associates affiliations to authors. This means that you can reliably pick any given author for a paper and report what their affiliations are. (It also means that you can do some weighting – five authors from X and one from Y may not be the same as one from X and five from Y in your particular scenario).
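For instance, a toy fractional-counting helper (hypothetical – not anything Scopus provides) might split each paper’s credit evenly across its authors:

```python
def fractional_credit(author_affiliations, institution):
    """Share of one paper's credit going to `institution`, splitting the
    paper evenly across its authors. `author_affiliations` holds one
    affiliation string per author, in order."""
    hits = sum(institution in aff for aff in author_affiliations)
    return hits / len(author_affiliations)

# Five authors from X and one from Y: X gets 5/6 of the paper, not 1.
affs = ["University of X"] * 5 + ["University of Y"]
print(fractional_credit(affs, "University of X"))
```

Whether you want full counting, fractional counting, or something in between depends entirely on your particular scenario.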
Log into Scopus and run your search. Export the results, making sure to select “Affiliations” from the menu and CSV as the file type. The export does not cope well with sets of more than 2000 papers, so you may need to subdivide your query carefully. Thankfully, our example has only 1848 results…
The result is a bit messy, because CSVs… well, they’re messy. Let’s convert it into a nice TSV. Create a file called csv2tsv to contain this very short python script, and make it executable with chmod +x csv2tsv:

#!/usr/bin/env python
import csv, sys
for row in csv.reader(sys.stdin):
    print("\t".join(field.replace("\t", " ").replace("\n", " ") for field in row))

It reads CSV on stdin and writes tab-separated lines to stdout, replacing any stray tabs or newlines inside fields with spaces so each paper stays on one line. Then run it:

cat scopus.csv | ./csv2tsv > scopus.tsv
Occasionally you can get papers with ludicrous numbers of authors, all of whom have their affiliations in a single field, and trying to import this into a spreadsheet gets messy – I think the record I had was something like 44k of text in a single name/affiliation field. So we’ll do this all from the command line.
First off, let’s check the file is the right length.
wc -l scopus.tsv should give 1849 – one greater than the expected total because we still have a header row.
Now then, let’s look at the author/affiliation field.
cut -f 15 scopus.tsv will extract this. The thing to note here is that the individual authors are separated by semicolons, while multiple affiliations for the same author are separated only by commas. So if we want to extract the first author, all we need to do is extract everything before the first semicolon –
cut -f 15 scopus.tsv | cut -f 1 -d \;
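To make that structure concrete, here’s the same split on a made-up example field (the names and addresses are invented):

```python
# Authors are separated by ";"; an author's name and affiliation parts
# are separated only by ",". Everything before the first ";" is the
# first author together with all of their affiliations.
field = "Smith J., University of York, York; Jones K., University of Leeds, Leeds"
first_author = field.split(";")[0].strip()
print(first_author)  # Smith J., University of York, York
```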
Now, we want to find out how many of those match York.
cut -f 15 scopus.tsv | cut -f 1 -d \; | grep "University of York" will find all those with the university name in the extracted affiliation; we can count the lines with
cut -f 15 scopus.tsv | cut -f 1 -d \; | grep "University of York" | wc -l
Hey presto – 1200 exactly. Of our 1848 papers in 2007, 1200 (65%) had a York-based first author.
Wait, you cry, that sounds like a pretty impressive number – but how many of those were single-authored papers? We can answer that, too. The first field simply lists all the authors, separated by commas, so any author field containing a comma must have multiple authors.
cut -f 1 scopus.tsv | grep \, | wc -l – and we get 1511.
So of the 1848 papers York published that year, 337 were single-authored. Of the remaining 1511, 863 (57%) were led by York authors.
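The 863 isn’t produced by any single command above – it falls out of the earlier numbers, given that (name variants aside) every single-authored paper in a York affiliation search necessarily has a York first author. Spelled out in python, with the figures from above:

```python
# Figures from the commands above.
total, multi_authored, york_first = 1848, 1511, 1200

single_authored = total - multi_authored        # 337
# Single-authored papers all count towards the 1200 papers with a York
# first author, so the multi-authored York-led papers are the remainder:
york_led_multi = york_first - single_authored   # 863
share = round(100 * york_led_multi / multi_authored)
print(york_led_multi, share)  # 863 57
```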
And while we’re on that topic – how many authors were there, on average? Again, our friend the comma steps in.
cut -f 1 scopus.tsv | sed 's/\,/\n/g' | wc -l replaces every comma with a linebreak, so each author of each paper gets their own line, then counts the result: 8384. As you’ll remember, we still have a header row – and with no grep to filter it out this time, it’s included in the count – so the true figure is 8383. Across 1848 papers, that’s an average of 4.5 authors per paper.
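If you’d rather have the shell do the division as well, an awk one-liner can count and average in one pass (tail -n +2 skips the header row this time; the ', ' separator assumes the Scopus author format of “Surname I., Surname I.”):

```shell
cut -f 1 scopus.tsv | tail -n +2 \
  | awk -F', ' '{n += NF} END {if (NR) printf "%d authors, %.2f per paper\n", n, n/NR}'
```

On our data this should report the same 8383 authors (assuming every comma in the field is an author separator), or about 4.54 per paper.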
Now, the big caveat. Affiliations are free-text addresses. They are entered more or less as they appear on the original paper, so if an author gets part of their institution’s name wrong, the mistake may be perpetuated in the database. There is some standardisation, but it’s not perfect – five 2007 papers turn out to match “univ. of york” but not “university of york”, and so did not make it into our search data. Five of the “University of York” affiliations, on close examination, turn out to match the Canadian York, not the British one. So you need to be cautious. But the broad results are certainly good enough to be going on with!
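A case-insensitive grep with a small alternation is one way to sweep up some of those variants – a sketch, not an exhaustive pattern:

```shell
# Illustrative only: catches the "univ. of york" variant as well, though
# no single pattern will catch every possible garbling of the name.
printf 'University of York\nUniv. of York\nYork University\n' \
  | grep -icE 'univ(ersity|\.) of york'
# prints 2 – the Canadian-style "York University" is not matched
```

Against the real data, you’d swap the printf for the cut -f 15 pipeline from earlier.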