Open access and the Internet Archive

Late last year, I wanted to find out when the first article was published by F1000 Research. I idly thought, oh, rather than try and decipher their URLs or click “back” through their list of articles fifty times, I’ll go and look at the Internet Archive. To my utter astonishment, they’re not on it. From their robots.txt, buried among a list of (apparently) SEO-related crawler blocks –

User-agent: archive.org_bot
Disallow: /

The Internet Archive is well-behaved, and honours this restriction. Good for them. But putting the restriction there in the first place is baffling – surely a core goal of making articles open-access is to enable distribution, to ensure content is widely spread. And before we say “but of course F1000 won’t go away”, it is worth remembering that of 250 independently-run OA journals in existence in 2002, 40% had ceased publishing by 2013, and almost 10% had disappeared from the web entirely (see Björk et al 2016, table 1). Permanence is not always predictable, and backups are cheap.
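For anyone who wants to check what those two lines actually do, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It is fed only the excerpt quoted above, not the full robots.txt, and the URL is a made-up placeholder rather than a real article address, so treat it as an illustration of the rule and not a test of the live site.

```python
from urllib.robotparser import RobotFileParser

# Only the two lines quoted above; the real file is longer.
ROBOTS_EXCERPT = """\
User-agent: archive.org_bot
Disallow: /
"""

# Hypothetical URL, used purely for illustration.
EXAMPLE_URL = "https://example.org/articles/1-1"


def is_allowed(user_agent: str, url: str, robots_text: str = ROBOTS_EXCERPT) -> bool:
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The Internet Archive's crawler is shut out of everything...
    print("archive.org_bot:", is_allowed("archive.org_bot", EXAMPLE_URL))
    # ...while a crawler this excerpt does not name is unaffected by it.
    print("SomeOtherBot:", is_allowed("SomeOtherBot", EXAMPLE_URL))
```

Because the rule is `Disallow: /`, every path on the site is covered, which is why nothing at all turns up in the Archive.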

Their stated backup policy is that articles (and presumably reviews?) are stored at PMC, Portico, and in the British Library. That’s great. But that’s just the articles. Allowing the IA to index the site content costs nothing, it provides an extra backup, and it ensures that the “context” of the journal – authorial instructions, for example, or fees – remains available. This can be very important for other purposes – I couldn’t have done my work on Elsevier embargoes without IA copies of odd documents from their website, for example.

And… well, it’s a bit symbolic. If you’re making a great thing of being open, you should take that to its logical conclusion and allow people to make copies of your stuff. Don’t lock it away from indexing and crawling. PLOS One have Internet Archive copies. So do Nature Communications, Scientific Reports, BMJ Open, Open Library of the Humanities, PeerJ. In fact, every prominent all-OA title I’ve checked happily allows this. Why not F1000? Is it an oversight? A misunderstanding? I find it hard to imagine it would be a deliberate move on their part…

Open questions about the costs of the scholarly publishing system

Stuart Lawson (et al)’s new paper on “Opening the Black Box of Scholarly Communication Funding” is now out – it’s an excellent contribution to the discussion and worth a read.

From their conclusion:

The current lack of publicly available information concerning financial flows around scholarly communication systems is an obstacle to evidence-based policy-making – leaving researchers, decision-makers and institutions in the dark about the implications of current models and the resources available for experimenting with new ones.

It prompts me to put together a list I’ve been thinking about for a while – what do we still need to know about the scholarly publishing market?

  • What are the actual totals of author/institutional payments to publishers outside of subscriptions and APCs – page charges, colour charges, submission fees, and so on? I have recently estimated that for the UK this is on the order of a few million pounds per year, but that’s very provisional, and doesn’t include things like reprint payments or delve into the different local practices. All we can say for sure at this stage is “yes, it’s still non-trivial, more work needed”.
  • What are the overall amounts paid by readers to publishers and aggregators for pay-per-view articles? In 2011 I found that (for JSTOR at least) the numbers are vanishingly small. I’ve not seen much other investigation of this, surprisingly – or have I just missed it?
  • Can an overall value be put on the collective “journal support” costs – for example, subsidies from a scholarly society or institution to keep their journal afloat, or grants from funding bodies directly for operating journals? This money fills a gap between subscriptions and publication costs, and is essential to keep many journals operating, but is often skimmed over.
  • How closely do quoted APC prices reflect actual costs paid? After currency fluctuation, VAT, and sometimes offset membership discounting, these can vary widely, which can make it very difficult to anticipate the actual amount which will be invoiced. (A special prize for demonstrating the point here goes to the unnamed publisher who invoices in Euro for a list price in USD, and includes annotations showing a GBP tax calculation.) Reporting tends to be based on actual price paid, which helps, but a lot of policy and theory is based on list-price estimates.
  • How are double-dipping/hybrid offsetting systems working out, now they’ve had a couple of years to bed in? There has been quite a bit of discussion looking at the top-level figures (total subscriptions paid plus total APCs paid) which suggests that the answer is “total amounts paid are still rising”, which is probably correct. However, there’s very little looking in detail at per-journal costs, how the offsets (if any) are calculated, and whether or not the mechanisms used make sense given the relatively low number of hybrid articles in any given journal. Work here could help come up with a standard way of calculating offsets, which could be used in future negotiations. Hybrids won’t be going away any time soon…
  • What contribution to the subscription/publishing charges market comes from outside academia? We tend to focus on university payments (as these are both substantial and reasonably well-documented) but there are very large markets for subscription academic material in, for example, medicine, scientific industry, and law. These are not well understood.

And, finally, the big one:

  • How much does it cost (indirectly/implicitly) to maintain the current subscription-based system? We have a decent idea of how much the indirect costs of gold/green open access are, thanks to recent work on the ‘total cost of publication’, but no idea of the indirect costs of the status quo. And we really, really need to figure it out.

To illustrate that last point, and why I think it’s important…

A large number of librarians (and others) spend much of their time maintaining access systems, handling subscription payments, negotiating usage agreements, fixing user access problems, and so on. Then the publishers themselves have to pay staff to develop and maintain these systems, handle negotiations, deal with payments, etc. Centralised services like JISC’s collective negotiation mean more labour, and some centralised services like ATHENS can be surprisingly expensive to use.

Let’s make a wild guess that it comes down to one FTE staff member per university (it probably isn’t that much work for Chester, but it’s a lot more for Cambridge, so it might balance out); that’s about 130 in the UK. Ten more for all the non-university institutions. Five more for the central services. Five each at the five biggest publishers and another ten for all the others. Total – for our wild estimate – 180 FTE staff. (While the publisher staff aren’t paid by the universities, they’re ultimately paid out of the cost of subscriptions, and so it’s reasonable to consider them part of the overall system cost.)

This number compares interestingly with the 192 FTE that it was estimated would be needed to deal with the administration of making all 140,000 UK research papers gold OA – they’re certainly in the same ballpark, given the wide margins of error. It has substantial implications for any “just switch everything”-type proposals, for obvious reasons, but would also be a very interesting result in and of itself.
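The back-of-envelope sum is easy to lay out explicitly. This is only a sketch of the wild guess above: the category labels are my own shorthand, and every figure is the one quoted in the text, not new data.

```python
# Wild-guess staffing for the subscription system, in full-time equivalents (FTE).
# All numbers are the guesses from the text; the labels are shorthand.
SUBSCRIPTION_FTE = {
    "universities (one each)": 130,
    "non-university institutions": 10,
    "central services": 5,
    "five biggest publishers (five each)": 5 * 5,
    "all other publishers": 10,
}

# Estimate quoted in the text for administering gold OA for all UK papers.
GOLD_OA_FTE = 192
UK_PAPERS_PER_YEAR = 140_000


def total_fte(parts: dict) -> int:
    """Add up the FTE guesses for every part of the system."""
    return sum(parts.values())


if __name__ == "__main__":
    subscription_total = total_fte(SUBSCRIPTION_FTE)
    print("Subscription system guess:", subscription_total, "FTE")
    print("Gold OA estimate:", GOLD_OA_FTE, "FTE")
    print("Ratio:", round(subscription_total / GOLD_OA_FTE, 2))
    print("Papers per gold OA FTE:", round(UK_PAPERS_PER_YEAR / GOLD_OA_FTE))
```

Changing any one guess shows how sensitive the comparison is; doubling the university figure alone would put the subscription side well ahead.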

Shifting of the megajournal market

One of the most striking developments in the last ten years of scholarly publishing, outside, of course, open access itself, was the rise of the “megajournal” – an online-only journal with a very broad remit, no arbitrary size limits, and a low threshold for inclusion.

For many years, the megajournal was more or less synonymous with PLOS One, which peaked in 2013-14 with around 32,000 papers per year, an unprecedented number. The journal began to falter a little in early 2014, and showed a substantial decline in 2015, dropping to a mere (!) 26,000 papers.

One commentator highlighted a point I found very interesting: while PLOS One was shrinking, other megajournals were taking up the slack. The two highlighted here were Scientific Reports (Nature) and RSC Advances (Royal Society of Chemistry) – no others have grown to quite the same extent.

We’re now a month into 2016, and it looks like this trend has continued – and much more dramatically than I expected. Here are the relative article numbers for the first five weeks of 2016, measured through three different sources – the journals’ own sites; Scopus; and Web of Science.


The journal sites are probably the most accurate measure for what’s been published as of today, and unsurprisingly have the largest number of papers (5965 total). Here we see three similar groups – PLOS One 38%, Scientific Reports 31%, and RSC Advances 31%. (The RSC Advances figure has been adjusted to remove about 450 “accepted manuscripts” nominally dated 2016 – while publicly available, these are simply posted earlier in the process than the other journals would do, and so including them would give an inflated estimate of the numbers actually being published.)

Scopus and Web of Science return smaller numbers (2766 and 3499 papers respectively) and show quite divergent patterns – PLOS One is on 36% in Scopus and 52% in Web of Science, with Scientific Reports on 42% and 37%, and RSC Advances on 22% and 11%. It’s not much of a surprise that the major databases are relatively slow to update, though it’s interesting to see that they update different journals at different rates. Scopus is the only one of the three sources to suggest that PLOS One is no longer the largest journal – but for how long?
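To compare the three sources side by side, the quoted totals and percentages can be dropped into a small table. This is a rough sketch: the totals and shares are the ones given above, but the per-journal counts it derives are only approximations implied by rounded percentages, not the underlying data.

```python
# Totals and percentage shares as quoted in the text, one entry per source.
SOURCES = {
    "journal sites": {
        "total": 5965,
        "shares": {"PLOS One": 38, "Scientific Reports": 31, "RSC Advances": 31},
    },
    "Scopus": {
        "total": 2766,
        "shares": {"PLOS One": 36, "Scientific Reports": 42, "RSC Advances": 22},
    },
    "Web of Science": {
        "total": 3499,
        "shares": {"PLOS One": 52, "Scientific Reports": 37, "RSC Advances": 11},
    },
}


def implied_counts(total: int, shares: dict) -> dict:
    """Approximate paper counts implied by rounded percentage shares."""
    return {journal: round(total * pct / 100) for journal, pct in shares.items()}


def largest_journal(shares: dict) -> str:
    """Name of the journal with the biggest share in one source."""
    return max(shares, key=shares.get)


if __name__ == "__main__":
    for name, data in SOURCES.items():
        counts = implied_counts(data["total"], data["shares"])
        print(name, "- largest:", largest_journal(data["shares"]), counts)
```

Since the percentages are rounded, the implied counts could each be off by a few dozen papers, which is fine for eyeballing the trend but not for anything more precise.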

Whichever source we use, it seems clear that PLOS One is now no longer massively dominant. There’s nothing wrong with that, of course – in many ways, having two or three competing but comparable large megajournals will be a much better situation than simply having one. And I won’t try and speculate on the reasons (changing impact factor? APC cost? Turnaround time? Shifting fashions?).

It will be very interesting to look at these numbers again in two or three months…

Page and colour charges: they’re still a thing

So, I have a paper out! Very exciting – this is my first ‘proper’ academic publication (and it came out the day after my birthday, so there’s that, too.)

Gray, Andrew (2015). Considering Non-Open Access Publication Charges in the “Total Cost of Publication”. Publications 2015, 3(4), 248-262; doi:10.3390/publications3040248

Recent research has tried to calculate the “total cost of publication” in the British academic sector, bringing together the costs of journal subscriptions, the article processing charges (APCs) paid to publish open-access content, and the indirect costs of handling open-access mandates. This study adds an estimate for the other publication charges (predominantly page and colour charges) currently paid by research institutions, a significant element which has been neglected by recent studies. When these charges are included in the calculation, the total cost to institutions as of 2013/14 is around 18.5% over and above the cost of journal subscriptions—11% from APCs, 5.5% from indirect costs, and 2% from other publication charges. For the British academic sector as a whole, this represents a total cost of publication around £213 million against a conservatively estimated journal spend of £180 million, with non-APC publication charges representing around £3.6 million. A case study is presented to show that these costs may be unexpectedly high for individual institutions, depending on disciplinary focus. The feasibility of collecting this data on a widespread basis is discussed, along with the possibility of using it to inform future subscription negotiations with publishers.
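The headline figures in the abstract all hang off the subscription estimate, so the arithmetic is worth seeing in one place. A minimal sketch, using only the percentages and the £180 million figure quoted above; the variable and function names are my own.

```python
# Conservative estimate of UK journal subscription spend, in pounds (from the abstract).
SUBSCRIPTION_SPEND = 180_000_000

# Additional costs, each expressed as a percentage of subscription spend.
EXTRA_COSTS_PCT = {
    "APCs": 11.0,
    "indirect costs": 5.5,
    "other publication charges": 2.0,
}


def extra_cost(component: str) -> float:
    """Cash value of one component, in pounds."""
    return SUBSCRIPTION_SPEND * EXTRA_COSTS_PCT[component] / 100


def total_cost_of_publication() -> float:
    """Subscriptions plus every additional component, in pounds."""
    return SUBSCRIPTION_SPEND + sum(extra_cost(name) for name in EXTRA_COSTS_PCT)


if __name__ == "__main__":
    print("Total extra:", sum(EXTRA_COSTS_PCT.values()), "% of subscriptions")
    for name in EXTRA_COSTS_PCT:
        print(f"{name}: about £{extra_cost(name) / 1e6:.1f} million")
    print(f"Total cost of publication: about £{total_cost_of_publication() / 1e6:.0f} million")
```

Everything scales with the subscription figure, so if that estimate is out, all of the cash values move with it while the percentages stay put.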

The problem

So what’s this all about, then?

We (in the UK particularly) have spent a lot of effort trying to reduce the cost of the scholarly publishing system, which is remarkably high; British university libraries collectively spend £180,000,000 per year on subscriptions, comparable to the entire budget of one of the smaller research councils. The major driver here is open access – trying to make research available to read without charges – and so there has been a lot of interest in trying to arrange matters so that the costs of publishing open access don’t rise faster than the corresponding reduction in subscriptions. The general term for this is the “total cost of publication” (TCP) – ie, the costs of all the parts of the system, including both direct spending and indirect management costs (it’s surprising how much it costs to shuffle paperwork).

This is a sensible goal – it keeps the net cost under control – but the focus on OA costs and subscriptions misses out some other contributions to the balance sheet.

Historically, a lot of the cost of scholarly publishing was borne by authors or their institutions through publication charges – page charges, colour charges, submission charges, and a few other oddities. These became less common (for various reasons, and there’s an interesting history to be written) through the 1980s, and – outside of open-access article processing charges – compulsory publication charges are now rare for most journals in most fields. To many researchers (including a lot of those who’ve helped set OA policy), they simply don’t exist as a significant concern.

However, during 2013-14 it became rapidly apparent to me that my institution was spending a lot of money on page charges, which didn’t fit with what was being reported elsewhere, and didn’t fit with the general recommendations from the funding bodies on how to allocate costs. These charges were not being taken into consideration in the various TCP offsetting schemes, with the effect that we were seeing a lot of spending going direct to publishers, but outside the carefully constructed framework for controlling costs.

The study

I dug back through the recent literature on the costs of journal publishing – there had been a flurry of studies in the early 2000s as people began to work out how to handle OA costs – and tried to determine what the levels of other “publication charges” had been just before OA spending took off. It turned out to be tricky to come up with a firm estimate, but my best guess was that non-OA publication charges were around 3-5% of subscription costs in 2004-5, and had dropped since then. By now (ie 2013/14), it’s probably around 2%, assuming a continual gentle decline.

Firstly, this is quite a lot of money. If British universities spend £180,000,000 per year, then 2% is a further £3,600,000 – comparable to forty or fifty PhD studentships. It’s particularly striking when we bear in mind that this is money many institutions may not realise they are spending.

Secondly, it’s clear that the cost is distributed very erratically. My own institution spent the equivalent of 15-18% of its subscription budget on non-OA publication charges, driven mainly by very heavy page charges in certain well-used earth sciences journals. (From another angle, Frank Norman has since reported that his institution, in biomedicine, had non-OA publication charges equal to about 10% of subscriptions, and in the early 2000s it was three times that.) Given the disciplinary concentration, it’s likely that spending in universities is similarly patchy – individual departments may have dramatically higher publication costs than the overall average.

Thirdly, this spending is, currently, invisible to policymakers. Of the 29 institutions who provided article-level spending records for 3,721 papers in 2014, only fifteen individual papers could be identified as having page or colour charges (mostly at Leeds), with another ten mentioned in the general reports. Twenty-five papers is clearly not going to get us anywhere near the overall spending estimates. This data isn’t being collected centrally by RCUK/JISC – who are otherwise doing sterling work on tracking APCs – and it’s not clear if it even gets collected centrally by universities. The majority of non-OA publication charges may just disappear into the morass of “miscellaneous spending” in grant budgets.

Where next?

Firstly, we need to get a good idea of what’s actually being spent. My 2% estimate is a pretty wide one – I wouldn’t be surprised if it was 1% or 3%, or further away. The methodology we used was quite time-consuming – effectively identifying every paper with possible charges and chasing the authors to confirm – but it did work. Perhaps a better method, for larger institutions, would be sampling the departments with probable concentrations of page charges, or it might be that some institutions have robust enough finance systems that a lot of cases can be identified with a bit of research. Perhaps we can even obtain this information direct from publishers. Whatever method is used, the existing RCUK/JISC APC reporting infrastructure offers a good way to report it to a central body for aggregation, deduplication, and republication.

Secondly, we need to account for non-OA publication charges as part of the total cost of publication. They are smaller than APCs, but they are very significant for some institutions. While it may not be appropriate to use the same offsetting schemes, if they’re not brought into the equation there will be a risk that publishers are tempted to increase them dramatically – an extra revenue stream which is not capped and controlled in the way that subscriptions and APCs are. There’s no sign that anyone is doing this now – and most of the major commercial publishers no longer use page charges – but it remains a concern.

Lastly – the “more research is needed” section – there are two big questions still outstanding for the total cost of publication, even with this new element added.

  • What about the indirect costs of subscription publishing? We have a good handle on the indirect costs of running repositories and handling OA payments, but we have no idea what the infrastructure to keep a subscription system working costs us. This might include, for example, things like – the cost of staff time to manage subscriptions; the cost of staff time to run authentication and proxy servers; the cash cost of third-party authentication services like Athens; the cost to the publishers of maintaining security barriers; the cost in wasted researcher time trying to obtain material; &c.
  • If everything is expressed as a proportion of subscription spending, how much is that? My £180,000,000 figure is an inflation-adjusted estimate, based on data from SCONUL in 2010/11. There have been more recent SCONUL surveys, but they have not been published. A firm understanding of how much we actually spend is vital to actually make sense of these results.

Marking authorship in texts

While writing something about Wikipedia, and talking about the idea of traceable attribution of text, I’ve been thinking of ways in which works with multiple discrete authors have displayed the different contributions of those authors.

At one extreme, there’s a fully “collaborative” work – no-one makes a distinction between the two authors, and while they’re named on the title page the writing is implicitly attributed to both. At the other extreme, we have individual chapters or articles – A writes chapter 1, B writes chapter 2, etc., and they may never have known of the other contributors.

In the middle, there are cases where the work is broadly collaborative but with individual elements – the main text is jointly written, but particular contributors sign their own footnotes, sidebar sections, forewords, appendices, etc.

The one that interests me, though, is something I saw in I.S. Shklovsky’s Intelligent Life in the Universe when I read it as a student – I seem to have lost my copy in the intervening ten years, so this is from memory.

The book was originally published in the USSR in the early 1960s, and translated and expanded in English with the aid of Carl Sagan later in the decade. The original text was updated by Sagan, who also added several new chapters; the two then shared drafts, editing “each other’s” sections. Given the political climate, however, they were keen to avoid claiming to be in agreement on some sensitive topics, and so they experimented with explicitly marking the appearance of a single voice in the text itself.

In the end, the result ran something like:

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ▲Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.▼ △Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.▽

Unmarked text was jointly written; black triangles marked remarks by one author, and white triangles by the other. (In at least one place, delightfully, they started arguing.)

So, the question: was this something common in the period that I’ve just never noticed elsewhere? Is there a name for it? What other novel ways of marking authorship have been used?

The encyclopedia anyone can [be told to] edit

A moment of amusement, from the (thankfully) long-distant past:

The Great Soviet Encyclopedia, which contains more than 100,000 entries and fills fifty-one volumes, includes some distortions so flamboyant as to be beyond belief. These are an old story. But such distortions have importance […]

Almost everyone has heard about what happened to Beria in the Encyclopedia. After his liquidation, subscribers were notified, with full instructions, that they should snip out the article about him and insert in its place substitute articles which were duly enclosed, about the Bering Strait and an obscure eighteenth-century statesman named Berholtz. These were the best available substitutes beginning with ‘Ber’. During Stalin’s day when the party line changed on some matter so important that the Encyclopedia itself had to be changed, subscribers were obliged to turn in the volume affected to the party secretary; it was pulped and a new whole volume, cut and patched, was then sent out to the subscriber. Nowadays the reader is allowed to keep the book, and trusted to make the proper emendation himself. Progress!

Another person ‘expelled’ from the Encyclopedia was a Chinese Communist leader, Kao Kang. To replace him, a substitute page went out dealing with a city in Tibet. […] In their haste to make the revision, the editors overlooked the fact that the same Tibetan city also appeared elsewhere in the Encyclopaedia, spelled differently.

— John Gunther, Inside Russia Today (Penguin, 1964).

Authorial inequalities

A recent post in Charlie Stross’s series on misconceptions about publishing (more on which anon, hopefully), has an interesting side-note:

Interestingly, the researchers went on to calculate a Gini coefficient for authors’ incomes … The Gini coefficient among writers in the UK in 2004-05 was a whopping great 0.74.

I felt you could make a dramatic comparison from that, so I went to check the figures. The surprising thing is, though, Gini coefficients that high just don’t usually exist on a national level – there are only one or two countries where we have the data to reasonably conclude it’s as high as 0.7. (Namibia, if you’re wondering). The reason for this is that rural hinterlands tend to reduce the effect of the inequalities of the cities (which are, obviously, where you find both the urban shantytowns and the wealthy metropolitan elite).

Are there, then, specific cities where it’s this bad? Yes. Again, just. The worst cities in the world, by inequality, are the major metropolises of South Africa; even there, it peaks at about 0.75. So, visualise it that way for a second: the population of people in the UK who are paid to write, full-time or part-time, has a level of economic inequality on a par with that of the population of Johannesburg.

It’s quite a staggering image, really. You realise it’s a very sharp differential, but you don’t realise it’s that steep!
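For a feel of what a coefficient in that range means, here is a minimal sketch of the standard Gini calculation over a list of incomes. The sample incomes are invented purely for illustration and are not taken from the authors’ survey or from any national statistics.

```python
def gini(incomes):
    """Gini coefficient of a list of non-negative incomes.

    Uses the usual rank-weighted formula on the sorted values:
    G = 2 * sum(rank * income) / (n * total) - (n + 1) / n
    """
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0 or values[0] < 0:
        raise ValueError("need at least one positive income and no negative ones")
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Made-up examples, for illustration only.
    everyone_equal = [20_000, 20_000, 20_000, 20_000]
    one_takes_all = [0, 0, 0, 100_000]
    print(gini(everyone_equal))  # perfectly equal incomes
    print(gini(one_takes_all))   # one writer in four earns everything
```

In the second toy example, where one writer in four earns all the money and the other three earn nothing, the coefficient works out at 0.75, which gives some sense of how lopsided a real-world 0.74 is.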

Plus ça change

From a Glasgow bookseller writing to The Bookman in February 1895:

…some publishers are doing their utmost to ruin the trade by selling to the drapers, who buy large quantities at reduced prices

(The “drapers” were, of course, the large general retailers. By the 1890s, the term was about as exact as calling Sainsbury’s a grocers.)

That was not the only complaint that could have been lifted straight from last week’s Bookseller. This one from 1905 –

…[the Bookman] was quite relieved to note that recently published children’s books, though dangerously full of humour, were not so absurdly grotesque as in recent years.

Both quotes are from Booksellers and Bestsellers: British Book Sales as Documented by “The Bookman”, 1891-1906 (2001) [JSTOR], a study of the most popular books sold in Britain at the turn of the century. (There were no bestseller lists per se at the time – the bulk of the article was an attempt to retroactively construct one based on returns from booksellers. It is sobering how many of them are completely forgotten…)

Amazon and Macmillan

In an interesting move, Amazon (.com, anyway) recently pulled a large number of books published by Macmillan, or its imprints; this was a reaction to a dispute over how to establish the sale & distribution conditions for ebooks.

(Basically: two big players having a game of chicken, and someone is blinking a bit later than usual. It caused… some entirely justified outcry from the people caught in the middle.)

Charlie Stross has an interesting explanation about the two duelling models of the publishing supply chain here – basically, Amazon trying to grab a slice of the cake that previously went to the publishers.