Open access and the Internet Archive

Late last year, I wanted to find out when the first article was published by F1000 Research. I idly thought, oh, rather than try and decipher their URLs or click “back” through their list of articles fifty times, I’ll go and look at the Internet Archive. To my utter astonishment, they’re not on it. From their robots.txt, buried among a list of (apparently) SEO-related crawler blocks –

User-agent: archive.org_bot
Disallow: /

The Internet Archive is well-behaved, and honours this restriction. Good for them. But putting the restriction there in the first place is baffling – surely a core goal of making articles open-access is to enable distribution, to ensure content is widely spread. And before we say “but of course F1000 won’t go away”, it is worth remembering that of 250 independently-run OA journals in existence in 2002, 40% had ceased publishing by 2013, and almost 10% had disappeared from the web entirely (see Björk et al 2016, table 1). Permanence is not always predictable, and backups are cheap.
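For the curious, the effect of a directive like this can be checked with Python's standard-library robots.txt parser. This is a minimal sketch: the rules below simply reproduce the two lines quoted above (parsed from a string rather than fetched live), and the URL is illustrative.

```python
from urllib.robotparser import RobotFileParser

# The directive quoted above, as a well-behaved crawler would see it.
robots_txt = """\
User-agent: archive.org_bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The IA crawler identifies itself as "archive.org_bot"; for it,
# every path on the site is disallowed. Agents with no matching
# rule fall through to the default, which is to allow.
blocked = parser.can_fetch("archive.org_bot", "https://f1000research.com/articles")
allowed = parser.can_fetch("SomeOtherBot", "https://f1000research.com/articles")
```

Since the only rule is scoped to `archive.org_bot`, `blocked` comes back `False` and `allowed` comes back `True` – the site is closed to the Archive and open to everyone else, which is exactly the oddity at issue.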

Their stated backup policy is that articles (and presumably reviews?) are stored at PMC, Portico, and in the British Library. That’s great. But that’s just the articles. Allowing the IA to index the site costs nothing, provides an extra backup, and ensures that the “context” of the journal – author instructions, for example, or fees – remains available. This can be very important for other purposes – I couldn’t have done my work on Elsevier embargoes without IA copies of odd documents from their website, for example.

And… well, it’s a bit symbolic. If you’re making a great thing of being open, you should take that to its logical conclusion and allow people to make copies of your stuff. Don’t lock it away from indexing and crawling. PLOS One have Internet Archive copies. So do Nature Communications, Scientific Reports, BMJ Open, Open Library of the Humanities, PeerJ. In fact, every prominent all-OA title I’ve checked happily allows this. Why not F1000? Is it an oversight? A misunderstanding? I find it hard to imagine it would be a deliberate move on their part…

Conservation science: open access might not be endangered after all

I was very struck to see this paper this morning: Fuller, R. A., J. R. Lee, and J. E. M. Watson. 2014. “Achieving open access to conservation science”. Conservation Biology 28. doi:10.1111/cobi.12346.

Conservation science is a crisis discipline in which the results of scientific enquiry must be made available quickly to those implementing management. We assessed the extent to which scientific research published since the year 2000 in 20 conservation science journals is publicly available. Of the 19,207 papers published, 1,667 (8.68%) are freely downloadable from an official repository. Moreover, only 938 papers (4.88%) meet the standard definition of open access in which material can be freely reused providing attribution to the authors is given. This compares poorly with a comparable set of 20 evolutionary biology journals, where 31.93% of papers are freely downloadable and 7.49% are open access.

These headline numbers seemed very disappointing – but, after some examination, it seems that the real figure may be substantially higher. Open access isn’t dead yet.

The authors’ definition of “open access” is given as “full” BOAI open-access – that is to say, the final published version made available with minimal restrictions on reuse, usually marked with the CC-BY license or something functionally equivalent. This is not my preference, but fairly reasonable given that “free access” is also considered.

However, their definition of “free access” is substantially more restrictive than the usual “green open access” (free to read but with limited reuse rights). It only covers articles made freely available as the version of record “from the journal’s official archive”:

If we were able to download a paper freely from the journal’s official archive from a private computer not linked to a university network but it did not conform to our definition of open access, we classified it as freely available. Such papers either had additional restrictions attached to them (e.g., excluding commercial reuse or the production of derivatives) or retained all rights and had simply been made freely available online temporarily or permanently by the license holder. We classified all remaining articles as subscription access.

This is a fairly specific requirement. Everything else was deemed unavailable, with an acknowledgement that some might be found in preprint servers:

We did not include access to journal articles via pre-print servers because these do not represent the final published version of the manuscript and can be hard for nonspecialists to navigate, although it is worth noting that preprint servers such as arXiv.org are major repositories of information in several disciplines including physics and mathematics and could play a role in access to conservation science if conservation articles reached a critical mass in such repositories.

Treating this as a divide between “journal archives” and “pre-print servers” entirely omits institutional repositories, which provide a significant amount of green open access material – in most disciplines, substantially more than is available through preprint servers. This will inevitably lead to a significant undercount of the amount of material available to the reader. Unfortunately, the paper’s abstract uses the phrase “freely downloadable from an official repository” – implying that repositories are covered by the scope of the study. (I had to read the paper twice to check that they weren’t.)

The desire for the “final published version” is fair, but a) most readers are satisfied with some form of the text, and b) many copies available from repositories are in fact the final published version. This varies by publisher and title, but I have dealt with papers in both Oryx and Environmental Conservation, both on their shortlist, and know that Cambridge permits posting of the version of record in both subject and institutional repositories.

Finally, a substantial amount of “informal open access” exists, with copies available through authors’ own websites, research group sites, semi-public networks such as ResearchGate, and so on. While these may not always be entirely legitimate, they represent a very substantial number of papers. One study found that around 48% of papers published in 2008 had become free to read somewhere online by 2012, once such informal sources were included.

Put together, it is clear that the 8.68% of “freely downloadable” papers omits a substantial amount of material which could be available to the non-subscribed reader through various means. How much? I don’t know, but I strongly suspect it’s at least as many again…