Projects and plans

15th January. A bit late for New Year’s resolutions, and I’m never much of a one for them anyway.

Still, it’s a good time to take stock. What am I hoping to achieve this year? I have omitted the personal aims, as they’re not of great interest to anyone who’s not me, but otherwise, hopefully without overcommitting myself…


2015 was pretty good. I planned a rather complex library move (twice, after the first time was delayed, which is a good way to learn from your mistakes without having actually had to commit them). Two weeks into the new year and ~400 metres of books shifted, it’s looking like it’s actually working, so let’s call that one a conditional success. First order of business: finish it off. And write up some notes on it so that others may learn to not do as I have done.

Secondly, get something published again. I had my first ‘proper’ academic publication in late 2015, and though it’s on a topic that approximately three people care about, I’m still glad it’s done and out there. (I have something to point at next time I’m glibly assured “oh, that approach never happens any more”. This is a recurrent theme in discussions about scholarly publishing; but I digress.) I would recommend it to any academic librarian as an exercise in understanding what your researchers suffer.

(I have a couple of projects on the boil which I’d like to write up properly, of which more anon.)

Thirdly, finish putting together the papers from the 2014 Polar Libraries Colloquy. Call this a public admission that I have been dragging my heels.

Lastly, consider Chartership. I’ve avoided this for many years, seeing it as a rather daunting pile of paperwork, but it’s probably a sensible thing to think about.


Firstly, I’d like to clear off the History of Parliament work on Wikidata. I haven’t really written this up yet (maybe that’s step 1.1) but, in short, I’m trying to get every MP in the History of Parliament database listed and cross-referenced in Wikidata. At the moment, we have around 5200 of them listed, out of a total of 22200 – so we’re getting there. (Raw data here.) Finding the next couple of thousand who’re already listed, and mass-creating the others, is definitely an achievable task.

Secondly, and building on this, I did some work in the autumn of 2015 on building a framework for linking EveryPolitician and Wikidata. I need to pick this back up and work out how we can best represent politicians in general – what are the best data structures for things like constituencies, parliamentary terms, parties?

This leads into the third project, which is the general use of Wikidata as a “biographical spine”. Charles Matthews, Magnus Manske, and I have been working on this for a couple of years, and it really is beginning to bear fruit. We’re working to pull together as many large biographical databases as possible, and have them talking to one another through Wikidata, so that we can start bringing data and links from one to the users of another. This certainly won’t be completed in 2016 – but it would be good to write some of it up in a single report so that it’s clear what we’re doing, and hopefully start advertising it to researchers who could benefit.

Fourthly (oh, goodness), the Oxford Dictionary of National Biography. This is a project I embarked on back in 2013; the goal is to get a reliable cross-reference between Wikipedia/Wikidata and the ODNB – now complete, mainly thanks to Charles Matthews – and then to turn all the vague, unhelpful “see DNB” citations on Wikipedia into nicely formatted, linkable ones which readers can actually benefit from. This second part is going to take a long time, but I’ve made some rudimentary attempts at auto-predicting the required citations to be fixed by hand, and hopefully we’ll get there in time.

Moving away from Wikidata, early last year I started on what has turned into the Birthdays Project – an attempt to study the way in which people misremember their birthdays when these are not well documented. This is generally known, and the basic result is kind of obvious, but it has only been discussed (very cursorily) in the academic literature before, and I don’t think anyone’s properly attacked it with substantial data, multiple cultural contexts, etc. I wrote up a few notes on this in early 2015 (part 1, part 2), but since then I’ve nailed down some more data, figured out a useful way of visualising it, and so on. No idea if it’s publishable per se, but it would be good to have it written up.

That… looks like a busy year ahead.

Finally, going places and doing things. I have a couple of long-awaited holidays planned, and some people I’m looking forward to seeing on them. I will be going to the Polar Libraries Colloquy in Alaska, but not to Wikimania in June – I’ll be elsewhere, sadly, and sorry to miss what looks to be an excellent event.

Most popular videos on Wikipedia, 2015

One of the big outstanding questions with Wikipedia, for many years, was image usage data. We had reasonably good data for article pageviews, but not for the usage of images – we had to come up with proxies like the number of times a page containing that image was loaded. This was good enough as far as it went, but didn’t (for example) count the usage of any files hotlinked elsewhere.

In 2015, we finally got the media-pageviews database up and running, which means we now have a year’s worth of data to look at. In December, someone produced an aggregated dataset of the year to date, covering video & audio files.

This lists some 540,000 files, viewed an aggregated total of 2,869 million times over about 340 days – equivalent to 3,080 million over a year. This covers use on Wikipedia, on other Wikimedia projects, and hotlinked by the web at large. (Note that while we’re historically mostly concerned with Wikipedia pageviews, almost all of these videos will be hosted on Commons.) The top thirty:

14436640 President Obama on Death of Osama bin Laden.ogv
10882048 Bombers of WW1.ogg
10675610 20090124 WeeklyAddress.ogv
10214121 Tanks of WWI.ogg
9922971 Robert J Flaherty – 1922 – Nanook Of The North (Nanuk El Esquimal).ogv
9272975 President Obama Makes a Statement on Iraq – 080714.ogg
7889086 Eurofighter 9803.ogg
7445910 SFP 186 – Flug ueber Berlin.ogv
7127611 Ward Cunningham, Inventor of the Wiki.webm
6870839 A11v 1092338.ogg
6865024 Ich bin ein Berliner Speech (June 26, 1963) John Fitzgerald Kennedy trimmed.theora.ogv
6759350 Editing Hoxne Hoard at the British Museum.ogv
6248188 Dubai’s Rapid Growth.ogv
6212227 Wikipedia Edit 2014.webm
6131081 Newman Laugh-O-Gram (1921).webm
6100278 Kennedy inauguration footage.ogg
5951903 Hiroshima Aftermath 1946 USAF Film.ogg
5902851 Wikimania – the Wikimentary.webm
5692587 Salt March.ogg
5679203 CITIZENFOUR (2014) trailer.webm
5534983 Reagan Space Shuttle Challenger Speech.ogv
5446316 Medical aspect, Hiroshima, Japan, 1946-03-23, 342-USAF-11034.ogv
5434404 Physical damage, blast effect, Hiroshima, 1946-03-13 ~ 1946-04-08, 342-USAF-11071.ogv
5232118 A Day with Thomas Edison (1922).webm
5168431 1965-02-08 Showdown in Vietnam.ogv
5090636 Moon transit of sun large.ogg
4996850 President Kennedy speech on the space effort at Rice University, September 12, 1962.ogg
4983430 Burj Dubai Evolution.ogv
4981183 Message to Scientology.ogv

(Full data is here; note that it’s a 17 MB TSV file)

It’s an interesting mix – and every one of the top 30 is a video, not an audio file. I’m not sure there’s a definite theme there – though “public domain history” does well – but it’d reward further investigation…
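Incidentally, the annualised figure above is nothing cleverer than a straight scale-up of the ~340-day total – a one-line check:

```shell
# Scale the 340-day aggregate (2,869 million views) up to a full year:
awk 'BEGIN { printf "%.0f million views/year\n", 2869/340*365 }'
```

which comes out at 3080 million, as quoted.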

Freedom of Information – why universities are, and should remain, subject

There has been an awful lot of discussion prompted by the recent Higher Education green paper (Higher education: teaching excellence, social mobility and student choice). The majority of this has focused on the major reforms it proposes to the structure of HE; I am not particularly qualified to comment on this, but I recommend Martin Eve’s ongoing series of responses for discussion of the proposals.

There is one bit, however, which I do feel qualified to comment on – because it happens to be exactly the topic on which I wrote my MSc thesis, some ten years ago. This is the proposal that universities should quietly be exempted from the Freedom of Information Act:

There are a number of requirements placed on HEFCE-funded providers which do not apply to alternative providers. Many derive from treating HEFCE-funded providers as ‘public bodies’. This is despite the fact that the income of nearly all of these providers is no longer principally from direct grant and tuition fee income is not treated as public funding. Alternative providers are not treated as public bodies. As a result there is an uneven playing field in terms of costs and responsibilities. For example, the cost to providers of being within the scope of the Freedom of Information Act is estimated at around £10m per year.
In principle, we want to see all higher education providers subject to the same requirements, and wherever possible we are seeking to reduce burdens and deregulate. However we may wish to consider some exceptions to this general rule if it were in the interest of students and the wider public.
Question 23: Do you agree with the proposed deregulatory measures? Please give reasons for your answer, including how the proposals would change the burden on providers. Please quantify the benefits and/or costs where possible.

Unsurprisingly, many universities are delighted with this suggestion – as any public body would be, if told that a little administrative tweak could remove their Freedom of Information obligations. However, the core problem is that FOI does not work this way; these “deregulatory measures” would have to involve amending the original Freedom of Information Act, which the proposal doesn’t quite seem to realise. And an incidental example in question 23 of an unrelated consultation – especially when a different consultation on FOI has just closed – is a fairly limited basis for making such a move!

What follows is a longer version of my intended response – I will condense it somewhat before submitting in January – but comments are welcome.

There are four problems with the specific proposal to remove Higher Education institutions from the scope of the Freedom of Information Act – i) the legal framework is complex, and a “public funds” test is not the sole issue involved; ii) in any case, institutions would remain publicly funded after these changes; iii) removing institutions from the scope of the Act would not produce a “level playing field” either in the UK or internationally; and iv) all these aside, including institutions in the scope of FOI brings a net benefit to the country.

  1. Firstly, the system by which Higher Education institutions become subject to the Freedom of Information Act 2000 is complex, and does not work as described by the proposal. There is no general “public bodies” test as such. Instead, under the Act, HE institutions can become conditionally subject through receiving HEFCE funds (schedule 1, para 53(1)(b)); through being designated as eligible to receive such funds (53(1)(d)); or through being an associated body (eg a constituent college) of any such institution (53(1)(e)). There is no test for the amount or proportion of income represented by this funding, so the note in para 17 of the proposal that “…the income of nearly all of these providers is no longer principally from direct grant” is moot.
  2. In addition, however, any institution operating within the Further Education sector is automatically subject to the Act (53(1)(a)) as is any institution operated by a Higher Education Corporation (53(1)(c)). These provisions are not conditional and are not affected by their sources of funding. Were all public funds of all kinds to be withdrawn overnight, the Act as it exists would still leave any HEC explicitly subject to FOI.
  3. This sits strangely alongside the general thrust of this section, which is structured around increasing the powers and capabilities of HECs. Removing the link between FOI and HEFCE would exempt one group (predominantly older and more influential institutions) while leaving the other entirely subject to the Act. For example, the University of Oxford would be exempt, but Oxford Brookes University would not. The alternative would be to remove all HE institutions, including HECs, from the scope of the Act – but this is not a proposal raised in the consultation, which has chosen to focus on the argument that public funds are the main driver for FOI applicability.
  4. This leads into the second point, the definition of “public funds”. If we were to accept the position that “public funds” is the key test to determine FOI applicability, it is clear that there would still be substantial public monies channelled into the higher education system after the effects of the ongoing reforms. Tuition fees, though notionally private payments, are supported by a publicly-organised loan scheme. The public purse will underwrite the loans that are used to fund tuition fees, and make good any losses that arise through long-term defaults or writedowns. It is hard to see this as devoid of public involvement.
  5. Meanwhile, the broad outline of public research funding will not substantially change. The government has committed to maintaining the dual support system, and while the review is consulting on how best this can be structured (see eg Questions 24 and 25) it is clear that institutions will continue to receive income in a similar form, from a body which has taken over the existing HEFCE research funding role. This is undeniably public funds, and – importantly – as it currently comes through HEFCE, it would trigger the FOI applicability requirements even were tuition costs to vanish entirely from consideration. Funding from the research councils is also substantial, and again comes from public sources.
  6. There are also other non-trivial (though relatively smaller) sources of public income for HEIs, including grants for providing FE courses, public sector capital spending, income from NHS trusts or local authorities, etc. While perhaps not enough to constitute public funding in and of themselves, they do support the position that, broadly speaking, these institutions remain publicly funded despite the question of tuition fees.
  7. Thirdly, the consultation raised concerns about a “level playing field” among institutions. If HEIs were to be removed wholesale from the 2000 Act, it might or might not materially affect the FOI status of Welsh or Northern Irish universities (who would be covered by a change to the 2000 Act, but have different funding systems), but could not affect the FOI status of Scottish universities (controlled by the Freedom of Information Act (Scotland) 2002) – leading inexorably back to an unequal playing field across the UK.
  8. Internationally, there are similar problems. The position that “public” but not “private” universities should be subject to Freedom of Information regulations is a widely accepted principle in countries ranging from Bulgaria to New Zealand. In 2005, I carried out a study which identified that in 67 countries with FOI-type legislation, 39 included public universities in the scope of the legislation, 27 were unclear, and only one explicitly excluded them – and this one was planning to extend the scope of the law. In the majority of jurisdictions, private universities were not covered, though some countries extended limited FOI powers to certain aspects of their work. Under any reasonable definition, the existing “public” British universities will remain quasi-public institutions. They will continue to receive public funds through various channels, and to be heavily influenced by government policy. If asked, the architects of these proposed reforms would no doubt – emphatically and repeatedly – state that they do not consider it a privatisation, and the university governing bodies would agree. Given this, withdrawing their FOI compliance requirement would be unusual; it would place them in a different legal position to most of their overseas counterparts.
  9. Finally, applying Freedom of Information laws to universities is, and will remain, a net good. The cost to the sector – ultimately borne by the public purse – is minor in comparison to the benefits from transparency and efficiency that FOI can bring. This is true for universities as much as it is for other sectors.
  10. From a national perspective, these bodies are responsible for spending several billion pounds of public money, and for implementing substantial portions of the government’s policies not just on education, but on issues as varied as social inclusion, visitor visas, and industrial development. All of these are matters of substantial public interest. On an individual basis, these bodies can have remarkably broad powers. They regulate employment, housing, and substantial portions of daily life for hundreds of thousands of people. In areas with a very high student population, they can have an impact on their local communities rivalling that of the council! The benefits from public awareness and oversight of these roles are substantial.
  11. One concern raised by universities is that these requests pose a heavy burden on the sector and are often frivolous. It is worth considering some numbers here. In 2013 (a year with a “huge increase” in FOI requests), surveyed institutions received an average of 184 submissions; across the 160 universities in the country (including Scotland), this would suggest a total of around 30,000 submissions. 93% of these queries were handled in good time. 54.4% were disclosed in full, 24.3% were provided in part, and just 8.5% were fully withheld. Only 6.6% were rejected as the information was not held by the institution, and 0.3% rejected as vexatious. The remainder were withdrawn, still in progress, or of unclear status. 1.1% of rejected or partially fulfilled requests prompted a request for an internal review, and slightly over half of these were upheld. Only 0.1% were referred to an external appeal (the Information Commissioner) and exactly half of these were upheld.
  12. These figures suggest that the universities are dealing with their FOI requirements cleanly, sensibly, and in good order – probably better than many other public bodies, and credit to them for it. It does not bear signs of a looming catastrophe. Institutions are disclosing information they are asked for in more than three quarters of cases, indicating that it is material that can and should be publicly available, but has so far required the use of FOI legislation to obtain it. They are not dealing with a substantial number of frivolous requests (in this sample, an average of just five requests per university per year were declined as vexatious or repeated). And, when their actions are challenged and reviewed, the decisions indicate that institutions are striking a reasonable balance between caution and disclosure, and that the enquiries are often reasonable and justified.
  13. It is certainly the case that implementing FOI can be expensive. However, all good records management practice will cost more money than simply ignoring the problem! It is likely that a substantial proportion of the costs currently considered as “FOI compliance” would be required, in any case, to handle compliance with other legislation – such as the Data Protection Act or the Environmental Information Regulations – or to handle routine internal records management work. The quoted figure of £10m per year compliance costs should, thus, be considered with a certain caution – a substantial amount of this money would likely be spent as business as usual without FOI.
  14. FOI has an unusual position here in that it can be dealt with pre-emptively, by transitioning to a policy of routinely publishing information that would otherwise be disclosed on request, and by empowering staff to deal with many non-controversial requests for information as “business as usual” rather than referring them for internal FOI review. For example, it is noticeable that the majority of FOI enquiries relate to “student issues and numbers”. A substantial proportion of these relate to admission statistics, and similar topics; this is information that could easily be routinely and uncontroversially published without waiting for a request, reviewing the request, discussing it internally, and then agreeing to publish.
  15. In conclusion, this proposal i) cannot work as planned; ii) is based on a tenuous and restrictive interpretation of what constitutes a public body; iii) if implemented, will affect some institutions substantially more than it does others; and iv) is, in any case, undesirable as a policy, and would be unlikely to lead to significant savings.
  16. Should a “level playing field” be desired, a far more equitable solution would be to consider extending the scope of the Act to encompass the “private” HE institutions, perhaps in a more limited fashion appropriate to their status. The driving factors which make robust freedom of information regulations important for “public” institutions are no less valid for “private” ones; they carry out a similar quasi-public role and, especially from a student perspective, it seems unreasonable for them to have reduced rights simply due to the legal status of their university. Partially extending the legislation to cover private institutions would be unusual, but not unprecedented, by international standards.

Page and colour charges: they’re still a thing

So, I have a paper out! Very exciting – this is my first ‘proper’ academic publication (and it came out the day after my birthday, so there’s that, too).

Gray, Andrew (2015). Considering Non-Open Access Publication Charges in the “Total Cost of Publication”. Publications 2015, 3(4), 248-262; doi:10.3390/publications3040248

Recent research has tried to calculate the “total cost of publication” in the British academic sector, bringing together the costs of journal subscriptions, the article processing charges (APCs) paid to publish open-access content, and the indirect costs of handling open-access mandates. This study adds an estimate for the other publication charges (predominantly page and colour charges) currently paid by research institutions, a significant element which has been neglected by recent studies. When these charges are included in the calculation, the total cost to institutions as of 2013/14 is around 18.5% over and above the cost of journal subscriptions—11% from APCs, 5.5% from indirect costs, and 2% from other publication charges. For the British academic sector as a whole, this represents a total cost of publication around £213 million against a conservatively estimated journal spend of £180 million, with non-APC publication charges representing around £3.6 million. A case study is presented to show that these costs may be unexpectedly high for individual institutions, depending on disciplinary focus. The feasibility of collecting this data on a widespread basis is discussed, along with the possibility of using it to inform future subscription negotiations with publishers.

The problem

So what’s this all about, then?

We (in the UK particularly) have spent a lot of effort trying to reduce the cost of the scholarly publishing system, which is remarkably high; British university libraries collectively spend £180,000,000 per year on subscriptions, comparable to the entire budget of one of the smaller research councils. The major driver here is open access – trying to make research available to read without charges – and so there has been a lot of interest in trying to arrange matters so that the costs of publishing open access don’t rise faster than the corresponding reduction in subscriptions. The general term for this is the “total cost of publication” (TCP) – ie, the costs of all the parts of the system, including both direct spending and indirect management costs (it’s surprising how much it costs to shuffle paperwork).

This is a sensible goal – it keeps the net cost under control – but the focus on OA costs and subscriptions misses out some other contributions to the balance sheet.

Historically, a lot of the cost of scholarly publishing was borne by authors or their institutions through publication charges – page charges, colour charges, submission charges, and a few other oddities. These became less common (for various reasons, and there’s an interesting history to be written) through the 1980s, and – outside of open-access article processing charges – compulsory publication charges are now rare for most journals in most fields. To many researchers (including a lot of those who’ve helped set OA policy), they simply don’t exist as a significant concern.

However, during 2013–14 it rapidly became apparent to me that my institution was spending a lot of money on page charges, which didn’t fit with what was being reported elsewhere, and didn’t fit with the general recommendations from the funding bodies on how to allocate costs. These charges were not being taken into consideration in the various TCP offsetting schemes, with the effect that we were seeing a lot of spending going direct to publishers, but outside the carefully constructed framework for controlling costs.

The study

I dug back through the recent literature on the costs of journal publishing – there had been a flurry of studies in the early 2000s as people began to work out how to handle OA costs – and tried to determine what the levels of other “publication charges” had been just before OA spending took off. It turned out to be tricky to come up with a firm estimate, but my best guess was that non-OA publication charges were around 3-5% of subscription costs in 2004-5, and had dropped since then. By now (ie 2013/14), it’s probably around 2%, assuming a continual gentle decline.
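For what that “gentle decline” is worth – and this is my assumption, not a figure from the data – halving from a ~4% midpoint in 2004/5 to ~2% in 2013/14 implies a fall of roughly 7% a year:

```shell
# Implied annual rate of decline if charges halved over nine years
# (4% in 2004/5 assumed as the midpoint of the 3-5% estimate):
awk 'BEGIN { printf "%.1f%% per year\n", (1 - exp(log(2/4)/9)) * 100 }'
```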

Firstly, this is quite a lot of money. If British universities spend £180,000,000 per year, then 2% is a further £3,600,000 – comparable to forty or fifty PhD studentships. It’s particularly striking when we bear in mind that this is money many institutions may not realise they are spending.
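(For anyone wanting to check my arithmetic, the headline sums come straight from the £180M subscription baseline:)

```shell
# The paper's headline figures: 11% APCs + 5.5% indirect + 2% other = 18.5%
awk 'BEGIN {
  subs  = 180                            # GBP millions, UK subscription spend
  extra = subs * (0.11 + 0.055 + 0.02)   # 18.5% on top of subscriptions
  printf "extra=%.1fM total=%.1fM other=%.1fM\n", extra, subs + extra, subs * 0.02
}'
```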

Secondly, it’s clear that the cost is distributed very erratically. My own institution spent the equivalent of 15-18% of its subscription budget on non-OA publication charges, driven mainly by very heavy page charges in certain well-used earth sciences journals. (From another angle, Frank Norman has since reported that his institution, in biomedicine, had non-OA publication charges equal to about 10% of subscriptions, and in the early 2000s it was three times that.) Given the disciplinary concentration, it’s likely that spending in universities is similarly patchy – individual departments may have dramatically higher publication costs than the overall average.

Thirdly, this spending is, currently, invisible to policymakers. Of the 29 institutions who provided article-level spending records for 3,721 papers in 2014, only fifteen individual papers could be identified as having page or colour charges (mostly at Leeds), with another ten mentioned in the general reports. Twenty-five papers is clearly not going to get us anywhere near the overall spending estimates. This data isn’t being collected centrally by RCUK/JISC – who are otherwise doing sterling work on tracking APCs – and it’s not clear if it even gets collected centrally by universities. The majority of non-OA publication charges may just disappear into the morass of “miscellaneous spending” in grant budgets.

Where next?

Firstly, we need to get a good idea of what’s actually being spent. My 2% estimate is a pretty wide one – I wouldn’t be surprised if it was 1% or 3%, or further away. The methodology we used was quite time-consuming – effectively identifying every paper with possible charges and chasing the authors to confirm – but it did work. Perhaps a better method, for larger institutions, would be sampling the departments with probable concentrations of page charges, or it might be that some institutions have robust enough finance systems that a lot of cases can be identified with a bit of research. Perhaps we can even obtain this information direct from publishers. Whatever method is used, the existing RCUK/JISC APC reporting infrastructure offers a good way to report it to a central body for aggregation, deduplication, and republication.

Secondly, we need to account for non-OA publication charges as part of the total cost of publication. They are smaller than APCs, but they are very significant for some institutions. While it may not be appropriate to use the same offsetting schemes, if they’re not brought into the equation there will be a risk that publishers are tempted to increase them dramatically – an extra revenue stream which is not capped and controlled in the way that subscriptions and APCs are. There’s no sign that anyone is doing this now – and most of the major commercial publishers no longer use page charges – but it remains a concern.

Lastly – the “more research is needed” section – there are two big questions still outstanding for the total cost of publication, even with this new element added.

  • What about the indirect costs of subscription publishing? We have a good handle on the indirect costs of running repositories and handling OA payments, but we have no idea what the infrastructure to keep a subscription system working costs us. This might include, for example: the cost of staff time to manage subscriptions; the cost of staff time to run authentication and proxy servers; the cash cost of third-party authentication services like Athens; the cost to the publishers of maintaining security barriers; the cost in wasted researcher time trying to obtain material; &c.
  • If everything is expressed as a proportion of subscription spending, how much is that? My £180,000,000 figure is an inflation-adjusted estimate, based on data from SCONUL in 2010/11. More recent SCONUL surveys have been carried out, but the figures have not been published. A firm understanding of how much we actually spend is vital to making sense of these results.

Watching the Antarctic days roll by

[Note: this post embeds some very large gif files. Cancel now if on a slow connection…]

A while ago, I was playing with imagemagick (it’s an amazing tool) and trying to make animated gifs. It worked, sort of. One of the things I’d been meaning to try for a while – but never quite managed – was animating webcam images. Last week, I finally got around to it.

At work, we have a webcam pointed at the Halley VI Antarctic station. It’s turned on year-round, sending back one picture hourly, fairly reliably. Being on a pole in the middle of Antarctica, it’s also free from the major problem that arises when trying to animate webcams – someone moving them around every now and again.

And the pictures are remarkable. Halley VI is an imposing-looking building at the best of times, but on a dark morning, looming out of a snowstorm, it’s like something from a film.

Twenty days in late November 2014 – note the sun tracking by the top of the image each day.

Ten days at the end of January 2015, with 24-hour daylight and a lot of activity around the station.

One shot each day (at 12.30pm UK time, so about 10am? local solar time), chained over 373 days – so slightly more than a full year. It opens in mid-November 2014, about the time the first aircraft arrive and the summer activities begin, passes through the (very busy) summer season, then quietens down as winter approaches. The nights appear as momentary flashes, then get longer and longer until they’re permanently dark in June/July. Then it slowly returns…

The code for this is pretty simple. Assemble all the files in a single directory – either sourced locally or downloaded with wget/curl – and ensure they’re named in a sequential way. All of these, for example, were of the form halley-2015-01-02-12-30.jpg – the 12.30 shot on January 2nd.

Make sure to delete any that returned error messages in the download or are below a certain size. I had one or two zero-content frames that made the system hiccup a bit; find images/ -name '*.jpg' -size 0 -delete is good for handling these.

Then run:

convert -resize 500x500 images/*.jpg animation.gif

That’s it. The resize is to prevent it getting disgustingly large; adding -optimize shaves a little more off the filesize. Even so, though, you’ll find that assembling more than a few hundred frames makes your system quite unhappy (it may lock up) and the resulting gif is far too large to be useful. For the images above, some examples of filters on the merge:

convert -resize 500x500 images/halley-2015-01-2*.jpg animation.gif

convert -resize 500x500 images/*12-30.jpg animation.gif

– so it only pulled together the frames we were interested in. Of course, you could do a simpler (or more complex) merge by copying the relevant ones to a separate directory and just merging everything there.

Given the size problems of gifs, making a larger one is probably best left to video. Here’s the entire year, using every frame (23 MB):

A year at Halley VI

Note how short the day/night pulses get towards the ends of the spring/autumn.

For this, you don’t have to resize, and you can produce it at the full size of the webcam images (in this case, 1920×1080):

mencoder mf://images/*.jpg -mf w=1920:h=1080:fps=25:type=jpg -ovc lavc -lavcopts vcodec=mpeg4:mbd=2:trell -oac copy -o halley.avi

The key parts here are the images list (you can filter again as before) and the fps value. I ran it at various speeds: fps=25 is just a little jerky, and 40fps seemed to be a happy medium. The version above is reduced to 512px wide:

mencoder mf://images/*.jpg -mf w=1920:h=1080:fps=25:type=jpg -vf scale=512:288 -ovc lavc -lavcopts vcodec=mpeg4:mbd=2:trell -oac copy -o halley.avi

Taking pictures with flying government lasers

Well, sort of.

A few weeks ago, the Environment Agency released the first tranche of their LIDAR survey data. This covers (most of) England at resolutions varying from 2m down to 25cm, collected by airborne survey.

It’s great fun. After a bit of back-and-forth (and hastily figuring out how to use QGIS), here’s two rendered images I made of Durham, one with buildings and one without, now on Commons:

The first is shown with buildings, the second without. Both are at 1m resolution, the best currently available for the area. Note in particular the very striking embankment and cutting for the railway viaduct (top left). These look like they could be very useful things to produce for Commons, especially since it’s – effectively – very recent, openly licensed, aerial imagery…

1. Selecting a suitable area

Generating these was, on the whole, fairly easy. First, install QGIS (simplicity itself on a linux machine, probably not too much hassle elsewhere). Then, go to the main data page and find the area you’re interested in. It’s arranged on an Ordnance Survey grid – click anywhere on the map to select a grid square. Major grid squares (Durham is NZ24) are 10km by 10km, and all data will be downloaded in a zip file containing tiles for that particular region.

Let’s say we want to try Cambridge. The TL45 square neatly cuts off North Cambridge but most of the city is there. If we look at the bottom part of the screen, it offers “Digital Terrain Model” at 2m and 1m resolution, and “Digital Surface Model” likewise. The DTM is the version just showing the terrain (no buildings, trees, etc) while the DSM has all the surface features included. Let’s try the DSM, as Cambridge is not exactly mountainous. The “on/off” slider will show exactly what the DSM covers in this area, though in Cambridge it’s more or less “everything”.

While this is downloading, let’s pick our target area. Zooming in a little further will show thinner blue lines and occasional superimposed blue digits; these define the smaller squares, 1 km by 1 km. For those who don’t remember learning to read OS maps, the number on the left and the number on the bottom, taken together, define the square. So the sector containing all the colleges along the river (a dense clump of black-outlined buildings) is TL4458.
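
Putting that together as a quick sanity check – a tile name is just the two grid letters, then the km digits from the bottom edge (easting), then the km digits from the left edge (northing):

```shell
# Build a 1km tile name: grid letters, then km easting, then km northing.
square="TL"
easting="44"    # number read off the bottom edge of the map
northing="58"   # number read off the left edge of the map
tile="${square}${easting}${northing}"
echo "$tile"
```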

2. Rendering a single tile

Now your zip file has downloaded, drop all the files into a directory somewhere. Note that they’re all named something like tl4356_DSM_1m.asc. Unsurprisingly, this means the 1m DSM data for square TL4356.

Fire up QGIS, go to Layer > Add raster layer, and select your tile – in this case, TL4458. You’ll get a crude-looking monochrome image, immediately recognisable by a broken white line running down the middle. This is the Cam. If you’re seeing this, great – everything’s working so far. (This step is a very helpful check that you’re looking at the right area.)

Now, let’s make the image. Project > New to blank everything (no need to save). Then Raster > Analysis > DEM (terrain models). In the first box, select your chosen input file. In the next box, the output filename – with a .tif suffix. (Caution, linux users: make sure to enter or select a path here, otherwise it seems to default to home). Leave everything else as default – all unticked and mode: hillshade. Click OK, and a few seconds later it’ll give a completed message; cancel out of the dialogue box at this point. It’ll be displaying something like this:

Congratulations! Your first LIDAR rendering. You can quit out of QGIS (you can close without saving, your converted file is saved already) and open this up as a normal TIFF file now; it’ll be about 1MB and cover an area 1km by 1km. If you look closely, you can see some surprisingly subtle details despite the low resolution – the low walls outside Kings College, for example, or cars on the Queen’s Road – Madingley Road roundabout by the top left.

3. Rendering several tiles

Rendering multiple squares is a little trickier. Let’s try doing Barton, which conveniently fits into two squares – TL4055 and TL4155. Open QGIS up, and render TL4055 as above, through Raster > Analysis > DEM (terrain models). Then, with the dialogue window still open, select TL4155 (and a new output filename) and run it again. Do this for as many files as you need.

After all the tiles are prepared, clear the screen by starting a new project (again, no need to save) and go to Raster > Miscellaneous > Merge. In “Input files”, select the two exports you’ve just done. In “Output file”, pick a suitable filename (again ending in .tif). Hit OK, let it process, then close the dialog. You can again close QGIS without saving, as the export’s complete.

The rendering system embeds coordinates in the files, which means that when they’re assembled and merged they’ll automatically slot together in the correct position and orientation – no need to manually tile them. The result should look like this:

The odd black bit in the top right is the edge of the flight track – there’s not quite comprehensive coverage. This is a mainly agricultural area, and you can see field markings – some quite detailed, and a few bits on the bottom of the right-hand tile that might be traces of old buildings.

So… go forth! Make LIDAR images! See what you can spot…

4. Command-line rendering in bulk

Richard Symonds (who started me down this rabbit-hole) points out this very useful post, which explains how to do the rendering and merging via the command line. Let’s try the entire Durham area; 88 files in NZ24, all dumped into a single directory –

for i in *.asc ; do gdaldem hillshade -compute_edges $i $i.tif ; done

gdal_merge.py -o NZ24-area.tif *.tif

rm *.asc.tif

In order, that a) runs the hillshade program on each individual source file; b) assembles them into a single giant image file; c) removes the intermediate images (optional, but may as well tidy up). The -compute_edges flag helpfully removes the thin black lines between sectors – I should have turned it on in the earlier sections!

Graphing Shakespeare

Today I came across a lovely project from JSTOR & the Folger Library – a set of Shakespeare’s plays, each line annotated by the number of times it is cited/discussed by articles within JSTOR.

“This is awesome”, I thought, “I wonder what happens if you graph it?”

So, without further ado, here’s the “JSTOR citation intensity” for three arbitrarily selected plays:

Blue is numbers of citations per line; red is no citations. In no particular order, a few things that immediately jumped out at me –

  • basically no-one seems to care about the late middle – the end of Act 2 and the start of Act 3 – of A Midsummer Night’s Dream;
  • “… a tale / told by an idiot, full of sound and fury, / signifying nothing” (Macbeth, 5.5) is apparently more popular than anything else in these three plays;
  • Othello has far fewer “very popular” lines than the other two.

Macbeth has the most popular bits, and is also the most densely cited – only 25.1% of its lines were never cited, against 30.3% in Othello and 36.9% in A Midsummer Night’s Dream.

I have no idea if these are actually interesting thoughts – my academic engagement with Shakespeare more or less reached its high-water mark sixteen years ago! – but I liked them…

How to generate these numbers? Copy-paste the page into a blank text file (text), then use the following bash command to clean it all up –

grep "FTLN " text | sed 's/^.*FTLN/FTLN/g' | cut -b 10- | sed 's/[A-Z]/ /g' | cut -f 1 -d " " | sed 's/text//g' > numberedextracts

Paste into a spreadsheet against a column numbered 1-4000 or so, and graph away…
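
If you’d rather not open a spreadsheet just for the headline numbers, the never-cited percentage can come straight from the cleaned-up file – a sketch, with a tiny stand-in for the real numberedextracts:

```shell
# What fraction of lines were never cited?
# A four-line stand-in file replaces the real numberedextracts here.
printf '3\n0\n1\n0\n' > numberedextracts
awk '{ total++; if ($1 == 0) zero++ }
     END { printf "%.1f%% never cited\n", 100 * zero / total }' numberedextracts
```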

Canadian self-reported birthday data

In the last post, we saw strong evidence for a “memorable date” bias in self-reported birthday information among British men born in the late 19th century. In short, they were disproportionately likely to think they were born on an “important day” such as Christmas.

It would be great to compare it to other sources. However, finding a suitable dataset is challenging. We need a sample covering a large number of men, over several years, which is unlikely to be cross-checked or drawn from official documentation such as birth certificates or parish registers, and which explicitly lists full birthdates (not just month or year).

WWI enlistment datasets are quite promising in this regard – lots of men, born about the same time, turning up and stating their details without particularly much of a reason to bias individual dates. The main British records have (famously) long since burned, but the Australian and Canadian records survive. Unfortunately, the Australian index does not include dates of birth, but the Canadian index does (at least, when known). So, does it tell us anything?

The index is available as a 770MB+ XML blob (oh, dear). Running this through xmllint produces a nicely formatted file with approximately 575,000 birthdays for 622,000 entries. It’s formatted in a way that implies there may be multiple birthdates listed for a single individual (presumably where the sources are contradictory?), but I couldn’t spot any cases. There are also about ten thousand entries without nicely formatted dd/mm/yyyy dates; let’s omit those for now. Quick and dirty, but probably representative.
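
Extracting the well-formed dates doesn’t need a real XML parser – a regex pass is enough for a rough tally of birthdays by day and month. A sketch, with invented element names standing in for the real schema:

```shell
# Pull well-formed dd/mm/yyyy dates out of the formatted XML and tally by day/month.
# The <dob> element name is made up for this demo - the real index has its own schema.
printf '<rec><dob>25/12/1894</dob></rec>\n<rec><dob>abt 1895</dob></rec>\n<rec><dob>25/12/1890</dob></rec>\n' > sample.xml
grep -oE '[0-9]{2}/[0-9]{2}/[0-9]{4}' sample.xml \
  | cut -d/ -f1,2 | sort | uniq -c | sort -rn
```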

And so…

There’s clearly a bit more seasonality here than in the British data (up in spring, down in winter), but also the same sort of unexpected one-day spikes and troughs. As this is quite rough, I haven’t corrected for seasonality, but we still see something interesting.

The highest ten days are: 25 December (1.96), 1 January (1.77), 17 March (1.56), 24 May (1.52), 1 May (1.38), 15 August (1.38), 12 July (1.36), 15 September (1.34), 15 March (1.3).

The lowest ten days are: 30 December (0.64), 30 January (0.74), 30 October (0.74), 30 July (0.75), 30 May (0.78), 13 November (0.78), 30 August (0.79), 26 November (0.80), 30 March (0.81), 12 December (0.81).

The same strong pattern for “memorable days” that we saw with the UK is visible in the top ten – Christmas, New Year, St. Patrick’s, Victoria Day, May Day, [nothing], 12 July, [nothing], [nothing].

Two of these are distinctively “Canadian” – both 24 May (the Queen’s birthday/Victoria Day) and 12 July (the Orange Order marches) are above average in the British data, but not as dramatically as they are here. Both appear to have been relatively more prominent in late-19th/early-20th century Canada than in the UK. Canada Day/Dominion Day (1 July) is above average but does not show up as sharply, possibly because it does not appear to have been widely celebrated until after WWI.

One new pattern is the appearance of the 15th of the month in the top 10. This was suggested as likely in the US life insurance analysis and I’m interested to see it showing up here. Another oddity is leap years – in the British data, 29 February was dramatically undercounted. In the Canadian data, it’s strongly overcounted – just not quite enough to get into the top ten. 28 February (1.28), 29 February (1.27) and 1 March (1.29) are all “memorable”. I don’t have an explanation for this but it does suggest an interesting story.

Looking at the lowest days, we see the same pattern of 30/xx dates being very badly represented – seven of the ten lowest dates are the 30th of the month… and all from months with 31 days. This is exactly the same pattern we observed in the UK data, and I just don’t have any convincing reason to guess why. The other three dates all fall in low-birthrate months.

So, in conclusion:

  • Both UK and Canadian data from WWI show a strong bias for people to self-report their birthday as a “memorable day”;
  • “Memorable” days are commonly a known and fixed festival, such as Christmas;
  • Overreporting of arbitrary numbers like the 15th of the month is more common in Canada (& possibly the US?) than the UK;
  • The UK and Canadian samples seem to treat 29 February very differently – Canadians overreport, British people underreport;
  • There is a strong bias against reporting the 30th of the month, particularly in months with 31 days.

Thoughts (or additional data sources) welcome.

When do you think you were born?

Back in the last post, we were looking at a sample of dates-of-birth in post-WWI Army records.

(To recap – this is a dataset covering every man who served in the British Army after 1921 and who had a date of birth in or before 1900. 371,716 records in total, from 1864 to 1900, strongly skewed towards the recent end.)

I’d suggested that there was an “echo” of 1914/15 false enlistment in there, but after a bit of work I’ve not been able to see it. However, it did throw up some other very interesting things. Here’s the graph of birthdays.

Two things immediately jump out. The first is that the graph, very gently, slopes upwards. The second is that there are some wild outliers.

The first one is quite simple to explain; this data is not a sample of men born in a given year, but rather those in the army a few decades later. The graph in the previous post shows a very strong skew towards younger ages, so for any given year we’d expect to find marginally more December births than January ones. I’ve normalised the data to reflect this – calculated what the expected value for any given day would be assuming a linear increase, then calculated the ratio of reported to expected births. [For 29 February, I quartered its expected value]
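
In other words, with first and last the daily counts at 1 January and 31 December, expected(i) = first + (last − first) × (i − 1)/364 for day i of 365, and the plotted value is observed/expected. A worked example, with made-up numbers:

```shell
# expected(i) = first + (last - first) * (i - 1) / 364  for day i of 365;
# ratio = observed / expected. (For 29 February, quarter the expectation.)
awk 'BEGIN {
  first = 900; last = 1100   # illustrative daily counts for 1 Jan / 31 Dec
  i = 359                    # day-of-year for 25 December
  expected = first + (last - first) * (i - 1) / 364
  observed = 1950            # a hypothetical Christmas spike
  printf "expected %.0f, ratio %.2f\n", expected, observed / expected
}'
```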

There are hints at a seasonal pattern here, but not a very obvious one. January, February, October and November are below average, March and September above average, and the rest of the spring-summer is hard to pin down. (For quite an interesting discussion on “European” and “American” birth seasonality, see this Canadian paper)

The interesting bit is the outliers, which are apparent in both graphs.

The most overrepresented days are, in order of frequency, 1 January (1.8), 25 December (1.43), 17 March (1.33), 28 February (1.27), 14 February (1.22), 1 May (1.22), 11 November (1.19), 12 August (1.17), 2 February (1.15), and 10 October (1.15). Conversely, the most underrepresented days are 29 February (0.67 after adjustment), 30 July (0.75), 30 August (0.78), 30 January (0.81), 30 March (0.82), and 30 May (0.84).

Of the ten most common days, seven are significant festivals. In order: New Year’s Day, Christmas Day, St. Patrick’s Day, [nothing], Valentine’s Day, May Day, Martinmas, [nothing], Candlemas, [nothing].

Remember, the underlying bias of most data is that it tells you what people put into the system, not what really happened. So, what we have is a dataset of what a large sample of men born in late nineteenth century Britain thought their birthdays were, or of the way they pinned them down when asked by an official. “Born about Christmastime” easily becomes “born 25 December” when it has to go down on a form. (Another frequent artefact is overrepresentation of 1-xx or 15-xx dates, but I haven’t yet looked for this.) People were substantially more likely to remember a birthday as associated with a particular festival or event than they were to remember a random date.

It’s not all down to being memorable, of course; 1 January is probably in part a data recording artefact. I strongly suspect that at some point in the life of these records, someone’s said “record an unknown date as 1/1/xx”.

The lowest days are strange, though. 29 February is easily explained – even correcting for it being one quarter as common as other days, many people would probably put 28 February or 1 March on forms for simplicity. (This also explains some of the 28 February popularity above). But all of the other five are 30th of the month – and all are 30th of a 31-day month. I have no idea what might explain this. I would really, really love to hear suggestions.

One last, and possibly related, point – each month appears to have its own pattern. The first days of the month are overrepresented; the last days underrepresented. (The exception is December and possibly September). This is visible in both normalised and raw data, and I’m completely lost as to what might cause it…

Back to the Army again

In the winter of 1918-19, the British government found itself in something of a quandary. On the one hand, hurrah, the war was over! Everyone who had signed up to serve for “three years or the duration” could go home. And, goodness, did they want to go home.

On the other hand, the war… well it wasn’t really over. There were British troops fighting deep inside Russia; there were large garrisons sitting in western Germany (and other, less probable, places) in case the peace talks collapsed; there was unrest around the Empire and fears about Bolsheviks at home.

So they raised another army. Anyone in the army who volunteered to re-enlist got a cash payment of £20 to £50 (no small sum in 1919); two months’ leave with full pay; plus pay comparable to wartime rates and a separation allowance if he was married. Demobilisation continued for everyone else (albeit slowly), and by 1921 this meant that everyone in the Army was either a very long-serving veteran, a new volunteer who’d not been conscripted during wartime (so born 1901 onwards) or – I suspect the majority – re-enlisted men working through their few years’ service.

For administrative convenience, all records of men who left up to 1921 were set aside and stored by a specific department; the “live” records, including more or less everyone who reenlisted, continued with the War Office. They were never transferred – and, unlike the pre-1921 records, they were not lost in a bombing raid in 1940.

The MoD has just released an interesting dataset following an FOI request – it’s an index of these “live” service records. The records cover all men in the post-1921 records with a DoB prior to 1901, and thus almost everyone in it would have either remained in service or re-enlisted – there would be a small proportion of men born in 1900 who escaped conscription (roughly 13% of them would have turned 18 just after 11/11/18), and a small handful of men will have re-enlisted or transferred in much later, but otherwise – they all would have served in WWI and chosen to remain or to return very soon after being released.
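
That “roughly 13%” is easy to check: a man born in 1900 turned 18 after Armistice Day only if he was born after 11 November – the last 50 days of the year:

```shell
# Days from 12 Nov to 31 Dec inclusive (19 + 31 = 50), as a share of the year.
awk 'BEGIN { printf "%.1f%%\n", 100 * 50 / 365 }'
```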

So, what does this tell us? Well, for one thing, there’s almost 317,000 of them. 4,864 were called Smith, 3,328 Jones, 2,104 Brown, 1,172 Black, etc. 12,085 were some form of Mac or Mc. And there are eight Singhs, which looks like an interesting story about early immigrants to trace.

But, you know, data cries out to be graphed. So here’s the dates of birth.

Since the 1900 births are probably an overcount for reenlistments, I’ve left these off.

It’s more or less what you’d expect, but on close examination a little story emerges. Look at 1889/90; there’s a real discontinuity here. Why would this be?

Pre-war army enlistments were not for ‘the duration’ (there was nothing to have a duration of!) but for seven years’ service and five in the reserves. There was a rider on this – if war broke out, you wouldn’t be discharged until the crisis was over. The men born in 1889/90 would have enlisted around 1908 and been due for release to the reserves in 1915. Of course, that never happened… and so, in 1919, many of these men would have been 29, knowing no other career than soldiering. Many would have been thrilled to get out – and quite a few more would have considered it, and realised they had no trade, and no great chance of good employment. As Kipling had it in 1894:

A man o’ four-an’-twenty what ‘asn’t learned of a trade—
Except “Reserve” agin’ him—’e’d better be never made.

It probably wasn’t much better for him in 1919.

Moving right a bit, 1896-97 also looks odd – this is the only point in the data where it goes backwards, with marginally more men born in 1896 than 1897. What happened here?

Anyone born before August 1896 was able to rush off and enlist at the start of the war; anyone born after that date would either have to wait, or lie. Does this reflect a distant echo of people giving false ages in 1914/15 and still having them on the paperwork at reenlistment? More research no doubt needed, but it’s an interesting thought.