Our man in Havana (or, Q56761118)

Has it really been a year since I posted here? Oh, dear. Well. So, this Friday/Saturday I went to the FCO’s hackathon event – wonderfully titled “Who Was Our Man In Havana?” – to have a play with a dataset of British diplomats.

My goal was to try and synch this up with Wikidata in some way – there were obviously some overlaps with the MPs project, but given how closely tied the diplomatic service has been into the establishment, it seemed likely there would be a lot of overlap. The objective of the event was to produce some kind of visualisation/interface, so after a bit of discussion with my team-mates we decided to get the data cleaned up, import some into Wikidata, and pull it out again in an enriched fashion.

The data cleaning was a bit of a challenge. Sev and Mohammed, my team-mates, did excellent work hacking away at the XML and eventually produced a nice, elegantly-parsed, version of the source data.

I uploaded this into Magnus’s mix-and-match tool, using a notional ID number which we could tie back to the records. Hammering away at mix-and-match that evening got me about 400 initial matches to work with. While I was doing this, Sev and Mohammed expanded the XML parsing to include all the positions held plus dates, tied back to the notional IDs in mix-and-match.

On Saturday, I wrote a script to pull down the mix-and-match records, line them up with the expanded parsing data, and put that into a form that could be used for QuickStatements. Thankfully, someone had already established a clear data model for diplomatic positions, so I was able to build on that to work out how to handle the positions without having to invent it from scratch.

The upload preparation was necessarily a messily manual process – I ended up compromising with a script generating a plain TSV which I could feed into a spreadsheet and then manually look up (eg) the relevant Wikidata IDs for positions. If I’d had more time we could have put together something which automatically looked up position IDs in a table and then produced a formatted sheet (or even sent it out through something like wikidata-cli), but I wanted a semi-manual approach for this stage so I could keep an eye on the data and check it was looking sensible. (Thanks at this point also to @tagishsimon, who helped with the matching and updating on mix-and-match.) And then I started feeding it in, lump by lump. Behold, success!
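For the curious, the automated version might have looked something like the sketch below. It is not the script I actually ran. The field names (`record_id`, `position`, `start`, `end`) and the lookup table are invented for illustration, and Q4115189 (the Wikidata sandbox item) stands in for real person and position items. The properties are the standard ones: P39 (position held), P580 (start time) and P582 (end time). Dates are assumed to be known only to the year.

```python
# Hypothetical lookup table: position label -> Wikidata item ID.
# Q4115189 is the Wikidata sandbox item, used here as a placeholder.
POSITION_IDS = {
    "Ambassador to Norway": "Q4115189",
}


def year_value(year):
    """Format a year as a QuickStatements time value with year precision (/9)."""
    return f"+{int(year):04d}-00-00T00:00:00Z/9"


def to_quickstatements(rows, matches, position_ids=POSITION_IDS):
    """Build tab-separated QuickStatements (V1) lines.

    rows    -- dicts with record_id, position, start, end (assumed layout)
    matches -- record_id -> person item ID, as confirmed in mix-and-match
    Returns (lines, unmatched); unmatched rows are kept for manual review.
    """
    lines, unmatched = [], []
    for row in rows:
        person = matches.get(row["record_id"])
        position = position_ids.get(row["position"])
        if person is None or position is None:
            unmatched.append(row)
            continue
        parts = [person, "P39", position]
        if row.get("start"):
            parts += ["P580", year_value(row["start"])]
        if row.get("end"):
            parts += ["P582", year_value(row["end"])]
        lines.append("\t".join(parts))
    return lines, unmatched
```

Anything that fails either lookup is set aside rather than guessed at, which keeps the "keep an eye on the data" step in the loop.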

While I was doing this, Mohammed assembled a front-end display, which used vue.js to format and display a set of ambassadors drawn from a Wikidata SPARQL query. It concentrated on a couple of additional things to demonstrate the enrichment available from Wikidata – a picture and some notes of other non-ambassadorial positions they’d held.

To go alongside this, as a demonstration of other linkages that weren’t exposed in our tool, I knocked up a couple of quick visualisations through the Wikidata query tool: a map of where British ambassadors to Argentina were born (mainly the Home Counties and India!), or a chart of where ambassadors/High Commissioners were educated (Eton, perhaps unsurprisingly, making a good showing). It’s remarkable how useful the query service is for whipping up this kind of visualisation.

We presented this on Saturday afternoon and it went down well – we won a prize! A bottle of wine and – very appropriately – mugs with the famed Foreign Office cat on them. A great weekend, even if it did mean an unreasonably early Saturday start!

So, some thoughts on the event in conclusion:

  • It was very clear how well the range of skills worked at an event like this. I don’t think any of us could have produced the result on our own.
  • A lot of time – not just our group, but everyone – was spent parsing and massaging the (oddly structured) XML. Had the main lists been available as a CSV/TSV, this might have been a lot quicker. I certainly wouldn’t have been able to get anywhere with it myself.
  • On the data quality note, we were lucky that the names of records were more or less unique strings, but an ID number for each record inserted when the original XML was generated might have saved a bit of time.
  • A handful of people could go from a flat file of names, positions, dates to about a thousand name-position pairs on Wikidata, some informative queries, and a prototype front-end viewer with a couple of days of work, and some of that could have been bypassed with cleaner initial data. This is really promising for

And on the Wikidata side, there are a few modelling questions this has thrown up:

  • I took the decision not to change postings based on the diplomatic rank – eg someone who was officially the “Minister to Norway” (1905-1942) conceptually held the same post as someone who was “Ambassador to Norway” (1942-2018). If desired, we can represent the rank as a qualifier on the item (eg subject has role: “chargé d’affaires”). This seemed to make the most sense – “ambassadors with a small ‘a’”.
  • The exception to this is High Commissioners, who are currently modelled parallel to Ambassadors – same hierarchy but in parallel. This lets us find all the HCs without simply treating them as “Ambassadors with a different job title”.

    However, this may not be a perfect approach as some HCs changed to Ambassadors and back again (eg Zimbabwe) when a country leaves/rejoins the Commonwealth. At the moment these are modelled by picking one for a country and sticking to it, with the option of qualifiers as above, but a better approach might be needed in the long run.
  • Dates as given are the dates of service. A few times – especially in the 19th century when journeys were more challenging – an ambassador was appointed but did not proceed overseas. These have been imported with no start-end dates, but this isn’t a great solution. Arguably they could have a start/end date in the same year and a qualifier to say they did not take up the post; alternatively, you could make a case that they should not be listed as ambassadors at all.

Lee of Portrush: an introduction

One of the projects I’ve been meaning to get around to for a while is scanning and dating a boxful of old cabinet photographs and postcards produced by Lee of Portrush in the late nineteenth and early twentieth century.

At least five members and three generations of the Lee family worked as professional photographers in this small Northern Irish town – the last of them was my grandfather, William Lee, who carried the business on into the 1970s. Their later output doesn’t turn up much – I don’t think I’ve run across anything post-1920s – but a steady trickle of their older photographs appears on eBay and on family history sites. They produced a range of monochrome and coloured postcards of Portrush and the surrounding area, did a good trade in portrait photographs, and at one point ended up proprietors of (both temperance and non-temperance) hotels. Briefly, one brother decamped to South Africa (before deciding to come home again) and they proudly announced “Portrush, Coleraine, and Cape Town” – a combination rarely encountered. A more unusual line of work, however, was that they had a studio at the Giant’s Causeway.

The Causeway is the only World Heritage Site in Northern Ireland, and was as popular a tourist attraction then as now. A narrow-gauge electric tramline was built out from Portrush to Bushmills and then the Causeway in the 1880s, bringing in a sharp increase in visitors. And – because the Victorians were more or less the same people as we are now – they decided there was no better way to respond to a wonder of the natural world than to have your photograph taken while standing on it, so that you can show it to all your friends. Granted, you had to pay someone to take the photo, sit still with a rictus grin, then wait for them to faff around with wet plates and developer; not quite an iPhone selfie, but the spirit is the same even if the subjects were wearing crinolines. There is nothing new in this world.

The Lees responded cheerfully to this, and in addition to the profitable postcard trade, made a great deal of money by taking photographs of tourists up from Belfast or Dublin, or even further afield. (They then lost it again over the years; Portrush was not a great place for long-term investment once holidays to the Mediterranean became popular.)

Many of these are sat in shoeboxes; some turn up occasionally on eBay, where I buy them if they’re a few pounds. It’s a nice thing to have, since so little else survives of the business. One problem is that very few are clearly dated, and as all parts of the family seem to have used “Lees Studio”, or a variant, it’s not easy to put them in order, or to give a historical context. For the people who have these as genealogical artefacts, this is something of a problem – ideally, we’d be able to say that this particular card style was early, 1880-1890, that address was later, etc., to help give some clues as to when it was taken.

Fast forward a few years. Last November, I had an email from John Kavanaugh, who’d found a Lee photograph of his great-great-grandfather (John Kavanagh, 1822-1904), and managed to recreate the scene on a visit to the Causeway:

Family resemblance, 1895-2015
Courtesy John Kavanaugh/Efren Gonzalez

It’s quite striking how similar the two are. The stone the elder John was sat on has now crumbled, fallen, or been moved, but the rock formations behind him are unchanged. The original photo is dated c. 1895, so this covers a hundred and twenty years and five generations.

So, taking this as a good impetus to get around to the problem, I borrowed a scanner yesterday and set to. Fifty-odd photographs later, I’ve updated the collection on flickr, and over the next few posts I’ll try and draw together some notes on how to date them.

Canadian self-reported birthday data

In the last post, we saw strong evidence for a “memorable date” bias in self-reported birthday information among British men born in the late 19th century. In short, they were disproportionately likely to think they were born on an “important day” such as Christmas.

It would be great to compare it to other sources. However, finding a suitable dataset is challenging. We need a sample covering a large number of men, over several years, and which is unlikely to be cross-checked or drawn from official documentation such as birth certificates or parish registers. It has to explicitly list full birthdates (not just month or year).

WWI enlistment datasets are quite promising in this regard – lots of men, born about the same time, turning up and stating their details without particularly much of a reason to bias individual dates. The main British records have (famously) long since burned, but the Australian and Canadian records survive. Unfortunately, the Australian index does not include dates of birth, but the Canadian index does (at least, when known). So, does it tell us anything?

The index is available as a 770MB+ XML blob (oh, dear). Running this through xmllint produces a nicely formatted file with approximately 575,000 birthdays for 622,000 entries. It’s formatted in such a way as to imply there may be multiple birthdates listed for a single individual (presumably if there’s contradictory data?), but I couldn’t spot any cases. There’s also about ten thousand who don’t have nicely formatted dd/mm/yyyy entries; let’s omit those for now. Quick and dirty but probably representative.
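A rough sketch of the counting step, for anyone wanting to repeat it. The element name `birthdate` is a guess for illustration and not the index's real schema, so adjust it to whatever the file actually uses. It streams the file rather than loading all 770MB at once, keeps only clean dd/mm/yyyy values, and counts everything else (including empty values) as skipped.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Only well-formed dd/mm/yyyy strings are counted; anything else is skipped.
DATE_RE = re.compile(r"^(\d{2})/(\d{2})/(\d{4})$")


def count_birthdays(source, tag="birthdate"):
    """Stream an XML file and count birthdays by (month, day).

    `tag` is a hypothetical element name holding the date of birth.
    Returns (counts, skipped), where skipped is the number of date
    elements that were not in dd/mm/yyyy form.
    """
    counts = Counter()
    skipped = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            match = DATE_RE.match((elem.text or "").strip())
            if match:
                day, month = int(match.group(1)), int(match.group(2))
                counts[(month, day)] += 1
            else:
                skipped += 1
        # Release the element's contents so memory use stays modest.
        elem.clear()
    return counts, skipped
```

Dividing each day's count by the average count per day then gives ratios of the kind quoted in this post, with no seasonal correction.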

And so…

There’s clearly a bit more seasonality here than in the British data (up in spring, down in winter), but also the same sort of unexpected one-day spikes and troughs. As this is quite rough, I haven’t corrected for seasonality, but we still see something interesting.

The highest ten days are: 25 December (1.96), 1 January (1.77), 17 March (1.56), 24 May (1.52), 1 May (1.38), 15 August (1.38), 12 July (1.36), 15 September (1.34), 15 March (1.3).

The lowest ten days are: 30 December (0.64), 30 January (0.74), 30 October (0.74), 30 July (0.75), 30 May (0.78), 13 November (0.78), 30 August (0.79), 26 November (0.80), 30 March (0.81), 12 December (0.81).

The same strong pattern for “memorable days” that we saw with the UK is visible in the top ten – Christmas, New Year, St. Patrick’s, Victoria Day, May Day, [nothing], 12 July, [nothing], [nothing].

Two of these are distinctively “Canadian” – both 24 May (the Queen’s birthday/Victoria Day) and 12 July (the Orange Order marches) are above average in the British data, but not as dramatically as they are here. Both appear to have been relatively more prominent in late-19th/early-20th century Canada than in the UK. Canada Day/Dominion Day (1 July) is above average but does not show up as sharply, possibly because it does not appear to have been widely celebrated until after WWI.

One new pattern is the appearance of the 15th of the month in the top 10. This was suggested as likely in the US life insurance analysis and I’m interested to see it showing up here. Another oddity is leap years – in the British data, 29 February was dramatically undercounted. In the Canadian data, it’s strongly overcounted – just not quite enough to get into the top ten. 28 February (1.28), 29 February (1.27) and 1 March (1.29) are all “memorable”. I don’t have an explanation for this but it does suggest an interesting story.

Looking at the lowest days, we see the same pattern of 30/xx dates being very badly represented – seven of the ten lowest dates are 30th of the month… and all from months with 31 days. This is exactly the same pattern we observed in UK data, and I just don’t have any convincing reason to guess why. The other three dates all fall in low-birthrate months.
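The "seven of ten" claim is easy to check mechanically. This little sketch takes the ten lowest dates listed above and asks which are 30ths, and whether each falls in a 31-day month. The year passed to `calendar.monthrange` is arbitrary, since none of the dates is in February.

```python
import calendar

# The ten lowest days from the list above, as (day, month).
LOWEST_TEN = [
    (30, 12), (30, 1), (30, 10), (30, 7), (30, 5),
    (13, 11), (30, 8), (26, 11), (30, 3), (12, 12),
]


def days_in_month(month, year=1899):
    """Number of days in the month; 1899 is an arbitrary non-leap year."""
    return calendar.monthrange(year, month)[1]


thirtieths = [(d, m) for d, m in LOWEST_TEN if d == 30]
in_31_day_months = [(d, m) for d, m in thirtieths if days_in_month(m) == 31]
others = [(d, m) for d, m in LOWEST_TEN if d != 30]

print(len(thirtieths), len(in_31_day_months), others)
```

The three leftovers are all in November and December.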

So, in conclusion:

  • Both UK and Canadian data from WWI show a strong bias for people to self-report their birthday as a “memorable day”;
  • “Memorable” days are commonly a known and fixed festival, such as Christmas;
  • Overreporting of arbitrary numbers like the 15th of the month is more common in Canada (& possibly the US?) than the UK;
  • The UK and Canadian samples seem to treat 29 February very differently – Canadians overreport, British people underreport;
  • There is a strong bias against reporting the 30th of the month, particularly in months with 31 days.

Thoughts (or additional data sources) welcome.

When do you think you were born?

Back in the last post, we were looking at a sample of dates-of-birth in post-WWI Army records.

(To recap – this is a dataset covering every man who served in the British Army after 1921 and who had a date of birth in or before 1900. 371,716 records in total, from 1864 to 1900, strongly skewed towards the recent end.)

I’d suggested that there was an “echo” of 1914/15 false enlistment in there, but after a bit of work I’ve not been able to see it. However, it did throw up some other very interesting things. Here’s the graph of birthdays.

Two things immediately jump out. The first is that the graph, very gently, slopes upwards. The second is that there are some wild outliers.

The first one is quite simple to explain; this data is not a sample of men born in a given year, but rather those in the army a few decades later. The graph in the previous post shows a very strong skew towards younger ages, so for any given year we’d expect to find marginally more December births than January ones. I’ve normalised the data to reflect this – calculated what the expected value for any given day would be assuming a linear increase, then calculated the ratio of reported to expected births. [For 29 February, I quartered its expected value]
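In code, the normalisation looks roughly like this. It is a sketch and not the exact spreadsheet I used. I am assuming an ordinary least-squares line fitted through the daily totals, with 29 February left out of the fit, and a 366-slot calendar in which 29 February is day 60.

```python
LEAP_DAY = 60  # 29 February in a 366-day calendar (31 + 29)


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def normalise(counts):
    """Ratio of reported to expected births for each day of the year.

    counts -- 366 totals, index 0 = 1 January, index 59 = 29 February.
    """
    days = list(range(1, len(counts) + 1))
    # Fit the trend on ordinary days only, so the leap day cannot drag it down.
    fit_days = [d for d in days if d != LEAP_DAY]
    slope, intercept = linear_fit(fit_days, [counts[d - 1] for d in fit_days])
    ratios = []
    for d in days:
        expected = intercept + slope * d
        if d == LEAP_DAY:
            expected /= 4  # 29 February turns up one year in four
        ratios.append(counts[d - 1] / expected)
    return ratios
```

A ratio of 1.0 means a day has exactly as many reported births as the trend line predicts.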

There are hints at a seasonal pattern here, but not a very obvious one. January, February, October and November are below average, March and September above average, and the rest of the spring-summer is hard to pin down. (For quite an interesting discussion on “European” and “American” birth seasonality, see this Canadian paper)

The interesting bit is the outliers, which are apparent in both graphs.

The most overrepresented days are, in order of frequency, 1 January (1.8), 25 December (1.43), 17 March (1.33), 28 February (1.27), 14 February (1.22), 1 May (1.22), 11 November (1.19), 12 August (1.17), 2 February (1.15), and 10 October (1.15). Conversely, the most underrepresented days are 29 February (0.67 after adjustment), 30 July (0.75), 30 August (0.78), 30 January (0.81), 30 March (0.82), and 30 May (0.84).

Of the ten most common days, seven are significant festivals. In order: New Year’s Day, Christmas Day, St. Patrick’s Day, [nothing], Valentine’s Day, May Day, Martinmas, [nothing], Candlemas, [nothing].

Remember, the underlying bias of most data is that it tells you what people put into the system, not what really happened. So, what we have is a dataset of what a large sample of men born in late nineteenth century Britain thought their birthdays were, or of the way they pinned them down when asked by an official. “Born about Christmastime” easily becomes “born 25 December” when it has to go down on a form. (Another frequent artefact is overrepresentation of 1-xx or 15-xx dates, but I haven’t yet looked for this.) People were substantially more likely to remember a birthday as associated with a particular festival or event than they were to remember a random date.

It’s not all down to being memorable, of course; 1 January is probably in part a data recording artefact. I strongly suspect that at some point in the life of these records, someone’s said “record an unknown date as 1/1/xx”.

The lowest days are strange, though. 29 February is easily explained – even correcting for it being one quarter as common as other days, many people would probably put 28 February or 1 March on forms for simplicity. (This also explains some of the 28 February popularity above). But all of the other five are 30th of the month – and all are 30th of a 31-day month. I have no idea what might explain this. I would really, really love to hear suggestions.

One last, and possibly related, point – each month appears to have its own pattern. The first days of the month are overrepresented; the last days underrepresented. (The exceptions are December and possibly September.) This is visible in both normalised and raw data, and I’m completely lost as to what might cause it…

Back to the Army again

In the winter of 1918-19, the British government found itself in something of a quandary. On the one hand, hurrah, the war was over! Everyone who had signed up to serve for “three years or the duration” could go home. And, goodness, did they want to go home.

On the other hand, the war… well it wasn’t really over. There were British troops fighting deep inside Russia; there were large garrisons sitting in western Germany (and other, less probable, places) in case the peace talks collapsed; there was unrest around the Empire and fears about Bolsheviks at home.

So they raised another army. Anyone in the army who volunteered to re-enlist got a cash payment of £20 to £50 (no small sum in 1919); two months’ leave with full pay; plus comparable pay to that in wartime and a separation allowance if he was married. Demobilisation continued for everyone else (albeit slowly), and by 1921, this meant that everyone in the Army was either a very long-serving veteran, a new volunteer who’d not been conscripted during wartime (so born 1901 onwards) or – I suspect the majority – re-enlisted men working through their few years’ service.

For administrative convenience, all records of men who left up to 1921 were set aside and stored by a specific department; the “live” records, including more or less everyone who reenlisted, continued with the War Office. They were never transferred – and, unlike the pre-1921 records, they were not lost in a bombing raid in 1940.

The MoD has just released an interesting dataset following an FOI request – it’s an index of these “live” service records. The records cover all men in the post-1921 records with a DoB prior to 1901, and thus almost everyone in it would have either remained in service or re-enlisted – there would be a small proportion of men born in 1900 who escaped conscription (roughly 13% of them would have turned 18 just after 11/11/18), and a small handful of men will have re-enlisted or transferred in much later, but otherwise – they all would have served in WWI and chosen to remain or to return very soon after being released.
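The "roughly 13%" figure is just a days-in-the-year sum, sketched here on the simplifying assumption that births were spread evenly through 1900. A man turned 18 after the Armistice if he was born after 11 November 1900.

```python
from datetime import date


def share_born_after(cutoff):
    """Fraction of the cutoff's year falling after the cutoff date,
    assuming births are spread evenly across the year."""
    start = date(cutoff.year, 1, 1)
    end = date(cutoff.year, 12, 31)
    days_in_year = (end - start).days + 1
    days_after = (end - cutoff).days
    return days_after / days_in_year


share = share_born_after(date(1900, 11, 11))
print(f"{share:.1%}")
```

That is 50 days out of 365 (1900 was not a leap year).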

So, what does this tell us? Well, for one thing, there’s almost 317,000 of them. 4,864 were called Smith, 3,328 Jones, 2,104 Brown, 1,172 Black, etc. 12,085 were some form of Mac or Mc. And there are eight Singhs, which looks like an interesting story to trace about early immigrants.

But, you know, data cries out to be graphed. So here’s the dates of birth.

Since the 1900 births are probably an overcount for reenlistments, I’ve left these off.

It’s more or less what you’d expect, but on close examination a little story emerges. Look at 1889/90; there’s a real discontinuity here. Why would this be?

Pre-war army enlistments were not for ‘the duration’ (there was nothing to have a duration of!) but for seven years’ service and five in the reserves. There was a rider on this – if war broke out, you wouldn’t be discharged until the crisis was over. The men born 1890 would have enlisted in 1908 and been due for release to the reserves in 1915. Of course, that never happened… and so, in 1919, many of these men would have been 29, knowing no other career than soldiering. Many would have been thrilled to get out – and quite a few more would have considered it, and realised they had no trade, and no great chance of good employment. As Kipling had it in 1894:

A man o’ four-an’-twenty what ‘asn’t learned of a trade—
Except “Reserve” agin’ him—’e’d better be never made.

It probably wasn’t much better for him in 1919.

Moving right a bit, 1896-97 also looks odd – this is the only point in the data where it goes backwards, with marginally more men born in 1896 than 1897. What happened here?

Anyone born before August 1896 was able to rush off and enlist at the start of the war; anyone born after that date would either have to wait, or lie. Does this reflect a distant echo of people giving false ages in 1914/15 and still having them on the paperwork at reenlistment? More research no doubt needed, but it’s an interesting thought.

Not all encyclopedias are created equal

Wikipedia has some way to go before it can comprehensively replace the great Britannica in all its many roles. From Shackleton’s South, a passage in which he and his crew are stranded on a drifting ice-floe in the Weddell Sea, November 1915:

In addition to the daily hunt for food, our time was passed in reading the few books that we had managed to save from the ship. The greatest treasure in the library was a portion of the “Encyclopaedia Britannica.” This was being continually used to settle the inevitable arguments that would arise. The sailors were discovered one day engaged in a very heated discussion on the subject of Money and Exchange. They finally came to the conclusion that the Encyclopaedia, since it did not coincide with their views, must be wrong.

“For descriptions of every American town that ever has been, is, or ever will be, and for full and complete biographies of every American statesman since the time of George Washington and long before, the Encyclopaedia would be hard to beat. Owing to our shortage of matches we have been driven to use it for purposes other than the purely literary ones though; and one genius having discovered that the paper, used for its pages had been impregnated with saltpetre, we can now thoroughly recommend it as a very efficient pipe-lighter.”

We also possessed a few books on Antarctic exploration, a copy of Browning and one of “The Ancient Mariner.” On reading the latter, we sympathized with him and wondered what he had done with the albatross; it would have made a very welcome addition to our larder.

Carolyn Mayben Flowers: the Lady Prospector of Porcupine

Working my way through some of the Canadian Collection on Commons this morning, I discovered a rather eye-catching picture:

Porcupine's lady prospector (HS85-10-24373)

“Porcupine’s Lady Prospector”, photographed at the Porcupine Gold Rush in the summer of 1911. Two things immediately strike the viewer: one is that the woman in the photograph is dressed decorously by the standards of Edwardian Canada, with a white blouse and a long dark skirt, despite the searing heat of that summer – Porcupine would later be devastated by wildfire – and the second is that she has a revolver slung casually on one hip.

There has to be a story here.

It turns out to be quite quick to put a name to her; the Timmins Daily Press captions a copy of the picture as Carolyn Mayben Flowers, and the Timmins Museum gives her as still around in 1915, giving piano lessons. I haven’t been able to trace her after that, or indeed before. There is a “Cathaline Flowers” in Gowganda (aged 26, married, with a six-year-old daughter), but Gowganda is a long way from Timmins, and she doesn’t list herself as American…

Sitting in a tin can

Randall Munroe, in conclusion, on the Hadfield video:

While that’s far from the most anyone’s paid for a guitar, it’s certainly a lot of money. And if playing music helps the astronauts relax and keep from going crazy while they’re crammed together in a tin can for months at a time, it’s probably a worthwhile investment.

As it happens, NASA – are we surprised? – long ago tested the combination of astronauts, musical instruments, and confined spaces.

When the Apollo 11 crew returned from the moon, they were swiftly locked inside the “Mobile Quarantine Facility”, a pressurised trailer designed to stop any lunar diseases escaping. (They were joined by a doctor and a technician, and presumably everyone carefully ignored the aircraft carrier they would have contaminated en route). After a couple of days, they were transferred to a set of living quarters with twelve other people, and spent the next three weeks waiting to see what happened.

(With impeccable logic, it was ruled that if any of them contracted inexplicable urgent medical problems, they would be transferred out of quarantine and into a hospital, which somewhat defeated the point…)

However… well, three weeks in what was essentially a laboratory. No matter how careful the psychological screening, it’s a daunting thought. “[O]nly meager provision had been made for recreation,” according to the official history; they had a ping-pong table and a television.

And, so, some enterprising genius shipped Neil Armstrong a ukulele. It is not recorded what his colleagues thought of this, but in the picture below they do seem to have an avid interest in the sealed door…


Interior view of Mobile Quarantine Facility with Apollo 11 crewmembers [S69-40210]

On pennies

The BBC has an article on whether or not the UK may end up withdrawing the penny as too small.

What the article apparently has forgotten, when carefully noting the examples of Canada, Australia, Brazil and New Zealand, is that the UK has withdrawn currency for being too small – the halfpenny circulated until the end of 1984 (you still found a couple in the backs of drawers when I was small). (It’s not the only coin to have been withdrawn in living memory; the pre-decimal farthing was withdrawn in 1960 as too small.)

It’s informative to look at how little something had to be worth to be withdrawn then. Using RPI, in 1960, 1/4d was worth £0.0196 (2011 values). In 1984, 1/2p was worth £0.0131 (ditto). The penny is worth substantially less than either earlier coin was at the time of its withdrawal…
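To make the comparison explicit, here is the same sum as a short sketch. It uses only the figures quoted above plus the face values (240 old pence to the pound before decimalisation, 100 new pence after), and backs out the price multiplier each figure implies. No actual RPI series is used here.

```python
# Face values in pounds.
FARTHING = 0.25 / 240   # 1/4d, pre-decimal
HALFPENNY = 0.5 / 100   # 1/2p, decimal
PENNY = 1 / 100         # 1p

# Values at withdrawal in 2011 pounds, as quoted above.
FARTHING_2011 = 0.0196
HALFPENNY_2011 = 0.0131

# Implied price multipliers from the year of withdrawal to 2011.
farthing_multiplier = FARTHING_2011 / FARTHING
halfpenny_multiplier = HALFPENNY_2011 / HALFPENNY

print(f"1960 to 2011: x{farthing_multiplier:.1f}")
print(f"1984 to 2011: x{halfpenny_multiplier:.2f}")
print("Penny worth less than both:", PENNY < min(FARTHING_2011, HALFPENNY_2011))
```

In 2011 money the penny (£0.01) sits below both the farthing and the halfpenny at the point each was withdrawn.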