Caffenol

So I did something odd this weekend: I successfully developed film using £8.65 of stuff I bought at Morrisons in a panic.

In true recipe-blog style, the backstory: I got a new lens as an impulse purchase (20mm f/2.8 AF-D, it’s lovely, never used anything that wide before), and since I was off out for a walk thought I’d shoot a roll of film to try it out with. Got home, set up to develop it, found there was no developer. Wait! There was a new bottle of developer.

Neither Iona nor I could open it. We tried for ten minutes before admitting defeat.

But, wait once more! Caffenol! Which I had vaguely wanted to try one day. Rapid googling confirmed, yes, it was achievable. Clock said 22.40. Sprint to Morrisons, and find everything (including their one single bag of washing soda) just before they closed up.

First attempt a bust, but I went out the next day with a replacement film, and after a more careful reading of the instructions, these emerged on Sunday evening:

[Photos: Holloway; police horses; Dennis; queuing; lunch stall]

In terms of the actual process, I used the Caffenol C-M recipe from DigitalTruth with a bit of crosschecking against the Caffenol blog on the complicated issue of what on earth actually constitutes “washing soda”.

The eventual mix I used (for my own reference as much as anything else):

  • 500ml water (only making enough for one film) at ~20C
  • Add 73g “washing soda” – the recipe calls for 27g but the stuff I got is actually “Sodium Carbonate Decahydrate”, which means about 63% by mass is water. Stir thoroughly until it is all dissolved
  • Add 8g vitamin C – I could not *find* pure vitamin C so I bought the highest value supplement I could find, 1000mg effervescent tablets. Made a surprising amount of reddish froth but a good stir dissolved it all neatly. (Each tablet was 4g so this presumably meant 24g of miscellaneous other crap ended up in the mix – probably mostly sugar)
  • Add 20g of the cheapest instant coffee Morrisons had on offer – their own-brand stuff, tucked away at the bottom of a shelf. Stir this, again, until all dissolved.
  • Let the mix stand for a good ten minutes at least. (I did this in a warming bath to make sure it was up to about 20-22C, since it was a cold room)
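The hydrate correction in the washing soda step is easy to check with a little arithmetic. This is a minimal sketch, assuming standard rounded atomic masses; the function name is my own and does not come from either recipe.

```python
# Standard atomic masses in g/mol (rounded).
NA, C, O, H = 22.990, 12.011, 15.999, 1.008

ANHYDROUS = 2 * NA + C + 3 * O          # Na2CO3
WATER = 2 * H + O                       # H2O
DECAHYDRATE = ANHYDROUS + 10 * WATER    # Na2CO3 with ten waters of hydration


def decahydrate_equivalent(anhydrous_grams: float) -> float:
    """Grams of decahydrate that contain the given grams of anhydrous soda."""
    return anhydrous_grams * DECAHYDRATE / ANHYDROUS


water_fraction = 10 * WATER / DECAHYDRATE
print(f"water by mass: {water_fraction:.0%}")                       # 63%
print(f"27g anhydrous = {decahydrate_equivalent(27):.0f}g decahydrate")  # 73g
```

The same function would work for scaling the recipe up to a full litre, where the anhydrous figure doubles.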

Then develop promptly, don’t let it stand too long – I tried to pour carefully to avoid any sediment from the tablets/coffee, but there didn’t seem to be much. For this film, the recommendation was 13m30, initial agitation then a couple of inversions every minute – a bit longer than I’m used to but sure. Drain – delighting in the fact that this is fine to just go down the sink – and rinse with running water as a stop bath. I gave it a little longer than usual for rinsing (maybe 1m30-2m?) since it was still coming out brown-tinged after a minute.

Finally, fix – I was very cautious here as my fixer had ended up a bit ominously tinted after the last botched attempt, and had been sat out all night, plus Bergger recommend a longer fixing time anyway. So I erred wildly on the side of caution and gave it nine minutes.

And it worked! Some very fogged images and I may try some of the variant recipes in future – adding a little potassium bromide or iodised salt helps with this apparently? But it is also possible that when we get better light, trying this with 100 speed film will work much more nicely.

Gender and BLPs on Wikipedia, redux

Back in 2019 I generated some data on how the English Wikipedia’s biographies of living people (BLPs) broke down by gender, and how that intersected with creation and deletion rates. The headline figures were that:

  • There was a significant difference between the gender split for all biographies, at 17.8% women – and for biographies of living people (BLPs), 22.7%.
  • In 2009, around 20% of existing BLPs were on women. As time went on, the share of BLPs on women increased slowly, by perhaps a quarter of a percentage point per year.
  • In 2009, around 20% of newly created BLPs were on women. In about 2012, this kicked up a gear, rising above the long term average – first to about 25%, peaking around 33% before falling back a little.
  • BLP articles on women were more likely to be nominated for deletion until about 2017, when the effect disappeared.

One thing that was raised during the subsequent discussion was that a lot of the skew by gender was potentially linked to subject area – I was able to identify that for athletes (defined broadly, all sports players) the articles were much more likely to be men. I didn’t investigate this too much, though. Last week, I was reminded about this, and I’ve been looking at the numbers again. It brought up two interesting divergences.

Please don’t make me read all this

Okay – but you’ll miss the graphs. In summary:

English Wikipedia has more women in recent cohorts (about 25% of living people born since the seventies) and there are far more men among athletes. Since athletes make up a staggeringly high share of articles among younger subjects, the gender split among non-athletes is much more balanced – a little under a third female overall, but breaking 50% female among the younger cohorts.

Still with me? Let’s start. Sorry about the spoilers.

Time and tide

The first phenomenon is very straightforward: while the overall percentage across all people is around 25% women, how that is distributed over time varies. In general, there is a steady rise until about the 1970s; for those born from the 1970s onwards, the generation who are currently in their active working lives, the level is relatively stable at around 25% women.

The exception is those born in the 1920s (where it sits at 26%) – this is presumably affected by the fact that at this point, female life expectancy is significantly higher than male, and so the proportion of women begins to rise as a result.

One surprising outcome, however, is that the share of living people with no recorded age (green) is much more female than the average. This is a large cohort – there are in fact slightly more women in it than in any individual decade. I believe that it skews young – in other words, were this information known, it would increase the share of women in recent decades – but it is hard to find a way to confirm this. This issue is discussed in more detail below.

(Those born in the 2010s/20s and in the 1900s/10s are omitted – the four groups have a total of 175 articles, while the cohorts shown range from 5,000 to 170,000 – but the levels are around 50%. This is likely due to life expectancy in the oldest cohorts, and the fact that the people in the youngest cohorts are mostly notable at this point as being “the child of someone famous” – which you would broadly expect to be independent of gender.)

The percentages shown here are of the total male + female articles, but it is also possible to calculate the share of people who have a recorded gender that is not male/female. These show a very striking rise over time, though it should be cautioned that the absolute numbers are small – the largest single cohort is the 1980s with 345 people out of 170,000.

Sports by the numbers

The original question was to look at what the effect of athlete articles is on the overall totals. It turns out… very striking.

A few things are immediately apparent. The first is that the share of athletes is very substantial – they account for only around a quarter of people born in the 1950s, but 85-90% of people born in the 1990s/2000s.

The second is that those athletes are overwhelmingly men – among the 1950s cohort, only about 10% of those athletes are female, and even by recent years it is only around 20%. This means that if we look purely at the non-athlete articles, the gender split becomes a lot more balanced.

Across all living non-athletes, it is around 32% female. But among living non-athletes born since 1990, the gender balance is over 50% female.
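To see how much the athlete share drives this, here is a rough back-of-envelope sketch. The inputs are only the rounded figures quoted above for the 1990s cohort (taking 87.5% as an assumed midpoint of 85-90%), so the output is illustrative rather than a measured result.

```python
def non_athlete_female_share(overall_female: float,
                             athlete_share: float,
                             athlete_female: float) -> float:
    """Back out the female share of non-athletes from the blended figures.

    overall = athlete_share * athlete_female
              + (1 - athlete_share) * non_athlete_female
    """
    athletes_part = athlete_share * athlete_female
    return (overall_female - athletes_part) / (1 - athlete_share)


# Rounded figures from the text: about 25% women overall, athletes roughly
# 85-90% of the cohort (midpoint assumed), about 20% of athletes women.
share = non_athlete_female_share(0.25, 0.875, 0.20)
print(f"implied non-athlete female share: {share:.0%}")
```

With those rounded inputs the implied figure lands comfortably above half, which is at least consistent with the graph; small changes in the athlete share move it a lot, because the non-athlete group is so small.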

This is a really amazing figure. I don’t think I ever particularly expected to see a gender analysis on Wikipedia that would break 50%. Granted, the absolute numbers involved are low – as is apparent from the previous graph, “non-athletes born in the 1990s” is around 22,000 people, and “born in the 2000s” is as low as 2,500 – but it’s a pretty solid trend and the total numbers for the earlier decades are definitely large enough for it to be no anomaly.

(Eagle-eyed readers will note that these do not quite align with the numbers in the original linked discussion – those were a couple of points lower in recent decades. I have not quite worked out why, but I think this was an error in the earlier queries; possibly it was counting redirects?)

One last detail to note: the “date missing” cohort comes out over 90% non-athletes. Presumably this is because an athlete’s exact age is often significant and linked in to eg when they start professional sports, so it’s easily publicly available.

Methodology: the thousand word footnote

Feel free to let your eyes glaze over now.

These numbers were constructed mostly using the petscan tool, and leveraging data from both English Wikipedia and Wikidata. From Wikipedia, we have a robust categorisation system for year/decade of birth, and for whether someone is a living person. From Wikidata, we have fairly comprehensive gender data, which Wikipedia doesn’t know about. (It also has dates of birth, but it is more efficient to use WP categories here). So it is straightforward to produce intersection queries like “all living people marked as 1920s births and marked as female” (report). Note that this is crunching a lot of data – don’t be surprised if queries take a minute or two to run or occasionally time out.

To my surprise, the report for “living people known to be female” initially produced a reliable figure, but one for “living people known to be male” produced a figure that was an undercount. (I could validate this by checking against some small categories where I could run a report listing the gender of every item). The root cause seemed to be a timeout in the Wikidata query – I was originally looking for { ?item wdt:P31 wd:Q5 ; wdt:P21 wd:Q6581097 } – items known to be human with gender male. Tweaking this to be simply { ?item wdt:P21 wd:Q6581097 } – items with gender male – produced a reliable figure. I had the same issue when trying to get a total for all items with reported gender – simply { ?item wdt:P21 ?val } works.
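The two query shapes can be put side by side. This is a minimal sketch that builds them as strings, assuming you wanted to send them straight to the public Wikidata Query Service rather than through petscan as I did; the function names are hypothetical, and a bare count over every item may well time out in the same way.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
MALE = "Q6581097"


def gender_pattern(gender_qid: str, require_human: bool) -> str:
    """Return the graph pattern, with or without the 'is a human' triple."""
    triples = []
    if require_human:
        triples.append("?item wdt:P31 wd:Q5 .")         # instance of: human
    triples.append(f"?item wdt:P21 wd:{gender_qid} .")  # sex or gender
    return "{ " + " ".join(triples) + " }"


def count_query(pattern: str) -> str:
    """Wrap a graph pattern in a simple COUNT query."""
    return f"SELECT (COUNT(?item) AS ?n) WHERE {pattern}"


def run_count(query: str) -> int:
    """Send the query to WDQS. Not called below, since it needs the network."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "blp-gender-sketch/0.1 (example)"})
    with urllib.request.urlopen(request, timeout=120) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["n"]["value"])


slow = gender_pattern(MALE, require_human=True)   # the shape that undercounted
fast = gender_pattern(MALE, require_human=False)  # the shape that worked
print(count_query(slow))
print(count_query(fast))
```

Dropping the extra triple is safe here only because the Wikipedia side of the intersection already restricts things to biographies of living people.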

Percentages are calculated as percentage of the number of articles identified as (male + female), rather than of all BLPs with a recorded gender value or simply of all BLPs. There are good arguments for either of the first two, but the former is simpler (some of my “any recorded gender value” queries timed out) and also consistent with the 2019 analysis.
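The three possible denominators can be compared in a few lines. The counts below are made up purely for illustration and are not taken from the real data.

```python
def share(part: int, whole: int) -> float:
    """Simple proportion, kept as a function so the denominators are explicit."""
    return part / whole


# Hypothetical counts for a single cohort (NOT real figures).
female, male, other_gender, no_gender = 250, 750, 2, 20

of_male_plus_female = share(female, male + female)             # used here
of_recorded_gender = share(female, male + female + other_gender)
of_all_blps = share(female, male + female + other_gender + no_gender)

print(f"of M+F:             {of_male_plus_female:.1%}")
print(f"of recorded gender: {of_recorded_gender:.1%}")
print(f"of all BLPs:        {of_all_blps:.1%}")
```

When the "other" and "unrecorded" groups are small relative to the total, as they are in these invented numbers, the three figures differ by well under a percentage point.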

A thornier problem comes from the sports element. There are a number of potential ways we could determine “sportiness”. The easiest option would be to use Wikidata occupation and look for something that indicates their occupation is some form of athlete, or that indicates a sport being played. The problem is that this is too all-encompassing, and would give us people who played sports but for whom it is not their main claim to fame. An alternative is to use the Wikipedia article categorisation hierarchy, but this is very complex and deep, making the queries very difficult to work with. The category hierarchy includes a number of surprise crosslinks and loops, meaning that deep queries tend to get very confusing results, or just time out.

The approach I eventually went with was to use Wikipedia’s infoboxes – the little standardised box on the top right of a page. There are a wide range of distinct infobox templates tailored to specific fields; each article usually only displays one, but can embed elements of others to bring in secondary data. If we look for articles using one of the 77(!) distinct sports infoboxes (report), we can conclude they probably had a significant sporting career. An article that does not contain one can be inferred to not have a sporting background.

But then we need to consider people with significant sports and non-sports careers. For example, the biographies of both Seb Coe and Tanni Grey-Thompson use the “infobox officeholder” to reflect their careers in Parliament being more recent, but it is set up to embed a sports infobox towards the end. This would entail them being counted as athletes by our infobox method. This is probably correct for those two, but there are no doubt people out there where we would draw the line differently. (To stay in the UK political theme: how about Henry McLeish? His athletic career on its own would probably just qualify for a Wikipedia biography, but it is, perhaps, a bit of a footnote compared to being First Minister…)
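The infobox test, including the embedded case, can be sketched against raw wikitext. This is only an illustration under stated assumptions: I actually ran the check through petscan rather than parsing pages, the three template names are a tiny subset of the 77, and both sample articles are invented.

```python
import re

# A tiny illustrative subset; the real list had 77 templates.
SPORTS_INFOBOXES = {
    "infobox sportsperson",
    "infobox football biography",
    "infobox cricketer",
}

# Captures the template name after any "{{", whether top-level or nested.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}\n]+)")


def template_names(wikitext: str) -> set[str]:
    """Normalised names of every template opened in the wikitext."""
    return {" ".join(name.lower().replace("_", " ").split())
            for name in TEMPLATE_NAME.findall(wikitext)}


def counts_as_athlete(wikitext: str) -> bool:
    """True if any sports infobox is used, even embedded in another one."""
    return bool(template_names(wikitext) & SPORTS_INFOBOXES)


# Invented examples: an officeholder with an embedded sports box, and a writer.
politician_with_sport = """{{Infobox officeholder
| name = Example Person
| module = {{Infobox sportsperson
  | embed = yes
  | sport = Athletics
  }}
}}"""
writer = "{{Infobox writer\n| name = Another Example\n}}"

print(counts_as_athlete(politician_with_sport))  # True
print(counts_as_athlete(writer))                 # False
```

Because the check looks at every template on the page, an embedded sports infobox counts exactly as much as a top-level one, which is the behaviour described above for Coe and Grey-Thompson.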

So, here is another complication. How reliable is our assumption that an athlete has a sports infobox, and that non-athletes don’t? If it’s broadly true, great, our numbers hold up. If it’s not, and if it’s not in some kind of systematic way, there might be a more complex skew. I believe that for modern athletes, it’s reasonably safe to assume that infoboxes are nearly ubiquitous; there are groups of articles where they’re less common, but this isn’t one of them. However, I can’t say for sure; it’s not an area I’ve worked intensively in.

Finally, we have the issue of dates. We’ve based the calculation on Wikipedia categories. Wikipedia birth/death categories are pretty reliably used where that data is known. However, about 150k (14%) of our BLP articles are marked “year of birth unknown”, and these are disproportionately female (35.4%).

What effect do these factors have?

Counting the stats as percentage of M+F rather than percentage of all people with recorded gender could be argued either way, but the numbers involved are quite low and do not change the overall pattern of the results.

The infobox question is more complicated. It is possible that we are not picking up all athletes because some do not have infoboxes. On the other hand, it is possible that we are being too expansive in counting people as athletes because they have a “secondary” infobox along the lines of Coe & Grey-Thompson above. The problem there is defining where we draw the line, and what level of “other significance” stops someone being counted. That feels like a very subjective threshold and hard to test for automatically. It is certainly a more conservative test than a Wikidata-based one, at least.

And for dates, hmm. We know that the articles that do not report an age are disproportionately female (35% vs the BLP average of 25%), but also that they are even more disproportionately “not athletes” (7% athletes vs the BLP average of 43%). There are also a lot of articles that don’t report an age; around 14% of all BLPs.

This one probably introduces the biggest question mark here. Depending on how that 14% breaks down, it could change the totals for the year-by-year cohorts; but there’s not really much we can do at the moment to work that out.

Anecdotally, I suspect that they are more likely to skew younger rather than being evenly distributed over time, but there is very little to go on here. However, I feel it is unlikely they would be distributed in such a way as to counteract the overall conclusions – this would require, for example, the female ones being predominantly shifted into older groups and the male ones into younger groups. It’s possible, but I don’t see an obvious mechanism to cause that.

[Edit 4/8/23 – tweaked to confirm that these are English Wikipedia only figures, after a reminder from Hilda. I would be very interested in seeing similar data for other projects, but the methodology might be tricky to translate – eg French and German do not have an equivalent category for indexing living people, and different projects may have quite different approaches for applying infoboxes.]

on chatGPT

I hear the way to start blogposts is with an engaging personal anecdote leading into what you want to write about.

As it happens, a few years ago, I had a blocked toilet. After – I thought – fixing it, I woke the next morning to what seemed like an endless stream of crud welling up from it, spilling out as fast as I could bail it into the bath by hand.

It turned out that there was a blockage in the main drainpipe – wetwipes, it’s always wetwipes – and when our neighbours upstairs had a shower, the inexorable mathematics of gravity meant we got a continuous flow of warm, soapy, raw sewage welling up. And then I had to spend several hundred pounds plus two days of my life on dealing with it.

Which leads me onto ChatGPT. The more I look at it, the more I can’t help but see it as a useful and interesting tool … that is also giving us a direct pipe of warm soapy sewage.

This is a tool that simply does not have a conception of factuality. It produces bland, plausible passages of text and gives them to us, without any way to indicate what is true, what is ambiguous, what is filler. And it has no conception of whether that filler should be meaningful.

For some purposes, this is fine. It can write straightforward code at a really competent level – better than I can. I asked it to generate a short python script for a project I was working on, to align various events and time periods. It generated a functioning sample script, nicely laid out with clear variable names, in less time than I then spent fiddling the parser to use the right date formats. I could probably have worked that out, knowing my own capabilities, in … about two hours.

Code, great. Score one for the machine.

Descriptive text? It writes a perfectly nice descriptive passage on demand:

> give me a short atmospheric description of a castle in Scotland in winter

The castle looms dark and foreboding against the snowy landscape, its stone walls adorned with frost and icicles. The wind howls through the turrets and the bare trees that surround the fortress, creating an eerie, otherworldly atmosphere. Inside, the fire crackles and pops, casting flickering shadows on the walls as the snow continues to fall outside.

It is of course helped by the fact that this castle does not exist, so it just has to describe somewhere plausible sounding.

Once we start nailing it down to the real world a little, it gets shakier. Every place you ask about, if it’s somewhere a traveller might go, you get tourist bumph, because that’s the sort of thing people write about Places Abroad.

[Solan] is located in the foothills of the Himalayas and has a pleasant climate throughout the year. It is a popular destination for tourists who come to enjoy the scenic beauty of the area and to explore the many temples and historical sites in and around the city. Overall, Solan is a beautiful and culturally rich city that offers visitors a unique and authentic experience of northern India.

Anchorage is known for its stunning natural beauty, with majestic mountains and glaciers, pristine lakes and rivers, and abundant wildlife. It is a popular destination for outdoor enthusiasts, who come to hike, fish, ski, and explore the wild and remote landscape. […] Overall, Anchorage is a vibrant and dynamic city that offers visitors a unique and unforgettable experience of Alaska’s natural beauty and cultural richness.

The area around Cluanie is known for its stunning natural beauty, with rugged mountains, vast expanses of moorland, and crystal-clear lochs. It is a popular destination for outdoor enthusiasts, who come to hike, climb, fish, and explore the wild and remote landscape. […] Overall, Cluanie is a beautiful and peaceful destination that offers visitors a chance to experience the natural beauty and rich cultural heritage of the Scottish Highlands.

[Berlin] is home to a diverse and thriving arts and culture scene, with numerous theaters, music venues, and performance spaces showcasing a wide range of genres and styles. Berlin is also known for its vibrant nightlife, with countless bars, clubs, and restaurants offering a diverse and eclectic mix of cuisine and entertainment. […] Overall, Berlin is a vibrant and dynamic city that offers visitors a rich and varied cultural experience, as well as a glimpse into Germany’s rich history and contemporary society.

Overall, Skelmersdale is a modern, vibrant town that offers visitors a range of recreational, shopping, and dining experiences. With its strong community spirit and range of community events, it is a welcoming and inclusive place to explore.

I am sure most of this is true, none of it is really falsifiable, but we’ve also read so much of this non-specific praise over the years that our eyes basically tune it out as background noise. Everywhere has “stunning natural beauty”, is “vibrant” (pace the indignant “Skem is not vibrant!” coming from next door), or if you really want to get fancy, it’s “culturally rich”.

You can get it to tell you some places are not great – but even then they have “good transport links” or, for somewhere really suffering, “While the town faces many challenges, it is home to a strong community spirit and a range of community initiatives aimed at improving the lives of its residents.” Got to accentuate the positive! Is it true? Well, I guess everywhere has a range of community initiatives.

Moving onto things where more of the statements are concrete and falsifiable – an actual person with an actual life who did actual things – it really starts hiccuping. I asked it to tell me about someone reasonably obscure, but where a reasonably good Wikipedia article currently exists (and has done for some time) –

Ernest Brooks (1878-1957) was a British photographer and a pioneer of photojournalism. He is known for his photographs taken during World War I, which documented the life of soldiers on the front lines.

Brooks was born in Worcestershire, England, and began his career as a photographer in 1900, working as an assistant to the portrait photographer Elliott & Fry. In 1902, he joined the Royal Navy as a photographer, where he gained experience in capturing naval scenes.

During World War I, Brooks was appointed as the official photographer for the British Army’s Western Front. He captured powerful images of soldiers in the trenches, the devastation of the battlefield, and the impact of war on civilians. His photographs were widely published in newspapers and magazines, and helped to shape public opinion about the war.

After the war, Brooks continued to work as a photographer and was commissioned to photograph the coronation of King George VI in 1937. He was also a founding member of the Royal Photographic Society and served as its president from 1941 to 1944.

Today, Brooks’ photographs from World War I are considered some of the most important images of the conflict, providing a powerful and moving insight into the experiences of soldiers and civilians alike.

It’s not bad – the core of it, his WWI work, is basically correct (though I’m not sure about some of the superlatives) but paragraphs two and four are … nonsense. Completely made up. I asked it to regenerate the article four times, and got these bits around the WWI section (which was mostly consistent):

Brooks began his career as a photographer in 1900, working for the Daily Mirror newspaper. He covered a variety of events and subjects, including royal weddings and funerals, but he became particularly interested in photojournalism during the First World War. […] After the war, Brooks continued to work as a photographer, covering events such as the coronation of King George VI and the wedding of Prince Albert (later King George VI) and Lady Elizabeth Bowes-Lyon. He also took portraits of famous people, including Albert Einstein and Mahatma Gandhi.

Brooks was born in Hampshire, England and studied photography in London before joining the Royal Navy as a photographer in 1901. […] After the war, Brooks continued to work as a photographer, producing images of famous figures such as Winston Churchill and Mahatma Gandhi. He also worked as a cinematographer, filming documentaries and newsreels.

Brooks was born in Worcestershire, England in 1876 and began his career as a photographer in the early 1900s. […] After the war, Brooks continued to work as a photographer and founded the Ernest Brooks Studio in London. He was awarded the Order of the British Empire for his services to photography in 1920.

Brooks began his career as a photographer in the 1890s, working for several newspapers and magazines in London. In 1908, he was appointed as the official photographer to the Royal Family, and he remained in this position until 1916. […]  After the war, Brooks continued to work as a photographer, covering events such as the coronation of King George VI and the funeral of King George V. He was also a member of the Royal Photographic Society and served as its president from 1932 to 1934.

The third of these is almost OK, but it is noticeable that all of these are very different – and also generally quite distinct from his actual post-war career, which saw him sacked from his royal post, go bankrupt, and then quietly disappear. But all five options are presented confidently, without any caveating, and they all sound absolutely plausible. None of them seem to be confusing him with someone else of the same name – they’re all the sort of careers he might have had, but didn’t.

If we ask it to name some of his work, it gives me the names of things that sound like a WWI photograph might be called – “A Group of Soldiers” – This photograph shows a group of soldiers huddled together in a trench, and it has been described as “one of the most powerful images of the war.”  … “A Working Party of the Middlesex Regiment” (1917): This image shows a group of soldiers carrying supplies and equipment through a devastated landscape.

Those photographs don’t exist – I’m sure by happy coincidence there are photographs in his work that fit those descriptions, he was prolific and they’re fairly standard imagery, but his photos were not usually named and I cannot find any reference to these under those names. It’s just generating words that feel like they fit into the right places in the story it’s been asked to tell. 

I tried it on something a little more well-known; the early life of Tony Blair. In this case, the five options more or less agreed with each other, though occasionally a little erratic on the details – but it gets absolutely hung up on one point.

…where he was also a member of the Oxford University Labour Club

…at Oxford, he became involved in politics and joined the Labour Party

…joining the Labour Party and later serving as the president of the Oxford University Labour Club.

…where he was a member of the Oxford University Labour Club and became interested in left-wing politics.

…where he was elected president of the Oxford University Conservative Association.

It is very striking that this one point which it gets reliably wrong is the one which should be right for most people like him – many people who become politicians have this sort of backstory (and it usually is OULC/OUCA…). However, Blair didn’t – he didn’t join the student groups, and he didn’t join the party until he left university and moved to London.

It’s not a very important detail, but it’s illuminating – it’s not got anything specific to say about his time at Oxford, but it knows what you usually say in the bit between “studied at Oxford” and “elected to Parliament” so it plugs it in.

Let’s now try it on something with a straight (if obscure) answer. Since I’d just been looking it up for another purpose – 

> what were the start and end dates of all parliamentary sessions in the UK since 1900

The start and end dates of parliamentary sessions in the UK since 1900 are as follows:

  • 1900-1901: 6th February 1900 to 21st December 1900
  • 1901-1902: 14th February 1901 to 12th December 1901
  • 1902-1903: 6th February 1902 to 19th December 1902
  • 1903-1904: 12th February 1903 to 17th December 1903
  • 1904-1905: 10th February 1904 to 21st December 1904

[regenerated]

The UK Parliament has had multiple sessions throughout the years since 1900. Here are the start and end dates for each parliamentary session:

  • 1900: 6 February – 19 December
  • 1901: 15 February – 4 December
  • 1902: 11 February – 11 December
  • 1903: 10 February – 18 December
  • 1904: 9 February – 16 December

It’s a plausible pattern, but none of those dates are correct in either version. But, again, there’s no caveating and no indication you might ask it twice and get completely different answers. There’s no opportunity for filler here, so we don’t really have anything that’s “vaguely right” in the way we did before – save that those are mostly about the right time of year. Again, if asked for a source, it gives a plausible sounding page that doesn’t exist.

You can get it to say “X didn’t happen”, but only if given a very specific query:

> what happened in the anglo-vietnamese war of 1808

I’m sorry, but there was no Anglo-Vietnamese War in 1808. The first direct contact between Britain and Vietnam occurred in the early 19th century, but it was limited to sporadic trading contacts and diplomatic missions.

> what are the ferries linking Scotland and Norway

There are currently no regular ferry services linking Scotland and Norway. However, there have been some occasional ferry services in the past, particularly during the summer months, connecting ports such as Aberdeen in Scotland and Bergen or Stavanger in Norway.

Once it knows that the topic exists, however, it just goes for it.

> what are the ferries linking Scotland and the Netherlands

There are currently two regular ferry services linking Scotland and the Netherlands:

  1. DFDS Seaways: DFDS Seaways operates a daily ferry service between Rosyth, near Edinburgh in Scotland, and the Dutch port of Rotterdam. The crossing takes around 20 hours and operates both ways, meaning you can travel from Scotland to the Netherlands and vice versa.
  2. P&O Ferries: P&O Ferries also operates a regular ferry service between Scotland and the Netherlands. Their service operates between the Scottish port of Cairnryan and the Dutch port of Rotterdam or Zeebrugge. The crossing takes around 17 hours and operates both ways.

One of these exists, the other doesn’t – P&O do indeed have ferries at both Cairnryan and Zeebrugge, but they don’t sail between them.

> more details on the Cairnryan Rotterdam ferry

The Cairnryan to Rotterdam ferry is operated by P&O Ferries, and it is a popular route connecting Scotland and the Netherlands. Here are some additional details about the ferry service:

  • The ferry service runs daily, with two sailings per day in each direction. The crossings take around 20-24 hours, depending on the direction of travel and the specific sailing.
  • The ferry departs from Cairnryan, which is located in southwestern Scotland, and arrives at the port of Europoort in Rotterdam, which is one of the largest ports in the Netherlands. From Europoort, it is easy to access other destinations in the Netherlands, such as Amsterdam, The Hague, and Utrecht.

So where do we end up?

  1. It can give a straightforward solution to a coding problem or a very well understood factual question 
  2. It can say no, if it recognises there’s no valid answer
  3. It will give a (usually broadly) correct answer if there is one, but cannot stop itself extending the response with things that sound like they might plausibly be true, either with factual details it has invented, or with low-meaning filler text.
  4. It will double down on that – because it has no sense of truth it will invent plausible sounding details based on what it has said before.

It’s just merrily spitting out answers, regardless of their validity, with a little disclaimer at the bottom saying “ChatGPT may produce inaccurate information about people, places, or facts”.

We can’t say “please only give an answer you are confident about” or “mark the bits which are most reliable”. We can’t say “please check that is correct” or “please give us the sources you drew from”, because it has no conception of sources or external validity. It’s just something that sounds like an answer, and it’s happy luck that it mostly gets something that coincides with reality.

None of this is to say the tool isn’t useful in the right contexts. It can do amazing things with language and style; it can transform bits of text into each other very well. It seems to be pretty reliable for transforming straightforward process descriptions into code. But it cannot be trusted to answer factual questions, and it’s baffling to me that it’s being allowed to confidently do so without any attempt at guard-rails – and then merrily being promoted as a search and question-answering tool.

This stuff is going to get everywhere, unremarked; we’re going to have years to come of it cropping up everywhere, sounding confident but containing negative informational value. To go back to our original analogy – this pipe definitely has a valuable purpose, but right now it’s coughing up an unexpectedly large amount of informational raw sewage, and the people upstairs saying “hey, we’ve got a great new shower” don’t seem to know or care that there’s going to be a lot of buckets and bleach needed to sort it out.

(The simmering disquiet that led to this post was prompted by discovering someone had been writing Wikipedia articles using it. They were terrible…)

Lee of Portrush: a timeline

Five years ago I wrote about the Lee family of photographers, and foolishly used the phrase “over the next few posts”. Ha. One change of job, one house move, and one pandemic later, I’ve finally found the time to start putting it together…

Harbour Terrace and Causeway View – early 1880s

Robert Lee (the elder) was born in Antrim circa 1840. He married at Ballycastle in 1859, and emigrated to Lancashire, where he lived in Barrow until at least 1875, working as a joiner. In 1881 his wife Margaret was keeping a boarding house in Blackpool along with their two younger children, but he had disappeared from the census and, presumably, returned to Ireland.

He had started working as a photographer by at least 1884, when he is known to have been at Harbour Terrace in Portrush (and renting out a house in Ballycastle); shortly afterwards he moved to Causeway View, and produced photographs with a backing along these lines. The “Art Studio” name was used sometimes, but not always.

The name “Causeway View” continues in directories until around 1901, but the studio seemed to bill itself as Lansdowne Crescent or Lansdowne from 1890 onwards – see below.

Portrait of young boy (P047)

Giant’s Causeway – circa 1887 to circa 1896

By 1887, Robert had opened a studio at the Giant’s Causeway, though he may have been operating there earlier; there are some photographs of groups at the Causeway which use backings as above, without mentioning the studio. However, he soon shifted to labelling them “Portrush and Giant’s Causeway”, as below.

Group at the Giant's Causeway (P042)

Around 1890, he took his sons Robert (born c.1863) and Alexander (born c.1869) into the business, meaning that cards are often marked “Lee & Son” or just generically “Lee”, and began to focus his own attentions on developing hotels. The studio at the Giant’s Causeway was primarily operated by the younger son, Alexander, and was leased until 1896. His daughter Henrietta (born c. 1862) may also have entered the business around this point.

The Portrush studio was still at Causeway View until it was destroyed by a storm in 1888, and then rebuilt as the “Tower House”, adjacent to Lee’s other business, the (temperance) Lansdowne Hotel; both opened around 1890. The rebuilt Causeway View studio became known as the Lansdowne Studio, sometimes just Lansdowne.

Cape Town – circa 1895 to circa 1908

In the late 1890s, following the death of his wife in 1894, Robert Lee (the younger) moved to South Africa and established a business there, in the Electric Studio, 54 Plein Street, Cape Town. Cards from this period (from both ends of the world) often state “Portrush and Cape Town”, possibly with the Giant’s Causeway or Coleraine added as well. At some point his brother Alexander moved out as well; they continued in business for several years (and presumably through the Boer War) before returning by 1907/08, apparently after Alexander fell ill. The brothers married (in Robert’s case, remarried) and settled back in Portrush, where Robert ran the studio at Lansdowne Crescent, and Alexander established a studio at Bath Street. The two brothers continued in business, along with their sister & brother-in-law in Main Street, despite a major falling out in 1914.

Coleraine – 1898 to 1908?

Robert Lee built a terrace of houses on Railway Road, Coleraine, in 1897, one of which was opened as a studio in 1898. The properties were inherited by his son Alexander in 1901. The studio may not have been operated by a member of the family – advertisements suggest it was subcontracting to other photographers. Alexander Lee retained property here at least into the 1930s and the studio is last found in directories in 1913.

Bath Street, Portrush – circa 1910 to circa 1930?

This studio was in use from at least 1910 to 1925. It was the studio of Alexander Lee, and presumably was only opened after he returned from South Africa (circa 1907-8?). It may also have used the name “Bath Cottage”. Some photographs were taken of groups at the Giant’s Causeway and printed under this label, but there may not have been a studio there. Alexander did not pass the business on, and it closed at some point in his later years; he and his wife also ran a boarding house at the adjacent Leander House on Bath Street.

Family group at the Giant's Causeway (P004)

Lansdowne, Portrush – circa 1890 to circa 1915?

This name was used for the studio which succeeded Causeway View, on Lansdowne Crescent. It was in use from around 1890 onwards. Until 1901 it was operated by Robert Lee the elder, then inherited by his son Robert along with the adjoining Lansdowne House Hotel. The studio stopped being used at some point in the 1910s, and was itself leased out as a hotel (“Tower House”); this was operating as a hotel in 1915 and 1917, suggesting the studio had stopped operating by this point. Robert Lee then moved the family and the business to Main Street (below).

Main Street / L’Atelier, Portrush – circa 1900 to 1970s

The house and studio at 2 Main Street was built by Robert Lee probably in the late 1890s; it is not clear when it was put to use as a studio, but an advertisement in 1899 is looking for a shop assistant with experience in “fancy goods”. A contemporary image suggests that it sold a large amount of tourist souvenirs whilst also operating as a studio.

The property was inherited by his daughter Henrietta Leech in 1901, and that year she and her husband were shown as occupying it and working as photographers. It continued to be occupied by her until it was sold to her brother Robert, who took over the business there; this was by circa 1920. (Her husband had died in 1916 and her younger son, who had been expected to inherit the business, in 1917.)

The name “L’Atelier” is identified as early as 1909, but it is not clear exactly when it started being used – probably after 1905. This studio was the last part of the business to work as a photography studio, and the name was retained right through to the 1970s.

Portrait of couple (P038)

Great James Street, Londonderry – 1925-27?

Very little is known about this studio – “Messrs. Lee of Portrush” took over a studio here in 1925, and were still operating it by Christmas 1927. The “Messrs” is slightly odd given that the two brothers had seemingly fallen out by this point, so it is not clear if they were working together. The card below suggests it was run from Alexander Lee’s side, however.

Studio portrait of young girl (P002)

Any comments below will be replied to by email if possible! I am always delighted to see new photographs from the Lees, and will see what I can do to help you date them. Please do get in touch.

Industrial Library

A little while ago, I ran across the intriguing “Industrial Library”, a series of small cheap hardback volumes (priced between a shilling and half-a-crown) published intermittently between the 1830s and 1880s. Each volume focused on a particular trade, generally giving a mixture of an outline of what the work might involve, some inevitable moralising about the need for good work ethic and religious devotion, and practical advice.

As a result, some of them are amazing fragments of social history – the “Housemaid” volume, for example, gives a detailed description of the daily routine of a servant in a moderately sized household, and just what some mundane tasks like cleaning candle-sticks might actually entail. (The didacticism is helpfully provided by an unexpected page explaining why servant girls really shouldn’t be spending their money on unwarranted luxuries like tea.)

A few of them are a bit differently oriented – the “Confectioner” volume, for example, is mostly a set of recipes and processes. (How to make chocolate starting from a bag of ripe cocoa beans – no problem, pages 35-37, make sure to have a roasting pot and an iron mortar and pestle handy. And if you need somewhere to store it afterwards, the later editions tell you how to build an ice-house.)

After a bit of digging, it seems the series was first produced from 1838 onwards (or so), as the “Guide to Service” and “Guide to Trade” series (collectively described as the “Industrial Series”, though this name was not on the volumes). This work was done under the auspices of the Poor Law Commissioners, who apparently wanted to help train children, particularly in workhouses, into various useful occupations. The series was published by Charles Knight – the publisher behind the Penny Magazine and Penny Cyclopedia, and a collaborator of the Society for the Diffusion of Useful Knowledge, so very much the sort of publisher keen on cheap-but-useful publications for a working-class audience.

I am not completely sure how many volumes were in these initial series, but I suspect most of the ones that we see later were initially published here. There is at least one volume (“The Farmer”, ed. George Nicholls, 1844) that fits the pattern of the “Industrial Series”, but does not reappear in the later issues and does not have a “Guide to Trade” label. An early advertisement for the series (published at the end of another Knight book, 1838) identified 31 volumes for “service” and 56 or 59 for “trade”, and anticipated publishing one of each per month, though there is a good chance not all of them were ever published, and a few (such as “Banker’s clerk” or “Governess”) were not mentioned here, but did in the end go on to be published in the first series.

Most volumes were unsigned, but we know some of the authors to be interesting figures. Knight himself wrote the “Publisher” volume, and his frequent collaborator, the writer and social theorist Harriet Martineau, wrote the “Dressmaker”, “Housemaid”, “Lady’s maid”, and “Maid of all work” volumes. James Devlin, who wrote the two “Shoemaker” volumes (the original edition, at least, seems to have been long enough to require two parts), later seems to have been influential in the early shoemakers’ unions and a prominent Chartist organiser, and wrapping up the set, Sir George Stephen, a prominent abolitionist, wrote the “Clerk”, “Governess”, and “Groom” volumes. An intriguing cross-section of early Victorian left-wing intellectuals.

Circa 1849, (some of?) these volumes were reissued by Houlston & Stoneman under the label of the “Industrial Library”, which seems to have been used on the covers and in advertising, but not the title pages. (It is possible that some of the Houlston volumes were new and not from the Knight series; I haven’t traced all of them yet. Intriguingly they seem to have used the same printer.)

They were all reissued in c. 1862-65 by Houlston & Wright, now as a coherent set with a series index listing 34 numbered volumes of the “Industrial Library”, priced between 1s and 2s6d (a bit less than the previous Houlston series). With the possible exception of “The Farmer” (noted above), all the volumes from the Knight series seem to reappear here.

Finally, in 1877-78, the entire series was reissued, again under the “Industrial Library” banner, by Houlston and Sons. These had a series device on the titlepage, and an appropriate quotation from Ecclesiastes (“Whatsoever thy hand findeth to do, do it with thy might”. Hard to argue.) They were also published with nice embossed “Industrial Library” covers, a little gilding on the title, and end-papers advertising other useful and improving Houlston titles.

(It is possible that some individual volumes were reissued independently of these series; the “Banker’s clerk” volume in the 1870s series is described as a fifth edition, not a fourth. This volume is also the only one I have definitely identified being republished by Houlston after the 1870s or early 1880s; there seems to have been a seventh edition in 1891.)

The fourth series, the 1870s publications, also had some slight tweaks to the line-up. One volume was removed outright – No. 28, “Ploughman”, was dropped and replaced with a new (out-of-sequence) No. 28, “Draper and haberdasher”, which probably says something interesting about what was seen to be important in the modern commercial world. No. 30 was also tweaked to be “Poultry keeper”, not “Poultry maid”, an unexpected moment of occupational gender-neutrality.

And thanks to the wonders of mass digitisation, of course, we can now read them. I think that 14 of the 35 total (34 plus one extra in 1878) have freely-available online editions; another three are available through Gale; and five are on HathiTrust or Google Books but currently unavailable due to presumed copyrights. That leaves us with 22 of 35 digitised in some form or another, which for something quite ephemeral like this is not bad at all.

In order, the currently available volumes are (updated!):

I will update this if I manage to track down open copies of any more of them. (Edit: “Governess”, “Nurse”, and “Tailor” added 11/1/20; “Butler” added 16/1/20.)


Gender and deletion on Wikipedia

So, a really interesting question cropped up this weekend:

I’m trying to find out how many biographies of living persons exist on the English Wikipedia, and what kind of data we have on them. In particular, I’m looking for the gender breakdown. I’d also like to know when they were created; average length; and whether they’ve been nominated for deletion.

This is, of course, something that’s being discussed a lot right now; there is a lot of emerging push-back against the excellent work being done to try and add more notable women to Wikipedia, and one particular deletion debate got a lot of attention in the past few weeks, so it’s on everyone’s mind. And, instinctively, it seems plausible that there is a bias in the relative frequency of nomination for deletion – can we find if it’s there?

My initial assumption was, huh, I don’t think we can do that with Wikidata. Then I went off and thought about it for a bit more, and realised we could get most of the way there with some inferences. Here are the results, and how I got there. Thanks to Sarah for prompting the research!

(If you want to get the tl;dr summary – yes, there is some kind of difference in the way older male vs female articles have been involved with the deletion process, but exactly what that indicates is not obvious without data I can’t get at. The difference seems to have mostly disappeared for articles created in the last couple of years.)

Statistics on the gender breakdown of BLPs

As of a snapshot of yesterday morning, 5 May 2019, the English Wikipedia had 906,720 articles identified as biographies of living people (BLPs for short). Of those, 697,402 were identified as male by Wikidata, 205,117 as female, 2464 had some other value for gender, 1220 didn’t have any value for gender (usually articles on groups of people, plus some not yet updated), and 517 simply didn’t have a connected Wikidata item (yet). Of those with known gender, it breaks down as 77.06% male, 22.67% female, and 0.27% some other value. (Because of the limits of the query, I didn’t try and break down those in any more detail.)
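The percentages above are simple shares of the known-gender total (male + female + other), leaving out the articles with no gender value or no Wikidata item. A minimal sketch of that arithmetic, using only the counts quoted above; the function name `gender_shares` is just an illustrative label, not part of any tool used here:

```shell
#!/bin/sh
# gender_shares MALE FEMALE OTHER
# Print each group's share of the known-gender total (hypothetical helper).
gender_shares() {
    awk -v m="$1" -v f="$2" -v o="$3" 'BEGIN {
        known = m + f + o
        printf "known\t%d\n", known
        printf "male\t%.2f%%\n", 100 * m / known
        printf "female\t%.2f%%\n", 100 * f / known
        printf "other\t%.2f%%\n", 100 * o / known
    }'
}

# Counts from the 5 May 2019 snapshot described in the text.
gender_shares 697402 205117 2464
```

With those three counts the known-gender total comes to 904,983, and the shares round to the 77.06% / 22.67% / 0.27% quoted in the text.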

This is, as noted, only articles about living people; across all 1,626,232 biographies in the English Wikipedia with a gender known to Wikidata, it’s about 17.83% female, 82.13% male, and 0.05% some other value. I’ll be sticking to data on living people throughout this post, but it’s interesting to compare the historic information.

So, how has that changed over time?

BLPs by gender and date of creation

This graph shows all existing BLPs, broken down by gender and (approximately) when they were created. As can be seen, and as might be expected, the gap has closed a bit over time.

Percentage of BLPs which are female over time

Looking at the ratio over time (expressed here as %age of total male+female), the relative share of female BLPs was ~20% in 2009. In late 2012, the rate of creation of female BLPs kicked up a gear, and from then on it’s been noticeably above the long-term average (almost hitting 33% in late 2017, but dropping back since then). This has driven the overall share steadily and continually upwards, now at 22.7% (as noted above).

Now the second question, do the article lengths differ by gender? Indeed they do, by a small amount.

BLPs by current article size and date of creation

Female BLPs created at any time since 2009 are slightly longer on average than male ones of similar age, with only a couple of brief exceptions; the gap may be widening over the past year but it’s maybe too soon to say for sure. Average difference is about 500 bytes or a little under 10% of mean article size – not dramatic but probably not trivial either. (Pre-2009 articles, not shown here, are about even on average)

Note that this is raw bytesize – actual prose size will be smaller, particularly if an article is well-referenced; a single well-structured reference can be a few hundred characters. It’s also the current article size, not size at creation, hence why older articles tend to be longer – they’ve had more time to grow. It’s interesting to note that once they’re more than about five years old they seem to plateau in average length.

Finally, the third question – have they been nominated for deletion? This was really interesting.

Percentage of BLPs which have previously been to AFD, by date of creation and gender

So, first of all, some caveats. This only identifies articles which go through the structured “articles for deletion” (AFD) process – nomination, discussion, decision to keep or delete. (There are three deletion processes on Wikipedia; the other two are more lightweight and do not show up in an easily traceable form). It also cannot specifically identify if that exact page was nominated for deletion, only that “an article with exactly the same page name has been nominated in the past” – but the odds are good they’re the same if there’s a match. It will miss out any where the article was renamed after the deletion discussion, and, most critically, it will only see articles that survived deletion. If they were deleted, I won’t be able to see them in this analysis, so there’s an obvious survivorship bias limiting what conclusions we can draw.

Having said all that…

Female BLPs created 2009-16 appear noticeably more likely than male BLPs of equivalent age to have been through a deletion discussion at some point in their lives (and, presumably, all have been kept). Since 2016, this has changed and the two groups are about even.

Alongside this, there is a corresponding drop-off in the number of articles created since 2016 which have associated deletion discussions. My tentative hypothesis is that articles created in the last few years are generally less likely to be nominated for deletion, perhaps because the growing use of things like the draft namespace (and associated reviews) means that articles are more robust when first published. Conversely, though, it’s possible that nominations continue at the same rate, but the deletion process is just more rigorous now and a higher proportion of those which are nominated get deleted (and so disappear from our data). We can’t tell.

(One possible explanation that we can tentatively dismiss is age – an article can be nominated at any point in its lifespan so you would tend to expect a slowly increasing share over time, but I would expect the majority of deletion nominations come in the first weeks and then it’s pretty much evenly distributed after that. As such, the drop-off seems far too rapid to be explained by just article age.)

What we don’t know is what the overall nomination for deletion rate, including deleted articles, looks like. From our data, it could be that pre-2016 male and female articles are nominated at equal rates but more male articles are deleted; or it could be that pre-2016 male and female articles are equally likely to get deleted, but the female articles are nominated more frequently than they should be. Either of these would cause the imbalance. I think this is very much the missing piece of data and I’d love to see any suggestions for how we can work it out – perhaps something like trying to estimate gender from the names of deleted articles?

Update: Magnus has run some numbers on deleted pages, doing exactly this – inferring gender from pagenames. Of those which were probably a person, ~2/3 had an inferred gender, and 23% of those were female. This is a remarkably similar figure to the analysis here (~23% of current BLPs female; ~26% of all BLPs which have survived a deletion debate female)

So in conclusion

  • We know the gender breakdown: skewed male, but growing slowly more balanced over time, and better for living people than historical ones.
  • We know the article lengths; slightly longer for women than men for recent articles, about equal for those created a long time ago.
  • We know that there is something different about the way male and female biographies created before ~2017 experience the deletion process, but we don’t have clear data to indicate exactly what is going on, and there are multiple potential explanations.
  • We also know that deletion activity seems to be more balanced for articles in both groups created from ~2017 onwards, and that these also have a lower frequency of involvement with the deletion process than might have been expected. It is not clear what the mechanism is here, or if the two factors are directly linked.

How can you extract this data? (Yes, this is very dull)

The first problem was generating the lists of articles and their metadata. The English Wikipedia category system lets us identify “living people”, but not gender; Wikidata lets us identify gender (property P21), but not reliably “living people”. However, we can creatively use the petscan tool to get the intersection of a SPARQL gender query + the category. Instructing it to explicitly use Wikipedia (“enwiki” in other sources > manual list) and give output as a TSV – then waiting for about fifteen minutes – leaves you with a nice clean data dump. Thanks, Magnus!

(It’s worth noting that you can get this data with any characteristic indexed by Wikidata, or any characteristic identifiable through the Wikipedia category schema, but you will need to run a new query for each aspect you want to analyse – the exported data just has article metadata, none of the Wikidata/category information)

The exported files contain three things that are very useful to us: article title, pageid, and length. I normalised the files like so:

grep '[0-9]' enwiki_blp_women_from_list.tsv | cut -f 2,3,5 > women-noheader.tsv

This drops the header line (it’s the only one with no numeric characters) and extracts only the three values we care about (and conveniently saves about 20MB).

This gives us two of the things we want (age and size) but not deletion data. For that, we fall back on inference. Any article that is put through the AFD process gets a new subpage created at “Wikipedia:Articles for deletion/PAGENAME”. It is reasonable to infer that if an article has a corresponding AFD subpage, it’s probably about that specific article. This is not always true, of course – names get recycled, pages get moved – but it’s a reasonable working hypothesis and hopefully the errors are evenly distributed over time. I’ve racked my brains to see if I could anticipate a noticeable difference here by gender, as this could really complicate the results, but provisionally I think we’re okay to go with it.

To find out if those subpages exist, we turn to the enwiki dumps. Specifically, we want “enwiki-latest-all-titles.gz” – which, as it suggests, is a simple file listing all page titles on the wiki. Extracted, it comes to about 1GB. From this, we can extract all the AFD subpages, as so:

grep "Articles_for_deletion/" enwiki-latest-all-titles | cut -f 2 | sort | uniq | cut -f 2 -d / | sort | uniq > afds

This extracts all the AFD subpages, removes any duplicates (since eg talkpages are listed here as well), and sorts the list alphabetically. There are about 424,000 of them.

Going back to our original list of articles, we want to bin them by age. To a first approximation, pageid is sequential with age – it’s assigned when the page is first created. There are some big caveats here; for example, a page being created as a redirect and later expanded will have the ID of its initial creation. Pages being deleted and recreated may get a new ID, pages which are merged may end up with either of the original IDs, and some complicated page moves may end up with the original IDs being lost. But, for the majority of pages, it’ll work out okay.

To correlate pageID to age, I did a bit of speculative guessing to find an item created on 1 January and 1 July every year back to 2009 (eg pageid 43190000 was created at 11am on 1 July 2014). I could then use these to extract the articles corresponding to each period as so:
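One way to check a guessed pageid is to ask the MediaWiki API for the timestamp of the page's oldest revision. This is only a sketch of how such a check could be done, not necessarily how the boundaries above were found; the helper name `first_revision_url` is made up for illustration, and the oldest visible revision is not always the true creation date (imports and history merges can confuse it).

```shell
#!/bin/sh
# first_revision_url PAGEID
# Print an English Wikipedia API URL that requests the timestamp of the
# oldest revision (rvdir=newer, rvlimit=1) of the page with that pageid.
first_revision_url() {
    printf 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=timestamp&rvlimit=1&rvdir=newer&pageids=%s\n' "$1"
}

# Example (needs network access, so it is left commented out here):
# curl -s "$(first_revision_url 43190000)"
```

Stepping the pageid up or down and re-running the request should converge on an ID created close to each 1 January or 1 July boundary.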

...
awk -F '\t' '$2 >= 41516000 && $2 < 43190000' < men-noheader.tsv > bins/2014-1-M
awk -F '\t' '$2 >= 43190000 && $2 < 44909000' < men-noheader.tsv > bins/2014-2-M
...

This finds all items with a pageid (in column #2 of the file) between the specified values, and copies them into the relevant bin. Run once for men and once for women.
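Rather than writing one awk line per bin, the same split can be done in a single pass driven by a small boundaries file. This is a rough sketch under some assumptions: the file `bounds.tsv`, the function name `bin_by_pageid`, and the sample rows are all hypothetical, and only the three pageid boundaries are taken from the commands above. Each boundary line gives a bin label and the pageid at which it starts; the last line is just an end marker.

```shell
#!/bin/sh
# bin_by_pageid BOUNDS INPUT SUFFIX OUTDIR
# BOUNDS: "label<TAB>first_pageid" lines in ascending order; the final line
#         only marks where the last real bin ends.
# INPUT:  "title<TAB>pageid<TAB>length" rows.
# Writes each row to OUTDIR/<label>-<SUFFIX>.
bin_by_pageid() {
    mkdir -p "$4"
    awk -F '\t' -v suffix="$3" -v outdir="$4" '
        # First file: remember each label and the pageid where it starts.
        NR == FNR { label[++n] = $1; start[n] = $2 + 0; next }
        # Second file: find the last boundary that is <= this pageid.
        {
            id = $2 + 0
            for (i = n; i >= 1; i--)
                if (id >= start[i]) break
            # Skip rows before the first boundary or at/after the end marker.
            if (i >= 1 && i < n)
                print > (outdir "/" label[i] "-" suffix)
        }
    ' "$1" "$2"
}

# Tiny demonstration with made-up rows.
demo=$(mktemp -d)
printf '2014-1\t41516000\n2014-2\t43190000\nend\t44909000\n' > "$demo/bounds.tsv"
printf 'Example_A\t41516000\t1200\nExample_B\t43189999\t3400\nExample_C\t43190000\t5600\nExample_D\t44909000\t800\n' > "$demo/men-noheader.tsv"
bin_by_pageid "$demo/bounds.tsv" "$demo/men-noheader.tsv" M "$demo/bins"
wc -l "$demo"/bins/*
```

In the demonstration, the first two rows land in `2014-1-M`, the third in `2014-2-M`, and the fourth is dropped because it sits on the end marker. Running the function again with the women's file and suffix `F` covers the second pass.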

Then we can run a short report, along these lines (the original had loops in it):

  cut -f 1 bins/2014-1-M | sort > temp-M
  echo -e 2014-1-M"\tM\t"`cat bins/2014-1-M | wc -l`"\t"`awk -F '\t' '{ total += $3; count++ } END { print total/count }' bins/2014-1-M`"\t"`comm -1 -2 temp-M afds | wc -l` >> report.tsv

This adds a line to the file report.tsv with (in order) the name of the bin, the gender, the number of entries in it, the mean value of the length column, and a count of the number which also match names in the afds file. (The use of the temp-M file is to deal with the fact that the comm tool needs properly sorted input.)
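The looped version is not reproduced here, but a sketch of what it might look like follows. The function name `report_bins` and the sample data are hypothetical. It assumes bin files are named like `2014-1-M`, that the title is in column 1 and the length in column 3, and that titles in the bins and in the afds list are written the same way (for example, both with underscores). The afds list is re-sorted inside the function so that both inputs to comm share one sort order.

```shell
#!/bin/sh
# report_bins BINDIR AFDS
# For every bin file in BINDIR, print one TSV line:
# bin name, gender, article count, mean length, count with a matching AFD title.
report_bins() {
    afds_sorted=$(mktemp)
    sort "$2" > "$afds_sorted"
    for bin in "$1"/*; do
        [ -f "$bin" ] || continue
        name=$(basename "$bin")
        gender=${name##*-}      # trailing M or F in the file name
        titles=$(mktemp)
        cut -f 1 "$bin" | sort > "$titles"
        count=$(wc -l < "$bin" | tr -d ' ')
        mean=$(awk -F '\t' '{ total += $3 } END { if (NR) print total / NR }' "$bin")
        # comm -12 keeps only lines present in both sorted files.
        matched=$(comm -12 "$titles" "$afds_sorted" | wc -l | tr -d ' ')
        printf '%s\t%s\t%s\t%s\t%s\n' "$name" "$gender" "$count" "$mean" "$matched"
        rm -f "$titles"
    done
    rm -f "$afds_sorted"
}

# Tiny demonstration with made-up rows.
demo=$(mktemp -d)
mkdir "$demo/bins"
printf 'Example_A\t41516000\t1200\nExample_B\t43189999\t3400\n' > "$demo/bins/2014-1-M"
printf 'Something_else\nExample_B\n' > "$demo/afds"
report_bins "$demo/bins" "$demo/afds"
```

Redirecting the output with `> report.tsv` gives the same kind of file as the one-bin command above, ready to drop into a spreadsheet.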

After that, generating the data is lovely and straightforward – drop the report into a spreadsheet and play around with it.

George Ernest Spero, the vanishing MP

As part of the ongoing Wikidata MPs project, I’ve come across a number of oddities – MPs who may or may not have been the same person, people who essentially disappear after they leave office, and so on. Tracking these down can turn into quite a complex investigation.

One such was George Ernest Spero, Liberal MP for Stoke Newington 1923-24, then Labour MP for Fulham West 1929-30. His career was cut short by his resignation in April 1930; shortly afterwards, he was declared bankrupt. Spero had already left the country for America, and nothing more was heard of him. The main ambiguity was when he died – various sources claimed either 1960 or 1976, but without it being clear which was more reliable, or any real details on what happened to him after 1930. In correspondence with Stephen Lees, who has been working on an incredibly useful comprehensive record of MPs’ death-dates, I did some work on it last year and eventually confirmed the 1960 date; I’ve just rediscovered the notes from this and since it was an interesting little mystery, thought I’d post them.

George Spero, MP and businessman

So, let’s begin with what we know about him up to the point at which he vanished.

George Ernest Spero was born in 1894. He began training at the Royal Dental Hospital in 1912, and served in the RNVR as a surgeon during the First World War. He had two brothers who also went into medicine; Samuel was a dentist in London (and apparently also went bankrupt, in 1933), while Leopold was a surgeon or physician (trained at St. Mary’s, RNVR towards the end of WWI, still in practice in the 1940s). All of this was reasonably straightforward to trace, although oddly George’s RNVR service records seem to be missing from the National Archives.

After the war, he married Rina Ansley (née Rina Ansbacher, born 14 March 1902) in 1922; her father was a wealthy German-born stockbroker, resident in Park Lane, who had naturalised in 1918. They had two daughters, Rachel Anne (b. 1923) and Betty Sheila (b. 1928). After his marriage, Spero went into politics in Leicester, where he seems to have been living, and stood for Parliament in the 1922 general election. The Nottingham Journal described him as for “the cause of free, unfettered Liberalism … Democratic in conviction, he stands for the abolition of class differences and for the co-operation of capital and labour.” However, while this was well-tailored to appeal to the generally left-wing voters of Leicester West, and his war record was well-regarded, the moderate vote was split between the Liberal and National Liberal candidates, with Labour taking the seat.

The Conservative government held another election in 1923, aiming to strengthen a small majority (does this sound familiar?), and Spero – now back in London – contested Stoke Newington, then a safe Conservative seat, again as a left Liberal. With support from Labour, who did not contest the seat, Spero ran a successful campaign and unseated the sitting MP. He voted in support of the minority Labour government on a number of occasions, and was one of the small number of Liberal rebels who supported them in the final no-confidence vote. However, this was not enough to prevent Labour fielding a candidate against him in 1924; the Conservative candidate took 57% of the vote, with the rest split evenly between Labour and Liberal.

Spero drifted from the Liberals into the Labour Party, probably a more natural home for his politics, joining it in 1925. By the time of the next general election, in May 1929, he had become the party’s candidate for Fulham West, winning it from the Conservatives with 45% of the vote.

He was a moderately active Government backbencher for the next few months, including being sent as a visitor to Canada during the recess in September 1929, travelling with his wife. While overseas, she caused some minor amusement to the British papers after reporting the loss of a £6,000 pearl necklace – they were delighted to report this alongside “socialist MP”. He was last recorded voting in Hansard in December, and did not appear in 1930. In February and March he was paired for votes, with a newspaper report in early March stating that around the start of the year he had been advised to take a rest to avoid a complete nervous breakdown, and had gone to the South of France, but “hopes to return to Parliament before the month is out”. However, on 9th April he formally took the Chiltern Hundreds (it is interesting that a newspaper report suggested his local party would choose whether to accept the resignation).

However, things were moving quickly elsewhere. A case was brought against him in the High Court for £10,000, arising from his sale of a radio company in 1928-29. During the court hearing, at the end of May, it was discovered that a personal cheque for £4,000 given by Spero to guarantee the company’s debts had been presented to his bank in October 1929, but was not honoured. He had at this point claimed to be suing the company for £20,000, buying six months’ legal delay, sold his furniture, and – apparently – left the country for America. Bankruptcy proceedings followed later that year (where he was again stated to be in America) and, unsurprisingly, his creditors seem to have received very little.

At this point, the British trail and the historic record draw to a gentle close. But what happened to him?

The National Portrait Gallery gave his death as 1960, while an entry in The Palgrave Dictionary of Anglo-Jewish History reported that they had traced his death to 1976 in Belgrade, Yugoslavia (where, as a citizen, it was registered with the US embassy). Unfortunately, it did not go into any detail about how they worked this out, and this just heightened the mystery – if it was true, how had a disgraced ex-MP ended up in Yugoslavia on a US passport three decades later? And, conversely, who was it who had died in 1960?

George Spears, immigrant and doctor

We know that Spero went to America in 1929-30; that much seemed to be a matter of common agreement. Conveniently, the American census was carried out in April 1930, and the papers are available. On 18 April, he was living with his family in Riverside Drive, upper Manhattan; all the names and ages line up, and Spero is given as a medical doctor, actively working. Clearly they were reasonably well off, as they had a live-in maid, and it seems to be quite a nice area.

In 1937, he petitioned for American citizenship in California, noting that he had lived there since March 1933. As part of the process, he formally notified that he intended to change his name to George Ernest Spears. (He also gave his birthdate as 2 March 1894, of which more later).

While we can be reasonably confident these are the same man due to the names and dates of the family, the match is very neatly confirmed by the fact that the citizenship papers have a photograph, which can be compared to an older newspaper one. There is fifteen years difference, but we can see the similarities between the prospective MP of 27 and the older man of 43.

George Spears, with the same family, then reappears in the 1940 census, back in Riverside Drive. He is now apparently practicing as an optician, and doing well – income upwards of $6000. Finally, we find a draft record for him living in Huntington, Long Island at some point in 1942. Note his signature here, which is visibly the same hand as in 1937, except “E. Spears” not “Ernest Spero”.

It is possible he reverted to his old name for a while – there are occasional appearances of a Dr. George Spero, optometrist, in the New York phone books between the 1940s and late 1950s. Not enough detail to be sure either way, though.

So at this point, we can trace Spero/Spears continually from 1930 to 1942. And then nothing, until on 7 January 1960, George E. Spears, born 2 March 1894, died in California. Some time later, in June 1976, George Spero, born 11 April 1894, died in Belgrade, Yugoslavia, apparently a US citizen. Which one was our man?

The former seemed more likely, but can we prove it? The death details come from an index, which gives a mother’s maiden name of “Robinson” – unfortunately the full certificate isn’t there and I did not feel up to trying to track down a paper Californian record to see what else it said.

If we return to the UK, we can find George Spero in the 1901 census in Dover, with his parents Isidore Sol [Solomon], a ‘dental mechanic’, and Rachel, maiden name unknown. The family later moved to London, the parents naturalised, Isidore died in 1925 – and probate goes to “George Ernest Spero, physician”, which seems to confirm that this is definitely the right family and not a different George Spero. The 1901 census notes that two of the older children were born in Dublin, so we can trace them in the Irish records. Here we have an “Israel S Spero” marrying Rachel Robinson in 1884, and a subsequent child born to Solomon Israel Spero and Rachel Spero nee Robinson. There are a few other Speros or Spiros appearing in Dublin, but none married around the right time, and none with such similar names. If Israel Solomon Spero is the same as Isidore Solomon Spero, this all ties up very neatly.

It leaves open the mystery, however, of who died in Yugoslavia. It seems likely this was a completely different man (who had not changed his name), but I have completely failed to trace anything about him. A pity – it would have been nice to definitively close off that line of enquiry.

What’s in a name? MPs and their preferred titles

A quick skim of the list of members in Hansard shows that there is no consistency in how it refers to politicians – some are Ms Jane Smith, others are merely John Brown.

My understanding – I welcome corrections! – is that this is ultimately personal choice. MPs are asked to choose how they are described in Hansard, with the option for a title. (I am not sure quite how this process works, but I assume there is a form; there always is.) This decision eventually percolates through to all of the data produced by Parliament. Of course, “personal choice” might just be “whatever they [or their assistant] happened to think was expected when filling out the form”, rather than a conscious and deliberate choice.

So, what do people do? A 2010 Commons factsheet says vaguely that “A few Members of both sexes have requested that no title be used (e.g. Jennifer Jones MP)” but a cursory glance down the list shows it’s more than “a few”. It turns out the full data is available from data.parliament.uk (as a big blob of XML) and so we can actually do some stats on this.
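A minimal sketch of that kind of tally, assuming the XML has been downloaded already. The element names (`Members`, `Member`, `Name`, `Title`) are invented for illustration and are not the real data.parliament.uk schema, so they would need swapping for whatever the actual file uses.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: these element names are made up for the example
# and do not reflect the real data.parliament.uk schema.
SAMPLE = """
<Members>
  <Member><Name>Jane Smith</Name><Title>Ms</Title></Member>
  <Member><Name>John Brown</Name><Title></Title></Member>
  <Member><Name>Ann Other</Name><Title>Dr</Title></Member>
  <Member><Name>Joe Bloggs</Name></Member>
</Members>
"""

def count_titles(xml_text):
    """Tally preferred titles, counting a missing or empty one as '(none)'."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for member in root.iter("Member"):
        # findtext returns None when the element is absent
        title = (member.findtext("Title") or "").strip()
        counts[title or "(none)"] += 1
    return counts

counts = count_titles(SAMPLE)
for title, n in counts.most_common():
    print(title, n)
```

Treating an absent element and an empty one the same way is a choice made here for simplicity; the real data may distinguish them.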

For current MPs, 145 of 650 have a preferred title (based on current data not past preferences). 33 Sir, 6 Dame, 17 Dr, 7 Ms, 11 Mrs, 71 Mr. Overall, 78% of MPs do not have a preferred title.

Of those, 44 are Labour, 93 Conservative, 3 LD, 2 SNP, 2 DUP, 1 Independent. So 17% of Labour MPs have a preferred title and 29% of Conservatives. Split by gender, 15% of women (32 of 209) list a title, versus 26% of men (113 of 441).
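The headline proportions can be rechecked from the counts quoted above. This is only arithmetic on the numbers in the post, not a re-run against the Parliament data, and the helper name `share` is just for illustration.

```python
# Counts as quoted in the post; nothing here is read from the XML itself.
titles = {"Sir": 33, "Dame": 6, "Dr": 17, "Ms": 7, "Mrs": 11, "Mr": 71}
total_mps = 650
by_gender = {"women": (32, 209), "men": (113, 441)}  # (with a title, all MPs)

def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)

with_title = sum(titles.values())
print("MPs with a preferred title:", with_title)
print("% without a title:", share(total_mps - with_title, total_mps))
for group, (titled, everyone) in by_gender.items():
    print(f"% of {group} with a title:", share(titled, everyone))
```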

Of course, in some circumstances you don’t really have a choice – it would be a bit odd to say “I’d rather not be Sir X” once you’ve accepted a knighthood. Omitting anyone who’s a knight or dame, it becomes 20% of Conservatives & 15% of Labour having preferred titles, & overall 20% of men and 13% of women. The general proportions are broadly the same but the Labour-Conservative gap has narrowed a bit.

Doctors are an interesting question. Some PhD’ed MPs make a point of using their doctorate, but many others don’t. (They are in good company if so – the world’s most prominent doctorate-holding politician doesn’t, either). A couple of years ago, Chris Brooke tried to track down every current MP with a PhD. I took his list (with post-2017 updates), and a paper in the BMJ listing new medical MPs after the last election, and pulled together a total of 31 MPs who could be “Dr” – 21 have PhDs (or similar), 10 are medical doctors of various forms. (I have counted one with a D.Clin.Psy as “medical” rather than “like a PhD”). We’ve seen that only 17 people list themselves as Dr – who are they?

It turns out that every medical doctor uses “Dr” as their title, but only a third of PhDs (7 of 21). Two of the PhDs are “Sir”, but didn’t appear to use the title before getting knighthoods, and one sticks firmly to “Mr”; the rest are blank.

Across the parties, the Conservatives have six medics (all Dr) and 11 PhDs (three Dr); Labour have two medics (all Dr) and eight PhDs (four Dr). Not really enough to say anything confident about the difference between the parties.

Lastly, there’s the question of the change over time. Interestingly Paul Seaward noted that in the 1990s, the trend was for new doctoral MPs to use “Dr” for a few months and then quietly drop it.

The raw XML includes a note of the change of style by date since c. 2010 (presumably so that you can check you’re using a time-appropriate form if needed). It’s a bit noisy because it seems to have a lot of back-and-forth changes around election dates, which probably hints at changes not purely initiated by the Members themselves. Given that this complicates the data, I’d be cautious about drawing any conclusions from it without much more careful examination, but perhaps in a few years’ time we can start saying things about whether Members’ titles are indeed becoming gradually less common, or if it turns out that mostly not using them is a fashion that comes and goes…

At-risk content on Flickr

Flickr has recently announced that it will be cutting back storage for its free accounts; as of early 2019, they will be limited to 1000 images, and any files beyond that limit will be progressively deleted.

Personally speaking, this surprised me a little bit, because I’d forgotten they’d removed the 200-image limit a few years ago. I am generally quite comfortable with the idea of them imposing a capacity limit and charging to go beyond that; it’s a fair way to price your service, and ultimately, it has to be paid for. But retroactive deletion is a bit unfortunate (especially if handled as an abrupt guillotine).

A few people raised the reasonable question – how much material is now at risk? A huge chunk of Wikimedia Commons material is sourced from Flickr (imported under free licenses) and, in addition, there is the reasonably successful Flickr Commons program for image hosting from cultural institutions.

Looking at the 115 Flickr Commons accounts shows that there are ~480,000 images from the 54 Pro accounts, and ~6,450,000 from the 61 non-Pro accounts. This seems a very dramatic difference, but on closer examination the British Library and Internet Archive (both non-Pro accounts) make up the vast majority of this, with ~6,350,000 images, mostly extracts from digitized book images. Flickr have since stated that Flickr Commons accounts will not be affected (it will be interesting to see if they now expand the program to include many of the other institutional accounts).

For “normal” users, it’s a bit harder to be sure. Flickr state that “the overwhelming majority of Pros have more than 1,000 photos on Flickr, and more than 97% of Free members have fewer than 1,000”. But from the Commons perspective, what we really want to know is “what proportion of the kind of thing we want to import is at risk?” Looking at this type of material is potentially quite interesting – it goes beyond the simple “Flickr as a personal photostore” and into “Flickr as a source of the cultural commons”.

So, analysis time! I pulled a list of all outbound links from Commons. For simplicity, I didn’t try to work out which of these were links from file pages as opposed to navigational/maintenance/user pages, but a quick sanity-check suggests that the vast majority of pages with outbound Flickr links are file descriptions – something like 99.7% – so it seems reasonable to just take the whole lot. I then extracted any Flickr userIDs I could find, either in links to author profiles or in image URLs themselves (eg 12403504@N02), and deduplicated the results so we ended up with a pile of userID-page pairs. The deduplication was necessary because a raw count of links can get quite confusing – some of the Internet Archive imports can have 20-30 links per file description page, and one of the British Library map maintenance pages has 9500…
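A rough sketch of the extraction and deduplication step, not the script actually used. It assumes the link dump has been reduced to (page title, URL) rows; the sample rows and the function name `user_page_pairs` are hypothetical. It does not handle URL-encoded IDs (`%40` for `@`), and group IDs, which have the same shape, would still need filtering out afterwards.

```python
import re

# Raw Flickr IDs look like 12403504@N02: digits, "@N", digits.
# Assumption: IDs appear unencoded in the URL; group IDs match too.
USER_ID = re.compile(r"\d+@N\d+")

def user_page_pairs(links):
    """Turn (page, url) rows into a deduplicated set of (userID, page) pairs."""
    pairs = set()
    for page, url in links:
        for user_id in USER_ID.findall(url):
            pairs.add((user_id, page))
    return pairs

# Hypothetical rows, for illustration only.
links = [
    ("File:Example one.jpg", "https://www.flickr.com/photos/12403504@N02/1234567890"),
    ("File:Example one.jpg", "https://www.flickr.com/people/12403504@N02"),
    ("File:Example two.jpg", "https://www.flickr.com/photos/britishlibrary/1234567891"),
]
pairs = user_page_pairs(links)
print(pairs)
```

Using a set means a page with 20-30 links to the same account still only counts once, and the second sample page drops out entirely because its URL only carries a human-readable name.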

One critical omission here is that I only took “raw” userIDs, not pretty human-readable ones (like “britishlibrary”); this was for practical reasons because I couldn’t easily link the two together. Many items are only linked with human-readable labels in the URLs, but ~96% of pages with an outbound Flickr link have at least one identifiable userID on them, so hopefully the remaining omissions won’t skew the results too much. (I also threw out any group IDs at this point to avoid confusion.)

I used this to run two analyses. One was the most frequently used userIDs – this was the top 5021 userIDs in our records, any ID that had links from approximately 80 pages or more. The other was a random sample of userIDs – 5000 randomly selected from the full set of ~79000. With each sample, I used the number of links on Commons as a proxy for the number of images (which seems fair enough).
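The two samples could be drawn along these lines, starting from the deduplicated userID-page pairs. This is a sketch with made-up data and function names; the threshold of 80 pages comes from the post, while the fixed random seed is just an assumption to make the sample reproducible.

```python
import random
from collections import Counter

def pages_per_user(pairs):
    """Count how many distinct pages link to each userID."""
    return Counter(user_id for user_id, _page in set(pairs))

def heavily_used(counts, min_pages=80):
    """UserIDs linked from at least min_pages pages, most-linked first."""
    return [user for user, n in counts.most_common() if n >= min_pages]

def random_users(counts, k, seed=0):
    """A reproducible random sample of k userIDs from the full set."""
    rng = random.Random(seed)
    return rng.sample(sorted(counts), k)

# Hypothetical pairs, for illustration only.
pairs = [("1@N01", f"File:A{i}.jpg") for i in range(100)]
pairs += [("2@N02", f"File:B{i}.jpg") for i in range(5)]
pairs += [("3@N03", "File:C.jpg")]

counts = pages_per_user(pairs)
top = heavily_used(counts)
sample = random_users(counts, 2)
print(top, sample)
```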

Among the most frequently used source accounts, I found that 50% of images came from Pro accounts, 35% from “at risk” free accounts (more than 1000 images), 3% from “safe” free accounts (under 1000 images), 11% from Flickr Commons (both pro & non-Pro), and 1% were from accounts that are now deactivated or have no images.

In the random sample, I found a somewhat different spread – 60% of images were from Pro accounts, 32% from “at risk” free accounts, 6% from “safe” free accounts, 2% Flickr Commons, and 0.25% missing.

Update: an extended sample of all accounts with ten or more links (19374 in total) broadly resembles the top 5000 – 49% Pro accounts, 35% “at risk” free accounts, 4.5% “safe” free accounts, 10% Flickr Commons accounts, and 1.5% missing.
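For reference, a sketch of how each account might be bucketed and the percentages weighted by Commons links. The field names are invented for the example, and treating an account with exactly 1000 images as “safe” is an assumption, since the categories above only say “more than” and “under”.

```python
from collections import defaultdict

FREE_LIMIT = 1000  # the announced cap on free accounts

def classify(account):
    """Bucket one account; field names here are hypothetical."""
    if not account["photo_count"]:
        return "missing"          # deactivated, or no images left
    if account["flickr_commons"]:
        return "Flickr Commons"   # counted separately, Pro or not
    if account["pro"]:
        return "Pro"
    # Assumption: exactly FREE_LIMIT images counts as safe.
    return "at risk" if account["photo_count"] > FREE_LIMIT else "safe"

def shares(accounts):
    """Percentage of Commons links per bucket (links stand in for images)."""
    totals = defaultdict(int)
    for account in accounts:
        totals[classify(account)] += account["commons_links"]
    grand_total = sum(totals.values())
    return {k: round(100 * v / grand_total, 1) for k, v in totals.items()}

# Hypothetical accounts, for illustration only.
accounts = [
    {"pro": True, "flickr_commons": False, "photo_count": 5000, "commons_links": 50},
    {"pro": False, "flickr_commons": False, "photo_count": 2500, "commons_links": 30},
    {"pro": False, "flickr_commons": False, "photo_count": 400, "commons_links": 10},
    {"pro": False, "flickr_commons": True, "photo_count": 90000, "commons_links": 10},
]
result = shares(accounts)
print(result)
```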

So, some quick conclusions –

  • Openly-licensed material gathered from Flickr is a significant source for Commons – something like 7.5m file description pages link to Flickr, almost certainly as a source, about 15% of all files
  • A substantial amount of material sourced from Flickr comes from a relatively small number of accounts, some institutional and some personal (this was the most common one in my random sample – 58k images)
  • A substantial portion of our heavily used Flickr source accounts are potentially at risk (note that it is not possible to tell how many were once Pro, have lapsed because why bother when it’s free, and may resume paying)
  • It is not as catastrophic as it might at first appear – the samples all suggest that only about a third of potential source images are at risk, once the Flickr Commons accounts are exempted from the limits – which seems to be the plan.
  • Having said that, the figure of 97% of individual free accounts having under a thousand images is no doubt accurate, but probably masks the sheer number of images in many of the larger accounts.

Some things that would potentially still be very interesting to know –

  • What proportion of freely-licensed images are from at-risk accounts?
  • What proportion of images in at-risk accounts are actually freely-licensed?
  • What proportion of freely-licensed images on Flickr have been (or could be) transferred over to Commons?
  • Are Flickr Commons accounts exempt from the size restriction? (As there are only ~150 of them, this seems plausible as a special case…)

Our man in Havana (or, Q56761118)

Has it really been a year since I posted here? Oh, dear. Well. So, this Friday/Saturday I went to the FCO’s hackathon event – wonderfully titled “Who Was Our Man In Havana?” – to have a play with a dataset of British diplomats.

My goal was to try and synch this up with Wikidata in some way – there were obviously some overlaps with the MPs project, but given how closely tied the diplomatic service has been into the establishment, it seemed likely there would be a lot of overlap. The objective of the event was to produce some kind of visualisation/interface, so after a bit of discussion with my team-mates we decided to get the data cleaned up, import some into Wikidata, and pull it out again in an enriched fashion.

The data cleaning was a bit of a challenge. Sev and Mohammed, my team-mates, did excellent work hacking away at the XML and eventually produced a nice, elegantly-parsed, version of the source data.

I uploaded this into Magnus’s mix-and-match tool, using a notional ID number which we could tie back to the records. Hammering away at mix-and-match that evening got me about 400 initial matches to work with. While I was doing this, Sev and Mohammed expanded the XML parsing to include all the positions held plus dates, tied back to the notional IDs in mix-and-match.

On Saturday, I wrote a script to pull down the mix-and-match records, line them up with the expanded parsing data, and put that into a form that could be used for QuickStatements. Thankfully, someone had already established a clear data model for diplomatic positions, so I was able to build on that to work out how to handle the positions without having to invent it from scratch.

The upload preparation was necessarily a messily manual process – I ended up compromising with a script generating a plain TSV which I could feed into a spreadsheet and then manually look up (eg) the relevant Wikidata IDs for positions. If I’d had more time we could have put together something which automatically looked up position IDs in a table and then produced a formatted sheet (or even sent it out through something like wikidata-cli), but I wanted a semi-manual approach for this stage so I could keep an eye on the data and check it was looking sensible. (Thanks at this point also to @tagishsimon, who helped with the matching and updating on mix-and-match). And then I started feeding it in, lump by lump. Behold, success!
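For the curious, the automated version might have looked something like this sketch. It is not the script used at the event. It assumes the tab-separated QuickStatements format with P39 (position held), P580 (start time) and P582 (end time), and year-precision dates. The item IDs, the lookup table and the sample record are placeholders.

```python
# Placeholder lookup table: the QIDs here are NOT real Wikidata items.
POSITION_IDS = {"Ambassador to Norway": "Q_POSITION"}

def year(value):
    """Format a year as a QuickStatements time value with year precision (/9)."""
    return f"+{int(value):04d}-00-00T00:00:00Z/9"

def position_row(person_qid, position_qid, start=None, end=None):
    """Build one tab-separated 'position held' statement with date qualifiers."""
    fields = [person_qid, "P39", position_qid]
    if start is not None:
        fields += ["P580", year(start)]
    if end is not None:
        fields += ["P582", year(end)]
    return "\t".join(fields)

def rows_for(records):
    """Yield rows for records whose position is in the table; skip the rest."""
    for record in records:
        position_qid = POSITION_IDS.get(record["position"])
        if position_qid is None:
            continue  # leave unmatched positions for manual checking
        yield position_row(record["person"], position_qid,
                           record.get("start"), record.get("end"))

# Hypothetical record, for illustration only.
records = [{"person": "Q_PERSON", "position": "Ambassador to Norway",
            "start": 1905, "end": 1942}]
rows = list(rows_for(records))
print("\n".join(rows))
```

Skipping unmatched positions rather than guessing keeps the same spirit as the semi-manual approach: anything the table does not recognise still gets looked at by a person.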

While I was doing this, Mohammed assembled a front-end display, which used vue.js to format and display a set of ambassadors drawn from a Wikidata SPARQL query. It concentrated on a couple of additional things to demonstrate the enrichment available from Wikidata – a picture and some notes of other non-ambassadorial positions they’d held.

To go alongside this, as a demonstration of other linkages that weren’t exposed in our tool, I knocked up a couple of quick visualisations through the Wikidata query tool: a map of where British ambassadors to Argentina were born (mainly the Home Counties and India!), or a chart of where ambassadors/High Commissioners were educated (Eton, perhaps unsurprisingly, making a good showing). It’s remarkable how useful the query service is for whipping up this kind of visualisation.

We presented this on Saturday afternoon and it went down well – we won a prize! A bottle of wine and – very appropriately – mugs with the famed Foreign Office cat on them. A great weekend, even if it did mean an unreasonably early Saturday start!

So, some thoughts on the event in conclusion:

  • It was very clear how well the range of skills worked at an event like this. I don’t think any of us could have produced the result on our own.
  • A lot of time – not just our group, but everyone – was spent parsing and massaging the (oddly structured) XML. Had the main lists been available as a CSV/TSV, this might have been a lot quicker. I certainly wouldn’t have been able to get anywhere with it myself.
  • On the data quality note, we were lucky that the names of records were more or less unique strings, but an ID number for each record inserted when the original XML was generated might have saved a bit of time.
  • A handful of people could go from a flat file of names, positions, dates to about a thousand name-position pairs on Wikidata, some informative queries, and a prototype front-end viewer with a couple of days of work, and some of that could have been bypassed with cleaner initial data. This is really promising for

And on the Wikidata side, there are a few modelling questions this has thrown up:

  • I took the decision not to change postings based on the diplomatic rank – eg someone who was officially the “Minister to Norway” (1905-1942) conceptually held the same post as someone who was “Ambassador to Norway” (1942-2018). If desired, we can represent the rank as a qualifier on the item (eg subject has role: “chargé d’affaires”). This seemed to make the most sense – “ambassadors with a small ‘a'”.
  • The exception to this is High Commissioners, who are currently modelled parallel to Ambassadors – same hierarchy but in parallel. This lets us find all the HCs without simply treating them as “Ambassadors with a different job title”.

    However, this may not be a perfect approach as some HCs changed to Ambassadors and back again (eg Zimbabwe) when a country leaves/rejoins the Commonwealth. At the moment these are modelled by picking one for a country and sticking to it, with the option of qualifiers as above, but a better approach might be needed in the long run.
  • Dates as given are the dates of service. A few times – especially in the 19th century when journeys were more challenging – an ambassador was appointed but did not proceed overseas. These have been imported with no start-end dates, but this isn’t a great solution. Arguably they could have a start/end date in the same year and a qualifier to say they did not take up the post; alternatively, you could make a case that they should not be listed as ambassadors at all.