Posts Tagged ‘wikidata’

History of Parliament and Wikidata – the first round complete

Sunday, August 14th, 2016

Back in January, I wrote up some things I was aiming to do this year, including:

Firstly, I’d like to clear off the History of Parliament work on Wikidata. I haven’t really written this up yet (maybe that’s step 1.1) but, in short, I’m trying to get every MP in the History of Parliament database listed and crossreferenced in Wikidata. At the moment, we have around 5200 of them listed, out of a total of 22200 – so we’re getting there. (Raw data here.) Finding the next couple of thousand who’re listed, and mass-creating the others, is definitely an achievable task.

Well, seven months later, here’s where it stands:

  • 9,372 of a total 21,400 History of Parliament entries (43.7%) have been matched to records for people in Wikidata.
  • These 9,372 entries represent 7,257 people – 80 have entries in three HoP volumes, and 1,964 in two volumes. (This suggests that, when complete, we will have about 16,500 people for those initial 21,400 entries – so maybe we’re actually over half-way there).
  • These are crossreferenced to a lot of other identifiers. 1,937 of our 7,257 people (26.7%) are in the Oxford Dictionary of National Biography, 1,088 (15%) are in the National Portrait Gallery database, and 2,256 (31.1%) are linked to their speeches in the digital edition of Hansard. There is a report generated each night crosslinking various interesting identifiers.
  • Every MP in the 1820-32 volume (1,367 of them) is now linked and identified, and the 1790-1820 volume is now around 85% complete. (This explains the high showing for Hansard, which covers 1805 onwards.)
  • The metadata for these is still limited – a lot more importing work to do – but in some cases pretty decent; 94% of the 1820-32 entries have a date of death, for example.
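
The figure of about 16,500 people is simple proportional scaling. Here is a rough sketch of the arithmetic; the variable names are mine, and it assumes the still-unmatched entries repeat across volumes at the same rate as the ones matched so far, which may well not hold:

```python
# Figures quoted in the list above.
matched_entries = 9372
matched_people = 7257
total_entries = 21400

# Assumption: unmatched entries overlap between volumes at the same
# rate as the entries matched so far.
people_per_entry = matched_people / matched_entries
estimated_people = round(total_entries * people_per_entry)

print(f"people per entry: {people_per_entry:.3f}")
print(f"estimated people overall: {estimated_people}")
```

The result lands a little above 16,500, which is as much precision as the assumption deserves.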

Of course, there’s a lot more still to do – more metadata to add, more linkages to make, and so on. Wikidata still does not have any reasonable data linking MPs to constituencies, which is a major gap (but perhaps one that can be filled semi-automatically using the HoP/Hansard links and a clever script).

But as a proof of concept, I’m very happy with it. Here’s some queries playing with the (1820-32) data:

  • There are 990 MPs with an article about them in at least one language/WM project. Strikingly, ten of these don’t have an English Wikipedia article (yet). The most heavily written-about MP is – to my surprise – David Ricardo, with articles in 67 Wikipedias. (The next three are Peel, Palmerston, and Edward Bulwer-Lytton).
  • 303 of the 1,367 MPs (22.1%) have a recorded link to at least one other person in Wikidata by a close family relationship (parent, child, spouse, sibling) – there are 803 links, to 547 unique people – 108 of whom are also in the 1820-32 MPs list, and 439 of whom are from elsewhere in Wikidata. (I expect this number to rise dramatically as more metadata goes in).
  • The longest-surviving pre-Reform MP (of the 94% indexed by deathdate, anyway) was John Savile, later Earl of Mexborough, who made it to August 1899…
  • Of the 360 with a place of education listed, the most common is Eton (104), closely followed by Christ Church, Oxford (97) – there is, of course, substantial overlap between them. It’s impressive to see just how far we’ve come. No-one would ever expect to see anything like that for Parliament today, would we?
  • Of the 1,185 who’ve had first name indexed by Wikidata so far, the most popular is John (14.4%), then William (11.5%), Charles (7.5%), George (7.4%), and Henry (7.2%):

  • A map of the (currently) 154 MPs whose place of death has been imported:

All these are of course provisional, but it makes me feel I’m definitely on the right track!

So, you may be asking, what can I do to help? Why, thank you, that’s very kind…

  • First of all, this is the master list, updated every night, of as-yet-unmatched HoP entries. Grab one, load it up, search Wikidata for a match, and add it (property P1614). Bang, one more down, and we’re 0.01% closer to completion…
  • It’s not there? (About half to two thirds probably won’t be). You can create an item manually, or you can set it aside to create a batch of them later. I wrote a fairly basic bash script to take a spreadsheet of HoP identifiers and basic metadata and prepare it for bulk-item-creation on Wikidata.
  • Or you could help sanitise some of the metadata – here’s some interesting edge cases:
    • This list is ~680 items who probably have a death date (the HoP slug ends in a number), but who don’t currently have one in Wikidata.
    • This list is ~540 people who are titled “Honourable” – and so are almost certainly the sons of noblemen, themselves likely to be in Wikidata – but who don’t have a link to their father. This list is the same, but for “Lord”, and this list has all the apparently fatherless men who were the 2nd through 9th holders of a title…
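
The "slug ends in a number" test behind the first of those lists is easy to sketch. This is only an illustration: the function names are hypothetical, the slugs are made-up examples of the general shape, and the check can only ever say "probably", because a slug with nothing but a birth year also ends in a number.

```python
import re
from typing import Optional

# Matches a slug whose final hyphen-separated part is all digits.
TRAILING_NUMBER = re.compile(r"-\d+$")

def probably_has_death_date(slug: str) -> bool:
    """Heuristic from the text: the HoP slug ends in a number."""
    return TRAILING_NUMBER.search(slug) is not None

def needs_death_date(slug: str, wikidata_death_year: Optional[int]) -> bool:
    """True when the slug hints at a death date but Wikidata records none."""
    return probably_has_death_date(slug) and wikidata_death_year is None

# Made-up examples: (slug, death year currently in Wikidata or None).
examples = [
    ("example-john-1780-1850", None),   # worth checking
    ("example-john-1780-1850", 1850),   # already filled in
    ("sample-william", None),           # no hint in the slug
]
worklist = [slug for slug, year in examples if needs_death_date(slug, year)]
print(worklist)
```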

Wikidata and identifiers – part 2, the matching process

Thursday, November 27th, 2014

Yesterday, I wrote about the work we’re doing matching identifiers into Wikidata. Today, the tools we use for it!


The main tool we’re using is a beautiful thing Magnus developed called mix-and-match. It imports all the identifiers with some core metadata – for the ODNB, for example, this was names and dates and the brief descriptive text – and sorts them into five groups:

  • Manually matched – these matches have been confirmed by a person (or imported from data already in Wikidata);
  • Automatic – the system has guessed these are probably the same people but wants human confirmation;
  • Unmatched – we have no idea who these identifiers match to;
  • No Wikidata – we know there is currently no Wikidata match;
  • N/A – this identifier shouldn’t match to a Wikidata entity (for example, it’s a placeholder, a subject Wikidata will never cover, or a cross-reference with its own entry).

The goal is to work through everything and move as much as possible to “manually matched”. Anything in this group can then be migrated over to Wikidata with a couple of clicks. Here’s the ODNB as it stands today:

(Want to see what’s happening with the data? The recent changes link will show you the last fifty edits to all the lists.)

So, how do we do this? Firstly, you’ll need a Wikipedia account, and to log in to our “WiDaR” authentication tool. Follow the link on the top of the mix-and-match page (or, indeed, this one), sign in with your Wikipedia account if requested, and you’ll be authorised.

On to the matching itself. There are two methods – manually, or in a semi-automated “game mode”.

How to match – manually

The first approach works line-by-line. Clicking on one of the entries – here, unmatched ODNB – brings up the first fifty entries in that set. Each one has options on the left-hand side – to search Wikidata or English Wikipedia, either by the internal search or Google. On the right-hand side, there are three options – “set Q”, to provide it with a Wikidata ID (these are all of the form Q followed by a number, and so we often call them “Q numbers”); “No WD”, to list it as not on Wikidata; and “N/A”, to record that it’s not appropriate for Wikidata matching.

If you’ve found a match on Wikidata, the ID number should be clearly displayed at the top of that page. Click “set Q” and paste it in. If you’ve found a match via Wikipedia, you can click the “Wikidata” link in the left-hand sidebar to take you to the corresponding Wikidata page, and get the ID from there.

After a moment, it’ll display a very rough-and-ready precis of what’s on Wikidata next to that line –

– which makes it easy to spot if you’ve accidentally pasted in the wrong code! Here, we’ve identified one person (with rather limited information, just gender and deathdate, currently in Wikidata), and marked another as definitely not found.

If you’re using the automatically matched list, you’ll see something like this:

– it’s already got the data from the possible matches but wants you to confirm. Clicking on the Q-number will take you to the provisional Wikidata match, and from there you can get to relevant Wikipedia articles if you need further confirmation.

How to match – game mode

We’ve also set up a “game mode”. This is suitable when we expect a high number of the unmatched entries to be connectable to Wikipedia articles; it gives you a random entry from the unmatched list, along with a handful of possible results from a Wikipedia search, and asks you to choose the correct one if it’s there. You can get to it by clicking [G] next to the unmatched entries.

Here’s an example, using the OpenPlaques database.

In this one, it was pretty clear that their Roy Castle is the same as the first person listed here (remember him?), so we click the blue Q-number; it’s marked as matched, and the game generates a new entry. Alternatively, we could look him up elsewhere and paste the Q-number or Wikipedia URL in, then click the “set Q” button. If our subject’s not here – click “skip” and move on to the next one.

Finishing up

When you’ve finished matching, go back to the main screen and click the [Y] at the end of the list. This allows you to synchronise the work you’ve done with Wikidata – it will make the edits to Wikidata under your account. (There is also an option to import existing matches from Wikidata, but at the moment the mix-and-match database is a bit out of synch and this is best avoided…) There’s no need to do this if you’re feeling overly cautious, though – we’ll synchronise them soon enough. The same page will also report any cases where two distinct Wikidata entries have been matched to the same identifier, which (usually) shouldn’t happen.

If you want a simple export of the matched data, you can click the [D] link for a TSV file (Q-number, identifier, identifier URL & name if relevant), and some stats on how many matches to individual wikis are available with [S].
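
If you’d rather check an export yourself for the doubled-up matches mentioned above, something like the following would do it. It is a sketch only: it assumes the TSV has no header row and that the first two columns are the Q-number and the identifier, in that order, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

def find_conflicts(tsv_text):
    """Return {identifier: set of Q-numbers} for identifiers matched to
    more than one Wikidata item."""
    items_by_identifier = defaultdict(set)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        q_number, identifier = row[0].strip(), row[1].strip()
        items_by_identifier[identifier].add(q_number)
    return {ident: qs for ident, qs in items_by_identifier.items()
            if len(qs) > 1}

# Invented sample in the assumed column order:
# Q-number, identifier, identifier URL, name.
sample = (
    "Q100\t12345\thttp://example.org/12345\tJohn Example\n"
    "Q200\t12345\thttp://example.org/12345\tJohn Example\n"
    "Q300\t67890\thttp://example.org/67890\tWilliam Sample\n"
)
conflicts = find_conflicts(sample)
print(conflicts)
```

Anything it reports still needs a human look, since some identifiers legitimately cover more than one item.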

Brute force

Finally, if you have a lot of matched data, and you are confident it’s accurate without needing human confirmation, then you can adopt the brute-force method – QuickStatements. This is the tool used for pushing data from mix-and-match to Wikidata, and can be used for any data import. Instructions are on that page – but if you’re going to use it, test it with a few individual items first to make sure it’s doing what you think, and please don’t be shy to ask for help…
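
To give a flavour of what such an import looks like, here is a minimal sketch that turns a small spreadsheet into tab-separated QuickStatements commands. Several things here are assumptions rather than anything definitive: the column names and sample rows are invented, the command syntax (CREATE, LAST, Len for an English label, quoted strings) is as I understand the tool to accept it, P31 and Q5 are assumed to be "instance of" and "human", and Q4115189 is used as a placeholder item. Check the tool’s own instructions before running anything for real.

```python
import csv
import io

HOP_ID = "P1614"     # History of Parliament ID
INSTANCE_OF = "P31"  # assumption: "instance of"
HUMAN = "Q5"         # assumption: the item for "human"

# Hypothetical spreadsheet export; column names and values are invented.
# Q4115189 is a placeholder (assumed to be the Wikidata sandbox item).
SAMPLE_CSV = """hop_id,name,qid
1820-1832/member/example-john-1780-1850,John Example,Q4115189
1820-1832/member/sample-william-1790-1860,William Sample,
"""

def to_quickstatements(rows):
    """Build command lines: one claim for a matched row, a new item otherwise."""
    lines = []
    for row in rows:
        hop_id = row["hop_id"].strip()
        name = row["name"].strip()
        qid = (row.get("qid") or "").strip()
        if qid:
            # Already matched: attach the identifier to the existing item.
            lines.append(f'{qid}\t{HOP_ID}\t"{hop_id}"')
        else:
            # Unmatched: create an item, then add label, type and
            # identifier to it via LAST.
            lines.append("CREATE")
            lines.append(f'LAST\tLen\t"{name}"')
            lines.append(f"LAST\t{INSTANCE_OF}\t{HUMAN}")
            lines.append(f'LAST\t{HOP_ID}\t"{hop_id}"')
    return lines

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
commands = to_quickstatements(rows)
print("\n".join(commands))
```

Generating the commands as plain text first means you can read through them, and try two or three by hand, before committing to the whole batch.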

So, we’ve covered a) what we’re doing; and b) how we get the information into Wikidata. Next instalment, how to actually use these identifiers for your own purposes…

Wikidata identifiers and the ODNB – where next?

Wednesday, November 26th, 2014

Wikidata, for those of you unfamiliar with it, is the backend we are developing for Wikipedia. At its simplest, it’s a spine linking together the same concept in different languages – so we can tell that a coronation in English matches Tacqoyma in Azeri or Коронація in Ukrainian, or thirty-five other languages between. This all gets bundled up into a single data entry – the enigmatically named Q209715 – which then gets other properties attached. In this case, a coronation is a kind of (or subclass of, for you semanticians) “ceremony” (Q2627975), and is linked to a few external thesauruses. The system is fully multilingual, so we can express “coronation – subclass of – ceremony” in English as easily as “kroning – undergruppe af – ceremoni” in Danish.
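
For the programmers, a toy model may make this clearer. It is a deliberately simplified sketch, not Wikidata’s actual data model or API: the class and function names are made up, and P279 is assumed to be the ID of “subclass of”. The point is that statements are stored as language-neutral IDs, and labels are only looked up when rendering.

```python
from dataclasses import dataclass, field

SUBCLASS_OF = "P279"  # assumption: the property ID for "subclass of"

@dataclass
class Item:
    qid: str
    labels: dict                                    # language code -> label
    statements: list = field(default_factory=list)  # (property ID, target Q-number)

# Labels taken from the examples in the text.
property_labels = {SUBCLASS_OF: {"en": "subclass of", "da": "undergruppe af"}}
ceremony = Item("Q2627975", {"en": "ceremony", "da": "ceremoni"})
coronation = Item("Q209715", {"en": "coronation", "da": "kroning"},
                  [(SUBCLASS_OF, "Q2627975")])
items = {item.qid: item for item in (ceremony, coronation)}

def render(item, lang):
    """Spell out each statement about an item in the chosen language."""
    return [
        f"{item.labels[lang]} – {property_labels[prop][lang]} – {items[target].labels[lang]}"
        for prop, target in item.statements
    ]

print(render(coronation, "en"))
print(render(coronation, "da"))
```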

So far, so good.

There has been a great deal of work around Wikipedia in recent years in connecting our rich-text articles to static authority control records – confirming that our George Washington is the same as the one the Library of Congress knows about. During 2012-13, these were ingested from Wikipedia into Wikidata, and as of a year ago we had identified around 420,000 Wikidata entities with authority control identifiers. Most of these were from VIAF, but around half had an identifier from the German GND database, another half from ISNI, and a little over a third LCCN identifiers. Many had all four (and more). We now support matching to a large number of library catalogue identifiers, but – speaking as a librarian – I’m aware this isn’t very exciting to anyone who doesn’t spend much of their time cataloguing…

So, the next phase was to move beyond simply “authority” identifiers and move to ones that actually provide content. The main project that I’ve been working on (along with Charles Matthews and Magnus Manske, with the help of Jo Payne at OUP) is matching Wikidata to the Oxford Dictionary of National Biography – Wikipedia authors tend to hold the ODNB in high regard, and many of our articles already use it as a reference work. We’re currently about three-quarters of the way through, having identified around 40,000 ODNB entries that have been clearly matched to a Wikidata entity, and the rest should be finished some time in 2015. (You can see the tool here, and how to use that will be a post for another day.) After that, I’ve been working on a project to make links between Wikidata and the History of Parliament (with the assistance of Matthew Kilburn and Paul Seaward) – looking forward to being able to announce some results from this soon.

What does this mean? Well, for a first step, it means we can start making better links to a valuable resource on a more organised basis – for example, Robin Owain and I recently deployed an experimental tool on the Welsh Wikipedia that will generate ODNB links at the end of any article on a relevant subject (see, eg, Dylan Thomas). It means we can start making the Wikisource edition of the (original) Dictionary of National Biography more visible. It means we can quickly generate worklists – you want suitable articles to work on? Well, we have all these interesting and undeniably notable biographies not yet covered in English (or Welsh, or German, or…)

For the ODNB, it opens up the potential for linking to other interesting datasets (and that without having to pass through Wikidata – all this can be exported). At the moment, we can identify matches to twelve thousand ISNIs, twenty thousand VIAF identifiers, and – unexpectedly – a thousand entries in IMDb. (Ten of them are entries for “characters”, which opens up a marvellous conceptual can of worms, but let’s leave that aside…).

And for third parties? Well, this is where it gets interesting. If you have ODNB links in your dataset, we can generate Wikipedia entries (probably less valuable, but in oh so many languages). We can generate images for you – Wikidata knows about openly licensed portraits for 214,000 people. Or we can crosswalk to whatever other project we support – YourPaintings links, perhaps? We can match a thousand of those. It can go backwards – we can take your existing VIAF links and give you ODNB entries. (Cataloguers, take note.)
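
The crosswalk idea is simple enough to sketch in a few lines. This is an illustration only: the records, Q-numbers and identifier values are all invented, and a real version would work from an export or a query rather than a hand-written dictionary.

```python
# Invented records: Q-number -> {identifier scheme: value}.
wikidata_people = {
    "Q1000001": {"viaf": "11111111", "odnb": "10001"},
    "Q1000002": {"viaf": "22222222"},
    "Q1000003": {"odnb": "10003", "isni": "0000 0000 0000 0003"},
}

def crosswalk(records, source, target):
    """Map source identifiers to target identifiers wherever a single
    item carries both."""
    return {
        ids[source]: ids[target]
        for ids in records.values()
        if source in ids and target in ids
    }

# "Take your existing VIAF links and give you ODNB entries":
viaf_to_odnb = crosswalk(wikidata_people, "viaf", "odnb")
print(viaf_to_odnb)
```

Because every scheme hangs off the same item, the same function runs in either direction, or between any other pair of identifiers.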

And, best of all, we can ingest that data – and once it’s in Wikidata, the next third party to come along can make the links directly to you, and every new dataset makes the existing ones more valuable. Right now, we have a lot of authority control data, but we’re lighter on serious content links. If you have a useful online project with permanent identifiers, and you’d like to start matching those up to Wikidata, please do get in touch – this is really exciting work and we’d love to work with anyone wanting to help take it forward.

Update: Here’s part 2: on how to use the mix-and-match tool.

Laws on Wikidata

Tuesday, September 9th, 2014

So, I had the day off, and decided to fiddle a little with Wikidata. After some experimenting, it now knows about:

  • 1516 Acts of the Parliament of the United Kingdom (1801-present)
  • 194 Acts of the Parliament of Great Britain (1707-1800)
  • 329 Acts of the Parliament of England (to 1707)
  • 20 Acts of the Parliament of Scotland (to 1707)
  • 19 Acts of the Parliament of Ireland (to 1800)

(Acts of the modern devolved parliaments for NI, Scotland, and Wales will follow.)

Each has a specific “instance of” property – Q18009569, for example, is “act of the Parliament of Scotland” – and is set up as a subclass of the general “act of parliament”. At the moment, there are detailed subclasses for the UK and Canada (which has a separate class for each province’s legislation) but nowhere else. Yet…

These numbers are slightly fuzzy – it’s mainly based on Wikipedia articles and so there are a small handful of cases where the entry represents a particular clause (eg Q7444697, s.4 and s.10 of the Human Rights Act 1998), or cases where multiple statutes are treated in the same article (eg Q1133144, the Corn Laws), but these are relatively rare and, mostly, it’s a good direct correspondence. (I’ve been fairly careful to keep out oddities, but of course, some will creep in…)

So where next? At the moment, these almost all reflect Wikipedia articles. Only 34 have a link to (English) Wikisource, though I’d guess there’s about 200-250 statutes currently on there. Matching those up will definitely be valuable; for legislation currently in force and on the Statute Law Database, it would be good to be able to crosslink to there as well.