Wikidata and identifiers – part 2, the matching process

November 27th, 2014 by

Yesterday, I wrote about the work we’re doing matching identifiers into Wikidata. Today, the tools we use for it!

Mix-and-match

The main tool we’re using is a beautiful thing Magnus developed called mix-and-match. It imports all the identifiers with some core metadata – for the ODNB, for example, this was names and dates and the brief descriptive text – and sorts them into five groups:

  • Manually matched – these matches have been confirmed by a person (or imported from data already in Wikidata);
  • Automatic – the system has guessed these are probably the same people but wants human confirmation;
  • Unmatched – we have no idea who these identifiers match to;
  • No Wikidata – we know there is currently no Wikidata match;
  • N/A – this identifier shouldn’t match to a Wikidata entity (for example, it’s a placeholder, a subject Wikidata will never cover, or a cross-reference with its own entry).
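
For anyone who prefers to think of this in code, here is a minimal sketch of the five groups as a Python enumeration, with a helper that picks out the entries ready to go to Wikidata. The names (MatchState, ready_to_sync) and the sample entries are made up for illustration; this is not how mix-and-match stores things internally.

```python
from enum import Enum


class MatchState(Enum):
    """The five groups an imported identifier can sit in (names are illustrative)."""
    MANUAL = "manually matched"   # confirmed by a person, or imported from Wikidata
    AUTOMATIC = "automatic"       # guessed by the system, awaiting confirmation
    UNMATCHED = "unmatched"       # no candidate match yet
    NO_WIKIDATA = "no Wikidata"   # known to have no Wikidata item at present
    NOT_APPLICABLE = "N/A"        # should never be matched


def ready_to_sync(entries):
    """Return only the entries whose match has been confirmed by a person."""
    return [entry for entry in entries if entry["state"] is MatchState.MANUAL]


# Hypothetical sample data: identifiers and Q numbers are placeholders.
entries = [
    {"identifier": "example-1", "state": MatchState.MANUAL, "qid": "Q42"},
    {"identifier": "example-2", "state": MatchState.AUTOMATIC, "qid": "Q43"},
    {"identifier": "example-3", "state": MatchState.UNMATCHED, "qid": None},
]

for entry in ready_to_sync(entries):
    print(entry["identifier"], entry["qid"])
```

Only the manually matched entry is printed; the automatic guess stays behind until someone confirms it.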

The goal is to work through everything and move as much as possible to “manually matched”. Anything in this group can then be migrated over to Wikidata with a couple of clicks. Here’s the ODNB as it stands today:

(Want to see what’s happening with the data? The recent changes link will show you the last fifty edits to all the lists.)

So, how do we do this? Firstly, you’ll need a Wikipedia account, and to log in to our “WiDaR” authentication tool. Follow the link on the top of the mix-and-match page (or, indeed, this one), sign in with your Wikipedia account if requested, and you’ll be authorised.

On to the matching itself. There are two methods – manually, or in a semi-automated “game mode”.

How to match – manually

The first approach works line-by-line. Clicking on one of the entries – here, unmatched ODNB – brings up the first fifty entries in that set. Each one has options on the left-hand side – to search Wikidata or English Wikipedia, either by the internal search or Google. On the right-hand side, there are three options – “set Q”, to provide it with a Wikidata ID (these are all of the form Q followed by a number, and so we often call them “Q numbers”); “No WD”, to list it as not on Wikidata; “N/A”, to record that it’s not appropriate for Wikidata matching.

If you’ve found a match on Wikidata, the ID number should be clearly displayed at the top of that page. Click “set Q” and paste it in. If you’ve found a match via Wikipedia, you can click the “Wikidata” link in the left-hand sidebar to take you to the corresponding Wikidata page, and get the ID from there.
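
If you are collecting Q numbers in a script rather than pasting them one at a time, it helps to check what you have before using it. The sketch below accepts either a bare Q number or a Wikidata page address and returns the ID. The function name extract_qid is made up, and the two address patterns (/wiki/ and /entity/) are an assumption about how the links you copy will look.

```python
import re

# A bare Wikidata ID: "Q" followed by a number with no leading zero.
BARE_QID = re.compile(r"Q[1-9][0-9]*")
# Assumed link shapes: .../wiki/Q42 or .../entity/Q42 on wikidata.org.
WIKIDATA_URL = re.compile(r"wikidata\.org/(?:wiki|entity)/(Q[1-9][0-9]*)\b")


def extract_qid(pasted):
    """Return the Q number found in the pasted text, or None if there is none."""
    text = pasted.strip()
    if BARE_QID.fullmatch(text):
        return text
    found = WIKIDATA_URL.search(text)
    return found.group(1) if found else None


print(extract_qid("  Q42 "))
print(extract_qid("https://www.wikidata.org/wiki/Q42"))
print(extract_qid("https://en.wikipedia.org/wiki/Q10"))
```

The last call returns None on purpose: a Wikipedia address that happens to contain something Q-shaped is not a Wikidata ID, so it is safer to reject it than to guess.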

After a moment, it’ll display a very rough-and-ready precis of what’s on Wikidata next to that line, which makes it easy to spot if you’ve accidentally pasted in the wrong code! Here, we’ve identified one person (with rather limited information, just gender and death date, currently in Wikidata), and marked another as definitely not found.

If you’re using the automatically matched list, each entry has already got the data from the possible match but wants you to confirm it. Clicking on the Q-number will take you to the provisional Wikidata match, and from there you can get to relevant Wikipedia articles if you need further confirmation.

How to match – game mode

We’ve also set up a “game mode”. This is suitable when we expect a high number of the unmatched entries to be connectable to Wikipedia articles; it gives you a random entry from the unmatched list, along with a handful of possible results from a Wikipedia search, and asks you to choose the correct one if it’s there. You can get to it by clicking [G] next to the unmatched entries.

Here’s an example, using the OpenPlaques database.

In this one, it was pretty clear that their Roy Castle is the same as the first person listed here (remember him?), so we click the blue Q-number; it’s marked as matched, and the game generates a new entry. Alternatively, we could look him up elsewhere and paste the Q-number or Wikipedia URL in, then click the “set Q” button. If our subject’s not here – click “skip” and move on to the next one.

Finishing up

When you’ve finished matching, go back to the main screen and click the [Y] at the end of the list. This allows you to synchronise the work you’ve done with Wikidata – it will make the edits to Wikidata under your account. (There is also an option to import existing matches from Wikidata, but at the moment the mix-and-match database is a bit out of synch and this is best avoided…) There’s no need to do this if you’re feeling overly cautious, though – we’ll synchronise them soon enough. The same page will also report any cases where two distinct Wikidata entries have been matched to the same identifier, which (usually) shouldn’t happen.
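
The check for clashing matches is easy to reproduce on your own data. Here is a minimal sketch, assuming you have your matches as (identifier, Q number) pairs; the function name find_conflicts and the sample values are made up, and this is not the code the synchronisation page itself runs.

```python
from collections import defaultdict


def find_conflicts(matches):
    """Map each identifier to its Q numbers wherever more than one distinct item was matched."""
    items_by_identifier = defaultdict(set)
    for identifier, qid in matches:
        items_by_identifier[identifier].add(qid)
    return {
        identifier: sorted(qids)
        for identifier, qids in items_by_identifier.items()
        if len(qids) > 1
    }


# Hypothetical pairs: "101" has been matched to two different items.
matches = [("101", "Q1001"), ("101", "Q1002"), ("102", "Q1003"), ("102", "Q1003")]
print(find_conflicts(matches))
```

Repeating the same match twice is harmless and is not reported; only identifiers pointing at two different items are.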

If you want a simple export of the matched data, you can click the [D] link for a TSV file (Q-number, identifier, identifier URL & name if relevant), and some stats on how many matches to individual wikis are available with [S].
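
Reading the export back in takes only a few lines. This is a sketch under assumptions: that the columns come in the order given above (Q-number, identifier, identifier URL, name), that there is no header row, and that the last two columns may be missing. Check a real download before relying on it. The function name read_matches and the sample rows are made up.

```python
import csv
import io


def read_matches(tsv_text):
    """Parse exported rows into dictionaries, skipping blank or one-column lines."""
    # QUOTE_NONE keeps quotation marks in names as ordinary characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) < 2:
            continue
        rows.append({
            "qid": fields[0],
            "identifier": fields[1],
            "url": fields[2] if len(fields) > 2 else "",
            "name": fields[3] if len(fields) > 3 else "",
        })
    return rows


# Hypothetical export: the second row has no URL or name.
sample = "Q1001\t12345\thttp://example.org/12345\tJane Example\n\nQ1002\t678\n"
for row in read_matches(sample):
    print(row["qid"], row["identifier"], row["name"])
```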

Brute force

Finally, if you have a lot of matched data, and you are confident it’s accurate without needing human confirmation, then you can adopt the brute-force method – QuickStatements. This is the tool used for pushing data from mix-and-match to Wikidata, and can be used for any data import. Instructions are on that page – but if you’re going to use it, test it with a few individual items first to make sure it’s doing what you think, and please don’t be shy to ask for help…
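
As a rough illustration, the sketch below turns matched pairs into lines to paste into QuickStatements. It assumes the tab-separated command format (item, property, then the value in double quotes for a string), so do check that against the instructions on the tool's page. The function name is made up, and P1415 is used here on the assumption that it is the ODNB identifier property; verify the property number on Wikidata before importing anything.

```python
def to_quickstatements(matches, property_id):
    """Build one tab-separated command per (Q number, identifier) pair."""
    lines = []
    for qid, identifier in matches:
        # String values go in double quotes (assumed QuickStatements convention).
        lines.append(f'{qid}\t{property_id}\t"{identifier}"')
    return "\n".join(lines)


# Hypothetical pairs; P1415 is an assumption, see the note above.
matches = [("Q1001", "12345"), ("Q1002", "678")]
print(to_quickstatements(matches, "P1415"))
```

Generating the lines and pasting in just the first two or three is a cheap way to follow the advice about testing on a few individual items.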

So, we’ve covered a) what we’re doing; and b) how we get the information into Wikidata. Next instalment, how to actually use these identifiers for your own purposes…

10 Responses to “Wikidata and identifiers – part 2, the matching process”

  1. Generalising » Blog Archive » Wikidata identifiers and the ODNB – where next? Says:

    […] Wikidata identifiers and the ODNB – where next? […]

  2. James Heald Says:

    Hi Andrew,

    Thanks for a really useful couple of posts.

    A couple of questions:

    * I’ve recently added a couple of hundred links for “My Paintings” directly, without using Mix’n’match. Will the tool become aware of these and update?

    * Secondly, what are the criteria for producing Automatic matches, and what format does the tool need them in? Under what circumstances are we likely to be able to make an automatic match, but still need the additional confidence of manual confirmation?

    — James.

  3. Andrew Says:

    Hi James,

    1) They will do if you go to the synchronisation page ([Y]) and hit “Update mix-and-match”. The database connection seems to be lagging a bit, though, so this may not work for a little while. Anything synchronised through this goes directly into “matched manually”.

    2) Auto-matched is mostly from a simple script that looks at names. It gets a lot of matches, but a lot of false positives (especially with the John Smiths). For Venn (and some others?) we pre-populated the auto-matched with all the existing links from Wikipedia. About 90% of these are to the same person as the subject of the article, but enough aren’t to make it worth checking.

    If you have a set of probable matches, then we may be able to import them to auto-matched from a tsv/csv of Q-numbers + identifiers.

  4. Pierre Says:

    I was wondering if there was a mix-and-match like tool to bring a list of labels and get wikidata items for them:
    We have a nice food taxonomy on Open Food Facts, and we want to link it to Wikidata. We can either do that by adding Open Food Facts identifiers in Wikidata or adding Wikidata identifiers in the Open Food Facts taxonomy.

    http://en.wiki.openfoodfacts.org/Global_categories_taxonomy

    Is there such a tool, and what would you advise ?

  5. Andrew Says:

    Hi Pierre,

    I don’t think there’s such a tool (at least, not at the moment). I would have suggested the excellent WikiDataQuery, but oddly it doesn’t seem to include the option to query labels: https://wdq.wmflabs.org/api_documentation.html

    Do you have any other identifiers embedded, or Wikipedia links?

  6. Pierre Says:

    We’re creating Wikipedia/Wikidata links, and we have identifiers for various things (see https://www.wikidata.org/wiki/Wikidata:Bot_requests#Adding_UNII_identifiers)

  7. Pierre Says:

    We have created a food project on Wikidata and we’ve listed Mix N’ Match for the UNII and E-Numbers matching.
    Open Food Facts heavily relies on crowdsourcing and we are thus Wikidata contributors.
    Feel free to think of us if you encounter food or cosmetics related data, or if you can create a food related game :-)

    https://www.wikidata.org/wiki/Wikidata:WikiProject_Food
    http://world.openfoodfacts.org

  8. Pierre Says:

    I’ve also noticed that SOC Occupation Code (2010) (P919) didn’t have much data (https://tools.wmflabs.org/wikidata-todo/translate_items_with_property.php?prop=919), while a canonical list from the Bureau of Labor Statistics (thus PD and freely reusable, cf http://www.bls.gov/bls/linksite.htm) does exist (we’re talking 7K items).

    http://www.bls.gov/soc/#materials

    The data is tabulated as follows:

    SOC Codes and Job Titles

    2010 SOC Code | 2010 SOC Title                  | 2010 SOC Direct Match Title
    11-1011       | Chief Executives                | CEO
    11-1011       | Chief Executives                | Chief Executive Officer
    11-1011       | Chief Executives                | Chief Operating Officer
    11-1011       | Chief Executives                | Commissioner of Internal Revenue
    11-1021       | General and Operations Managers | Department Store General Manager

    It is available at: http://www.bls.gov/soc/soc_2010_direct_match_title_file.xls (with more related files at http://www.bls.gov/soc/#materials)

    What I think could be nice is adding the SOC codes based on the 2010 SOC Direct Match Title, since so many outside services and pages use the SOC codes (which seem to be a requirement for a lot of job offers in the US).

    I’ve already added manually the French equivalent of the SOC code in Wikidata, so being able to match national codes and job titles through Wikidata would be cool.

  9. Stefano Costa Says:

    I’m quite late here, but first of all thanks for creating these tools!

    Is there any way to add other catalogues to those supported by Mix-and-match?

  10. Sannita Says:

    Hey, your posts about Mix’n’match are interesting, may I translate them into Italian? I was thinking of creating “a how-to for dummies”, but I can’t find any license or indication about copyright… so may I use your texts? :)

    Cheers!
