Wikidata and identifiers – part 2, the matching process

November 27th, 2014

Yesterday, I wrote about the work we’re doing matching identifiers into Wikidata. Today, the tools we use for it!

Mix-and-match

The main tool we’re using is a beautiful thing Magnus developed called mix-and-match. It imports all the identifiers with some core metadata – for the ODNB, for example, this was names and dates and the brief descriptive text – and sorts them into five groups:

  • Manually matched – these matches have been confirmed by a person (or imported from data already in Wikidata);
  • Automatic – the system has guessed these are probably the same people but wants human confirmation;
  • Unmatched – we have no idea who these identifiers match to;
  • No Wikidata – we know there is currently no Wikidata match;
  • N/A – this identifier shouldn’t match to a Wikidata entity (for example, it’s a placeholder, a subject Wikidata will never cover, or a cross-reference with its own entry).

The goal is to work through everything and move as much as possible to “manually matched”. Anything in this group can then be migrated over to Wikidata with a couple of clicks. Here’s the ODNB as it stands today:

(Want to see what’s happening with the data? The recent changes link will show you the last fifty edits to all the lists.)

So, how do we do this? Firstly, you’ll need a Wikipedia account, and to log in to our “WiDaR” authentication tool. Follow the link at the top of the mix-and-match page (or, indeed, this one), sign in with your Wikipedia account if requested, and you’ll be authorised.

On to the matching itself. There are two methods – manually, or in a semi-automated “game mode”.

How to match – manually

The first approach works line-by-line. Clicking on one of the entries – here, unmatched ODNB – brings up the first fifty entries in that set. Each one has options on the left-hand side – to search Wikidata or English Wikipedia, either by the internal search or Google. On the right-hand side, there are three options – “set Q”, to provide it with a Wikidata ID (these are all of the form “Q” followed by a number, and so we often call them “Q numbers”); “No WD”, to list it as not on Wikidata; and “N/A”, to record that it’s not appropriate for Wikidata matching.
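
If you find yourself doing a lot of these lookups, the same kind of search the tool’s links perform can be scripted against the public Wikidata API. A minimal Python sketch, purely for illustration (it isn’t part of mix-and-match itself):

    import requests

    # Look a name up via the standard wbsearchentities API call and print the
    # candidate matches - roughly what the "search Wikidata" link does.
    def search_wikidata(name, limit=5):
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbsearchentities",
                "search": name,
                "language": "en",
                "limit": limit,
                "format": "json",
            },
        )
        for hit in resp.json().get("search", []):
            print(hit["id"], "-", hit.get("label", ""), "-", hit.get("description", ""))

    # For example: search_wikidata("Roy Castle")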

If you’ve found a match on Wikidata, the ID number should be clearly displayed at the top of that page. Click “set Q” and paste it in. If you’ve found a match via Wikipedia, you can click the “Wikidata” link in the left-hand sidebar to take you to the corresponding Wikidata page, and get the ID from there.

After a moment, it’ll display a very rough-and-ready precis of what’s on Wikidata next to that line –

– which makes it easy to spot if you’ve accidentally pasted in the wrong code! Here, we’ve identified one person (with rather limited information currently in Wikidata – just gender and date of death) and marked another as definitely not found.
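
The same rough-and-ready check can be scripted, if you prefer: a hedged Python sketch that pulls an item’s label and description from the API, handy for confirming that a pasted Q number points where you think it does:

    import requests

    # Fetch an item's English label and description via wbgetentities - a
    # quick sanity check on a Q number before (or after) matching it.
    def wikidata_summary(qid):
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbgetentities",
                "ids": qid,
                "props": "labels|descriptions",
                "languages": "en",
                "format": "json",
            },
        )
        entity = resp.json()["entities"][qid]
        label = entity.get("labels", {}).get("en", {}).get("value", "(no label)")
        desc = entity.get("descriptions", {}).get("en", {}).get("value", "(no description)")
        return f"{qid}: {label} - {desc}"

    # For example: print(wikidata_summary("Q42"))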

If you’re using the automatically matched list, you’ll see something like this:

– it’s already got the data from the possible matches but wants you to confirm. Clicking on the Q-number will take you to the provisional Wikidata match, and from there you can get to relevant Wikipedia articles if you need further confirmation.

How to match – game mode

We’ve also set up a “game mode”. This is suitable when we expect a high number of the unmatched entries to be connectable to Wikipedia articles; it gives you a random entry from the unmatched list, along with a handful of possible results from a Wikipedia search, and asks you to choose the correct one if it’s there. You can get to it by clicking [G] next to the unmatched entries.

Here’s an example, using the OpenPlaques database.

In this one, it was pretty clear that their Roy Castle is the same as the first person listed here (remember him?), so we click the blue Q-number; it’s marked as matched, and the game generates a new entry. Alternatively, we could look him up elsewhere and paste the Q-number or Wikipedia URL in, then click the “set Q” button. If our subject’s not here – click “skip” and move on to the next one.

Finishing up

When you’ve finished matching, go back to the main screen and click the [Y] at the end of the list. This allows you to synchronise the work you’ve done with Wikidata – it will make the edits to Wikidata under your account. (There is also an option to import existing matches from Wikidata, but at the moment the mix-and-match database is a bit out of synch and this is best avoided…) There’s no need to do this if you’re feeling overly cautious, though – we’ll synchronise them soon enough. The same page will also report any cases where two distinct Wikidata entries have been matched to the same identifier, which (usually) shouldn’t happen.

If you want a simple export of the matched data, you can click the [D] link for a TSV file (Q-number, identifier, identifier URL & name if relevant), and some stats on how many matches to individual wikis are available with [S].
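
If you then want to do something with that export programmatically, it’s plain tab-separated text. A short Python sketch – the filename is a placeholder, and the column layout assumed is the one described above, so check your own download:

    import csv

    # Read a mix-and-match [D] export: Q-number, identifier, then URL and
    # name where present. "odnb_matches.tsv" is a made-up filename.
    with open("odnb_matches.tsv", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            qid, identifier = row[0], row[1]
            url = row[2] if len(row) > 2 else ""
            name = row[3] if len(row) > 3 else ""
            print(qid, identifier, url, name, sep="\t")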

Brute force

Finally, if you have a lot of matched data, and you are confident it’s accurate without needing human confirmation, then you can adopt the brute-force method – QuickStatements. This is the tool used for pushing data from mix-and-match to Wikidata, and can be used for any data import. Instructions are on that page – but if you’re going to use it, test it with a few individual items first to make sure it’s doing what you think, and please don’t be shy to ask for help…
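
To give a flavour of what QuickStatements consumes: its basic input is one statement per line, tab-separated as item, property, value. A hedged Python sketch that generates such lines – the Q-number and identifier are placeholders, and while P1415 should be the ODNB identifier property, do verify that before importing anything:

    # Generate QuickStatements input lines: item <TAB> property <TAB> value.
    # Both values below are made-up placeholders, not real matches.
    matches = [("Q12345", "101099999")]  # (Wikidata item, ODNB identifier)

    for qid, odnb_id in matches:
        print(f'{qid}\tP1415\t"{odnb_id}"')  # string values go in double quotes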

So, we’ve covered a) what we’re doing; and b) how we get the information into Wikidata. Next instalment, how to actually use these identifiers for your own purposes…

Wikidata identifiers and the ODNB – where next?

November 26th, 2014

Wikidata, for those of you unfamiliar with it, is the backend we are developing for Wikipedia. At its simplest, it’s a spine linking together the same concept in different languages – so we can tell that a coronation in English matches Tacqoyma in Azeri or Коронація in Ukrainian, or any of thirty-five other languages besides. This all gets bundled up into a single data entry – the enigmatically named Q209715 – which then gets other properties attached. In this case, a coronation is a kind of (or subclass of, for you semanticians) “ceremony” (Q2627975), and is linked to a few external thesauruses. The system is fully multilingual, so we can express “coronation – subclass of – ceremony” in English as easily as “kroning – undergruppe af – ceremoni” in Danish.
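
If you’d like to see what that looks like under the hood, the statement itself is stored as bare item-to-item links, with the labels held separately per language. A minimal Python sketch against the public API, reading the “subclass of” (P279) claims on Q209715:

    import requests

    # Read the "subclass of" (P279) statements on Q209715 (coronation); the
    # statement is just item IDs - language only enters via the labels.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "Q209715",
            "props": "claims",
            "format": "json",
        },
    )
    entity = resp.json()["entities"]["Q209715"]
    for claim in entity["claims"].get("P279", []):
        target = claim["mainsnak"]["datavalue"]["value"]["id"]
        print("Q209715 -> P279 ->", target)  # should include Q2627975, "ceremony"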

So far, so good.

There has been a great deal of work around Wikipedia in recent years in connecting our rich-text articles to static authority control records – confirming that our George Washington is the same as the one the Library of Congress knows about. During 2012-13, these were ingested from Wikipedia into Wikidata, and as of a year ago we had identified around 420,000 Wikidata entities with authority control identifiers. Most of these were from VIAF, but around half had an identifier from the German GND database, another half had one from ISNI, and a little over a third had LCCN identifiers. Many had all four (and more). We now support matching to a large number of library catalogue identifiers, but – speaking as a librarian – I’m aware this isn’t very exciting to anyone who doesn’t spend much of their time cataloguing…
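
By way of illustration, here’s a hedged Python sketch reading those authority-control identifiers off a single item – Q23 is George Washington, and the property numbers used (P214 for VIAF, P227 for GND, P213 for ISNI, P244 for LCCN) are the standard ones, though worth double-checking:

    import requests

    # Print the authority-control identifiers attached to one Wikidata item.
    AUTHORITY_PROPS = {"P214": "VIAF", "P227": "GND", "P213": "ISNI", "P244": "LCCN"}

    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": "Q23",
                "props": "claims", "format": "json"},
    )
    claims = resp.json()["entities"]["Q23"]["claims"]
    for prop, name in AUTHORITY_PROPS.items():
        for claim in claims.get(prop, []):
            print(name, "=", claim["mainsnak"]["datavalue"]["value"])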

So, the next phase was to move beyond simply “authority” identifiers and on to ones that actually provide content. The main project that I’ve been working on (along with Charles Matthews and Magnus Manske, with the help of Jo Payne at OUP) is matching Wikidata to the Oxford Dictionary of National Biography – Wikipedia authors tend to hold the ODNB in high regard, and many of our articles already use it as a reference work. We’re currently about three-quarters of the way through, having identified around 40,000 ODNB entries that have been clearly matched to a Wikidata entity, and the rest should be finished some time in 2015. (You can see the tool here, and how to use it will be a post for another day.) After that, I’ve been working on a project to make links between Wikidata and the History of Parliament (with the assistance of Matthew Kilburn and Paul Seaward) – I’m looking forward to being able to announce some results from this soon.

What does this mean? Well, for a first step, it means we can start making better links to a valuable resource on a more organised basis – for example, Robin Owain and I recently deployed an experimental tool on the Welsh Wikipedia that will generate ODNB links at the end of any article on a relevant subject (see, eg, Dylan Thomas). It means we can start making the Wikisource edition of the (original) Dictionary of National Biography more visible. It means we can quickly generate worklists – you want suitable articles to work on? Well, we have all these interesting and undeniably notable biographies not yet covered in English (or Welsh, or German, or…)

For the ODNB, it opens up the potential for linking to other interesting datasets (and that without having to pass through Wikidata – all this can be exported). At the moment, we can identify matches to twelve thousand ISNIs, twenty thousand VIAF identifiers, and – unexpectedly – a thousand entries in IMDb. (Ten of them are entries for “characters”, which opens up a marvellous conceptual can of worms, but let’s leave that aside…)

And for third parties? Well, this is where it gets interesting. If you have ODNB links in your dataset, we can generate Wikipedia entries (probably less valuable, but in oh so many languages). We can generate images for you – Wikidata knows about openly licensed portraits for 214,000 people. Or we can crosswalk to whatever other project we support – YourPaintings links, perhaps? We can match a thousand of those. It can go backwards – we can take your existing VIAF links and give you ODNB entries. (Cataloguers, take note.)

And, best of all, we can ingest that data – and once it’s in Wikidata, the next third party to come along can make the links directly to you, and every new dataset makes the existing ones more valuable. Right now, we have a lot of authority control data, but we’re lighter on serious content links. If you have a useful online project with permanent identifiers, and you’d like to start matching those up to Wikidata, please do get in touch – this is really exciting work and we’d love to work with anyone wanting to help take it forward.

Update: Here’s part 2: on how to use the mix-and-match tool.

Something Must Be Done

October 12th, 2014

G. K. Chesterton, in The Flying Inn, 1914:

…this chaotic leader was followed by quite a considerable mass of public correspondence. The people who write to newspapers are, it may be supposed, a small, eccentric body, like most of those that sway a modern state. But at least, unlike the lawyers, or the financiers, or the members of Parliament, or the men of science, they are people of all kinds scattered all over the country, of all classes, counties, ages, sects, sexes, and stages of insanity. The letters that followed Hibbs’s article are still worth looking up in the dusty old files of his paper.

(…)

And then, last but the reverse of least, there plunged in all the people who think they can solve a problem they cannot understand by abolishing everything that has contributed to it. We all know these people. If a barber has cut his customer’s throat because the girl has changed her partner for a dance or donkey ride on Hampstead Heath, there are always people to protest against the mere institutions that led up to it. This would not have happened if barbers were abolished, or if cutlery were abolished, or if the objection felt by girls to imperfectly grown beards were abolished, or if the girls were abolished, or if heaths and open spaces were abolished, or if dancing were abolished, or if donkeys were abolished. But donkeys, I fear, will never be abolished.

There were plenty of such donkeys in the common land of this particular controversy. Some made it an argument against democracy, because poor Garge was a carpenter. Some made it an argument against Alien Immigration, because Misysra Ammon was a Turk. Some proposed that ladies should no longer be admitted to any lectures anywhere, because they had constituted a slight and temporary difficulty at this one, without the faintest fault of their own. Some urged that all holiday resorts should be abolished; some urged that all holidays should be abolished. Some vaguely denounced the sea-side; some, still more vaguely, proposed to remove the sea. All said that if this or that, stones or sea-weed or strange visitors or bad weather or bathing machines were swept away with a strong hand, this which had happened would not have happened. They only had one slight weakness, all of them; that they did not seem to have the faintest notion of what had happened.

The referendum: the same dilemma on both sides

September 10th, 2014

So, the referendum is well into the final stretch. With the polls pretty much neck-and-neck (and switching around!), the No camp has formally announced what’s been kicked around for a long time: vote No to independence, get much more devolution. (It’s not really a secret that this is what most people would have voted for all along if given a three-way fight, but it’s odd to think that there’s no status quo in the middle any more.)

The Yes campaign’s response is, not unreasonably, best approximated by “yeah, right, sure you will”.

This has the strange effect of inverting the dynamic of the campaign, or at least the dynamic of the campaign I get to see, which is people arguing online. (Thanks to a chain of decisions that seemed minor at the time, I ended up living in Cambridge and so don’t have a vote.)

Last week, the No camp were saying “look, Salmond is winging it; he’s promising all these things he’ll negotiate for, and he’s confident he’ll get them, but… really? He still has to negotiate.”

This week, the Yes camp are saying “look, Westminster are winging it; they’re promising all these things they’ll do, and they’re swearing blind they will, but… really? They still have to actually do it.”

It’s all a bit cyclic.

We’re left with a strange impasse. The vote has to be made without knowing what the outcome of post-referendum negotiations are going to be – we know that Salmond will follow through with what he says he plans to negotiate for, we just don’t know whether he’ll achieve much of it. But it also has to be made without knowing whether the other side will honour their promises for devolution – there’s no question that a joint position of the three parties can deliver almost any program, if they actually follow through.

So one side dearly want to push through their program, but may not have the power to do so; the other side undoubtedly have the power to carry out theirs, but may not really want to. We’ll find out the answer to one (and only one) after the vote. But if either win and don’t get what they’re saying now, it’ll be a mess. Two pigs, two pokes.

In some ways, it comes down to which bit of the British establishment you are more cynical about: if you think the civil service are going to be exceptionally hard-headed negotiators and Salmond won’t get far, then voting No makes sense; if you think the three party leaders are going to turn around and pull the rug out as soon as they get the chance, then voting Yes might seem safer.

Laws on Wikidata

September 9th, 2014

So, I had the day off, and decided to fiddle a little with Wikidata. After some experimenting, it now knows about:

  • 1516 Acts of the Parliament of the United Kingdom (1801-present)
  • 194 Acts of the Parliament of Great Britain (1707-1800)
  • 329 Acts of the Parliament of England (to 1707)
  • 20 Acts of the Parliament of Scotland (to 1707)
  • 19 Acts of the Parliament of Ireland (to 1800)

(Acts of the modern devolved parliaments for NI, Scotland, and Wales will follow.)

Each has a specific “instance of” property – Q18009569, for example, is “act of the Parliament of Scotland” – and is set up as a subclass of the general “act of parliament”. At the moment, there are detailed subclasses for the UK and Canada (which has a separate class for each province’s legislation) but nowhere else. Yet…

These numbers are slightly fuzzy – the matching is mainly based on Wikipedia articles, and so there are a small handful of cases where the entry represents a particular clause (eg Q7444697, s.4 and s.10 of the Human Rights Act 1998), or cases where multiple statutes are treated in the same article (eg Q1133144, the Corn Laws), but these are relatively rare and, mostly, it’s a good direct correspondence. (I’ve been fairly careful to keep out oddities, but of course, some will creep in…)

So where next? At the moment, these almost all reflect Wikipedia articles. Only 34 have a link to (English) Wikisource, though I’d guess there’s about 200-250 statutes currently on there. Matching those up will definitely be valuable; for legislation currently in force and on the Statute Law Database, it would be good to be able to crosslink to there as well.
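
For anyone wanting to build that kind of matching list programmatically, a hedged Python sketch follows – it uses the Wikidata Query Service SPARQL endpoint (a newer service than anything described in this post) and the Q18009569 class mentioned above:

    import requests

    # List every item that is an "instance of" (P31) act of the Parliament of
    # Scotland (Q18009569), via the Wikidata Query Service.
    query = """
    SELECT ?act ?actLabel WHERE {
      ?act wdt:P31 wd:Q18009569 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "acts-matching-sketch/0.1 (example)"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["act"]["value"], "-", row["actLabel"]["value"])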

Conservation science: open access might not be endangered after all

September 5th, 2014

I was very struck to see this paper this morning: Fuller, R. A., J. R. Lee, and J. E. M. Watson. 2014. “Achieving open access to conservation science“. Conservation Biology 28. doi:10.1111/cobi.12346.

Conservation science is a crisis discipline in which the results of scientific enquiry must be made available quickly to those implementing management. We assessed the extent to which scientific research published since the year 2000 in 20 conservation science journals is publicly available. Of the 19,207 papers published, 1,667 (8.68%) are freely downloadable from an official repository. Moreover, only 938 papers (4.88%) meet the standard definition of open access in which material can be freely reused providing attribution to the authors is given. This compares poorly with a comparable set of 20 evolutionary biology journals, where 31.93% of papers are freely downloadable and 7.49% are open access.

These headline numbers seemed very disappointing – but, after some examination, it seems that the real figure may be substantially higher. Open access isn’t dead yet.

The authors’ definition of “open access” is given as “full” BOAI open-access – that is to say, the final published version made available with minimal restrictions on reuse, usually marked with the CC-BY license or something functionally equivalent. This is not my preference, but fairly reasonable given that “free access” is also considered.

However, their definition of “free access” is substantially more restrictive than the usual “green open access” (free to read but with limited reuse rights). It only covers articles made freely available as the version of record “from the journal’s official archive”:

If we were able to download a paper freely from the journal’s official archive from a private computer not linked to a university network but it did not conform to our definition of open access, we classified it as freely available. Such papers either had additional restrictions attached to them (e.g., excluding commercial reuse or the production of derivatives) or retained all rights and had simply been made freely available online temporarily or permanently by the license holder. We classified all remaining articles as subscription access.

This is a fairly specific requirement. Everything else was deemed unavailable, with an acknowledgement that some might be found in preprint servers:

We did not include access to journal articles via pre-print servers because these do not represent the final published version of the manuscript and can be hard for nonspecialists to navigate, although it is worth noting that preprint servers such as arXiv.org are major repositories of information in several disciplines including physics and mathematics and could play a role in access to conservation science if conservation articles reached a critical mass in such repositories.

Treating this as a divide between “journal archives” and “pre-print servers” entirely omits institutional repositories, which provide a significant amount of green open access material – in most disciplines, substantially more than is available through preprint servers. It will inevitably lead to a significant undercount of the amount of material available to the reader. Unfortunately, the paper’s abstract uses the phrase “freely downloadable from an official repository” – implying that repositories are covered by the scope of the study. (I had to read the paper twice to check they weren’t.)

The concern about desiring the “final published version” is fair, but a) most people are satisfied with some form of the text, and b) many copies available from repositories are in fact the final published version of the text. This varies by publisher and title, but I have dealt with papers in both Oryx and Environmental Conservation, both on their shortlist, and know that Cambridge permits posting of the version of record in both subject and institutional repositories.

Finally, a substantial amount of “informal open access” exists, with copies available through authors’ own websites, research group sites, semi-public networks such as ResearchGate, etc. While these may not always be entirely legitimate, they represent a very substantial number of papers. One study found that around 48% of papers published in 2008 had become available on a “free to read somewhere online” basis by 2012, if such informal sources were included.

Put together, it is clear that the 8.68% of “freely downloadable” papers omits a substantial amount of material which could be available to the non-subscribed reader through various means. How much? I don’t know, but I strongly suspect it’s at least as many again…

The problem with 38 Degrees

August 4th, 2014

A few years ago, I did something using the 38 Degrees website – I forget what exactly, but I think it was a handy form for an irritated letter to a politician. It was a Labour minister, which dates this!

Over the next few years, I got a series of emails from them, culminating in the point when I noticed with some surprise that they were referring to me as part of “their movement” and I unsubscribed with some irritation. It seems they’re still at it; looking at their website, we find the remarkable claim that they have over 2.5 million members (which would make them the third largest organisation in the country).

According to their FAQ:

The only requirement of membership is to take an action, as simple as signing a petition, or attending an event.

And looking at a recent petition (on a topic dear to my heart…) we find a very carefully prepared option: you can opt in to receiving emails from the organisation behind the campaign, but you are automatically signed up to be on the 38 Degrees mailing list. The note here says:

Your personal information will be kept private and held securely. By submitting information you are agreeing to 38 Degrees keeping you informed about campaigns and agree to the use of cookies. privacy policy

Note what it does and doesn’t say. So, the system runs like this:

  • you sign a petition;
  • you are automatically enrolled on a mailing list, with the stated aim of “keeping you informed about campaigns” (section 3(g) of the privacy statement);
  • you are automatically considered a member of an organisation, despite this appearing nowhere on the signup or privacy page;
  • this organisation then claims the legitimacy of “2.5 million members”.

The only way to get out of being “a member” is to notice this and unsubscribe. I can’t help but feel there’s something fundamentally disingenuous about this approach, and it leaves me with a pretty bad taste in my mouth.

Update: it seems that (at least as of 2013) 38 Degrees do send welcome-to-the-movement emails. (They certainly didn’t in 2010). It’s something, I suppose, but it still feels insufficient.

Mechanical Curator on Commons

December 15th, 2013

The internet has been enthralled by the British Library’s recent release of the Mechanical Curator collection: a million public-domain images extracted from digitised books, put online for people to identify and discover. The real delight is that we don’t know what’s in there – the images have been extracted and sorted by a computer, and human eyes may never have looked at them since they were scanned.

Image taken from page 171 of '[Seonee, or, camp life on the Satpura Range ... Illustrated by the author, etc.]'

I wasn’t directly involved with this – it was released after I left – but it was organised by former colleagues of mine, and I’ve worked on some other projects with the underlying Microsoft Books collection. It’s a great project, and all the more so for being a relatively incidental one. I’m really, really delighted to see it out there, and to see the outpouring of interest and support for it.

One of the questions that’s been asked is: why put them on Flickr and not Commons? The BL has done quite a bit of work with Wikimedia, and has used it as the primary way of distributing material in the past – see the Picturing Canada project – and so it might seem a natural home for a large release of public domain material.

The immediate answer is that Commons is a repository for, essentially, discoverable images. It’s structured with a discovery mechanism built around knowing that you need a picture of X, and finding it by search or by category browsing, which makes metadata essential. It’s not designed for serendipitous browsing, and not able to cope easily with large amounts of unsorted and unidentified material. (I think I can imagine the response were the community to discover 5% of the content of Commons was made up of undiscoverable, unlabelled content…) We have started looking at bringing it across, but on a small scale.

Putting a dump on archive.org has much the same problem – a lack of functional discoverability. There’s no way to casually browse material here, and it relies very much on metadata to make it accessible. If the metadata doesn’t exist, it’s useless.

And so: Flickr. Flickr, unlike the repositories, is designed for casual discoverability, for browsing screenfuls of images, and for users to easily tag and annotate them – things that the others don’t easily offer. It’s by far the best environment of the three for engagement and discoverability, even if probably less useful for long-term storage.

This brings the question: should Commons be able to handle this use case? There’s a lot of work being done just now on the future of multimedia: will Commons in 2018 be able to handle the sort of large-scale donation that it would choke on in 2013? Should we be working to support discovery and description of unknown material, or should we be focusing on content which already has good metadata?

Not all encyclopedias are created equal

August 3rd, 2013

Wikipedia has some way to go before it can comprehensively replace the great Britannica in all its many roles. From Shackleton’s South, a passage in which he and his crew are stranded on a drifting ice-floe in the Weddell Sea, November 1915:

In addition to the daily hunt for food, our time was passed in reading the few books that we had managed to save from the ship. The greatest treasure in the library was a portion of the “Encyclopaedia Britannica.” This was being continually used to settle the inevitable arguments that would arise. The sailors were discovered one day engaged in a very heated discussion on the subject of Money and Exchange. They finally came to the conclusion that the Encyclopaedia, since it did not coincide with their views, must be wrong.

“For descriptions of every American town that ever has been, is, or ever will be, and for full and complete biographies of every American statesman since the time of George Washington and long before, the Encyclopaedia would be hard to beat. Owing to our shortage of matches we have been driven to use it for purposes other than the purely literary ones though; and one genius having discovered that the paper, used for its pages had been impregnated with saltpetre, we can now thoroughly recommend it as a very efficient pipe-lighter.”

We also possessed a few books on Antarctic exploration, a copy of Browning and one of “The Ancient Mariner.” On reading the latter, we sympathized with him and wondered what he had done with the albatross; it would have made a very welcome addition to our larder.

Young Cree man, 1902

June 26th, 2013

Most of the Picturing Canada images are of historic rather than aesthetic value, but here’s a really standout portrait I spotted today:

Cree Indian (HS85-10-13885)

A young Cree man, name unrecorded; probably taken in Alberta or Saskatchewan, 1902. A little fragment of history.