Mechanical Curator on Commons

The internet has been enthralled by the British Library’s recent release of the Mechanical Curator collection: a million public-domain images extracted from digitised books, put online for people to identify and discover. The real delight is that we don’t know what’s in there – the images have been extracted and sorted by a computer, and human eyes may never have looked at them since they were scanned.

Image taken from page 171 of '[Seonee, or, camp life on the Satpura Range ... Illustrated by the author, etc.]'

I wasn’t directly involved with this – it was released after I left – but it was organised by former colleagues of mine, and I’ve worked on some other projects with the underlying Microsoft Books collection. It’s a great project, and all the more so for being a relatively incidental one. I’m really, really delighted to see it out there, and to see the outpouring of interest and support for it.

One of the questions that’s been asked is: why put them on Flickr and not Commons? The BL has done quite a bit of work with Wikimedia, and has used it as the primary way of distributing material in the past – see the Picturing Canada project – and so it might seem a natural home for a large release of public domain material.

The immediate answer is that Commons is a repository for, essentially, discoverable images. It’s structured with a discovery mechanism built around knowing that you need a picture of X, and finding it by search or by category browsing, which makes metadata essential. It’s not designed for serendipitous browsing, and not able to cope easily with large amounts of unsorted and unidentified material. (I think I can imagine the response were the community to discover 5% of the content of Commons was made up of undiscoverable, unlabelled content…) We have started looking at bringing it across, but on a small scale.

Putting a dump on archive.org has much the same problem – a lack of functional discoverability. There’s no way to casually browse material here, and it relies very much on metadata to make it accessible. If the metadata doesn’t exist, it’s useless.

And so: Flickr. Flickr, unlike the repositories, is designed for casual discoverability, for browsing screenfuls of images, and for users to easily tag and annotate them – things that the others don’t easily offer. It’s by far the best environment of the three for engagement and discoverability, even if probably less useful for long-term storage.

This raises the question: should Commons be able to handle this use case? There’s a lot of work being done just now on the future of multimedia: will Commons in 2018 be able to handle the sort of large-scale donation that it would choke on in 2013? Should we be working to support discovery and description of unknown material, or should we be focusing on content which already has good metadata?

8 thoughts on “Mechanical Curator on Commons”

  1. Yes, it should. Many sources of glorious images have limited metadata; especially those that are already under a free license. And bulk contribution, annotation, browsing, and discoverability are important for maintaining and reusing a large repository in either case.

    Thanks for this writeup and background, and kudos to the BL for the amazing project.

  2. Thank you very much for this, Andrew, as it highlights a major problem with contemporary digital archiving: the tenacious assumption that standardised annotation and classification, metadata or records, is essential to discovery and use. What Flickr, and even Google, have known since their inception is that this is not the case. The curatorial creation of meta-records aids storage and management, i.e. discovery and use within very small and specialist communities of practice, but not discovery and use by a broad general public.
    Well done on the piece and well done BL.

  3. To reiterate some points I’ve made on Wikimedia mailing lists:

    The images contain metadata, which could be used for categorisation, at the book level.

    The whole point of the wiki model is that we make incremental steps towards completion.

    An analogy could be drawn with Wikipedia’s “stub” articles.

    It’s not good for us to lobby institutions to release media, and then decline to accept it.

    I would have liked the release to have been direct to Commons, rather than Flickr; at least, I would have liked the opportunity to debate whether to accept it. I hope that the next time such a release is being considered, we will be in a better position to facilitate the former.

    Such uploads could be done to Commons by a script (or ‘bot’) which also adds categories and templates, including templated metadata. One such category, for the project, would be permanent. Another, say “Unidentified BL project images” (itself a sub-category of “unidentified images”), would be refined through crowdsourcing. For example, a drawing of a flower might be first moved to “unidentified plants” and later to “unidentified dahlias”.

    This crowdsourcing might be by experienced editors, using Commons’ ‘HotCat’ tool and/or manual editing. Or it might be through a tool like the recent OAuth experiments by Magnus Manske, where a user is shown a random image, and asked to select one of a small number of options, following a logic tree.

    The first options might be “animal”, “vegetable” or “mineral”. If the user selects, say, the first, the image category could be changed to “unidentified animals”, or the user could elect to contribute to a second level, say “humans”, “other mammals”, “birds”, “fish”, “other”, and so on.

    This could be available also as a mobile app, and could include voting other users’ answers up or down, and some form of competitive point-scoring based on that.

    The BBC have also done some interesting work, using en.Wikipedia article titles as tags, with autocomplete, effectively treating the titles as a folksonomy, for tagging programmes in the World Service archive. I also think there’s mileage in adopting the model (and code, if they’d open source it) used in the BBC/Public Catalogue Foundation project, for crowd-sourcing the tagging of paintings.

    If anyone is interested in developing this idea, I’d be prepared to put in some time, with a view to applying for a WMF grant to complete it.

  4. Klaus: the books aren’t being hidden, it’s just that the system which makes them available isn’t as good as it might be :-)

    For an example of how they’re distributed, have a look at this catalogue record:

    http://explore.bl.uk/primo_library/libweb/action/search.do?vl(freeText0)=%22BLL01014809406%22&vid=BLVU1&fn=search

    which links to http://access.bl.uk/item/pdf/lsidyv3af941e4 (warning – 127 MB PDF) which is freely downloadable.

    I know this system was broken a few days ago (overloaded?) and I don’t know if it’s working for all items, but there’s definitely some level of access.

  5. Andy: I don’t really see this as us lobbying for content and then refusing it; rather, I was approached and asked if Commons would be a good home for this material in its current form, and said – no, not as the first public release. Flickr has certain advantages over Commons for the BL’s desired use case, not least that I was worried as to how the community would have reacted!

    (Having crowdsourcing tools would certainly help us make much better use of it, but we’d need to pause to build those and, in the interim, we’d still have a large and challenging body of unlabelled images to deal with.)

    However, now it’s up, there’s nothing stopping us ingesting it from Flickr rather than as a direct upload – and my feeling is that if we selectively harvest from Flickr as the identification process is going on, we’ll get a much better result than had we dumped it straight to Commons.

  6. (I’m aware that we – collectively – are having this conversation across multiple sites.)

    I’m confident I would find it easier and more convenient to go through, say, the 150 images at http://www.flickr.com/photos/britishlibrary/tags/sysnum004158128 if they had been on Commons in categories called, say, “Noble (1831)” and “Mechanical Curator – uncategorised”, and to re-categorise them using tools like HotCat, than to try to upload them to Commons myself and then do so. And if I spend time tagging them on Flickr, which is also cumbersome, given their horrible new interface, at least on my slow netbook, there seems as yet to be no tool to import those tags, as categories, to Commons.

    I propose that we arrange a bot import of, say, a thousand or two images, from a section of books, categorise them as in my above example, and see what issues arise when we do so, and try to more tightly categorise them.
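
As a rough illustration of the logic-tree idea raised in comment 3, here is a minimal Python sketch of how an image's category might be narrowed one answer at a time. It is not the design of any existing Commons tool: the `CATEGORY_TREE` table and the `options_for` and `refine` helpers are hypothetical, and category names such as “Unidentified minerals” or “Unidentified humans” are invented for the example; only the animal/vegetable/mineral split and the plants-to-dahlias step come from the discussion above.

```python
# Hypothetical logic tree: current category -> {answer shown to user: narrower category}.
# Only a few branches are filled in; names not mentioned in the thread are assumptions.
CATEGORY_TREE = {
    "Unidentified BL project images": {
        "animal": "Unidentified animals",
        "vegetable": "Unidentified plants",
        "mineral": "Unidentified minerals",
    },
    "Unidentified animals": {
        "humans": "Unidentified humans",
        "other mammals": "Unidentified mammals",
        "birds": "Unidentified birds",
        "fish": "Unidentified fish",
    },
    "Unidentified plants": {
        "dahlia": "Unidentified dahlias",
    },
}


def options_for(category):
    """Return the answers a user could be offered for an image in this category."""
    return list(CATEGORY_TREE.get(category, {}))


def refine(category, answer):
    """Return the narrower category for the chosen answer.

    Raises ValueError if the answer is not one of the options for this
    category, so a bad click never moves an image somewhere unexpected.
    """
    options = CATEGORY_TREE.get(category, {})
    if answer not in options:
        raise ValueError(f"{answer!r} is not an option for {category!r}")
    return options[answer]


if __name__ == "__main__":
    # Walk the flower example: project category -> plants -> dahlias.
    category = "Unidentified BL project images"
    for answer in ("vegetable", "dahlia"):
        category = refine(category, answer)
    print(category)
```

A real tool would also need to write the new category back to Commons and reconcile conflicting answers from different users; both are left out of this sketch.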
