At-risk content on Flickr

Flickr has recently announced that it will be cutting back storage for its free accounts; as of early 2019, they will be limited to 1000 images, and any files beyond that limit will be progressively deleted.

Personally speaking, this surprised me a little bit, because I’d forgotten they’d removed the 200-image limit a few years ago. I am generally quite comfortable with the idea of them imposing a capacity limit and charging to go beyond that; it’s a fair way to price your service, and ultimately, it has to be paid for. But retroactive deletion is a bit unfortunate (especially if handled as an abrupt guillotine).

A few people raised the reasonable question – how much material is now at risk? A huge chunk of Wikimedia Commons material is sourced from Flickr (imported under free licenses) and, in addition, there is the reasonably successful Flickr Commons program for image hosting from cultural institutions.

Looking at the 115 Flickr Commons accounts shows that there are ~480,000 images from the 54 Pro accounts, and ~6,450,000 from the 61 non-Pro accounts. This seems a very dramatic difference, but on closer examination the British Library and Internet Archive (both non-Pro accounts) make up the vast majority of this, with ~6,350,000 images, mostly extracts from digitized book images. Flickr have since stated that Flickr Commons accounts will not be affected (it will be interesting to see if they now expand the program to include many of the other institutional accounts).

For “normal” users, it’s a bit harder to be sure. Flickr state that “the overwhelming majority of Pros have more than 1,000 photos on Flickr, and more than 97% of Free members have fewer than 1,000”. But from the Commons perspective, what we really want to know is “what proportion of the kind thing we want to import is at risk?” Looking at this type of material is potentially quite interesting – it goes beyond the simple “Flickr as a personal photostore” and into “Flickr as a source of the cultural commons”.

So, analysis time! I pulled a list of all outbound links from Commons. For simplicity, I didn’t try to work out which of these were links from file pages as opposed to navigational/maintenance/user pages, but a quick sanity-check suggests that the vast majority of pages with outbound Flickr links are file descriptions – something like 99.7% – so it seems reasonable to just take the whole lot. I then extracted any flickr userIDs I could find, either in links to author profiles or in image URLs themselves, (eg 12403504@N02), and deduplicated the results so we ended up with a pile of userID-page pairs. The deduplication was necessary because a raw count of links can get quite confusing – some of the Internet Archive imports can have 20-30 links per file description page, and one of the British Library map maintenance pages has 9500…

One critical omission here is that I only took “raw” userIDs, not pretty human-readable ones (like “britishlibrary”); this was for practical reasons because I couldn’t easily link the two together. Many items are only linked with human-readable labels in the URLs, but ~96% of pages with an outbound Flickr link have at least one identifiable userID on them, so hopefully the remaining omissions won’t skew the results too much. (I also threw out any group IDs at this point to avoid confusion.)

I used this to run two analyses. One was the most frequently used userIDs – this was the top 5021 userIDs in our records, any ID that had links from approximately ~80 pages or more. The other was a random sample of userIDs – 5000 randomly selected from the full set of ~79000. With each sample, I used the number of links on Commons as a proxy for the number of images (which seems fair enough).

Among the most frequently used source accounts, I found that 50% of images came from Pro accounts, 35% from “at risk” free accounts (more than 1000 images), 3% from “safe” free accounts (under 1000 images), 11% from Flickr Commons (both pro & non-Pro), and 1% were from accounts that are now deactivated or have no images.

In the random sample, I found a somewhat different spread – 60% of images were from Pro accounts, 32% from “at risk” free accounts, 6% from “safe” free accounts, 2% Flickr Commons, and 0.25% missing.

Update: an extended sample of all accounts with ten or more links (19374 in total) broadly resembles the top 5000 – 49% Pro accounts, 35% “at risk” free accounts, 4.5% “safe” free accounts, 10% Flickr Commons accounts, and 1.5% missing.

So, some quick conclusions –

  • Openly-licensed material gathered from Flickr is a significant source for Commons – something like 7.5m file description pages link to Flickr, almost certainly as a source, about 15% of all files
  • A substantial amount of material sourced from Flickr comes from a relatively small number of accounts, some institutional and some personal (this was the most common one in my random sample – 58k images)
  • A substantial portion of our heavily used Flickr source accounts are potentially at risk (note that it is not possible to tell how many were once Pro, have lapsed because why bother when it’s free, and may resume paying)
  • It is not as catastrophic as it might at first appear – the samples all suggest that only about a third of potential source images are at risk, once the Flickr Commons accounts are exempted from the limits – which seems to be the plan.
  • Having said that, the figure of 97% of individual free accounts having under a thousand images is no doubt accurate, but probably masks the sheer number of images in many of the larger accounts.

Some things that would potentially still be very interesting to know –

  • What proportion of freely-licensed images are from at-risk accounts?
  • What proportion of images in at-risk accounts are actually freely-licensed?
  • What proportion of freely-licensed images on Flickr have (or could) be transferred over to Commons?
  • Are Flickr Commons accounts exempt from the size restriction? (As there are only ~150 of them, this seems plausible as a special case…)

Mechanical Curator on Commons

The internet has been very enthralled by the British Library’s recent release of the Mechanical Curator collection: a million public-domain images extracted from digitised books, put online for people to identify and discover. The real delight is that we don’t know what’s in there – the images have been extracted and sorted by a computer, and human eyes may never have looked at them since they were scanned.

Image taken from page 171 of '[Seonee, or, camp life on the Satpura Range ... Illustrated by the author, etc.]'

I wasn’t directly involved with this – it was released after I left – but it was organised by former colleagues of mine, and I’ve worked on some other projects with the underlying Microsoft Books collection. It’s a great project, and all the more so for being a relatively incidental one. I’m really, really delighted to see it out there, and to see the outpouring of interest and support for it.

One of the questions that’s been asked is: why put them on Flickr and not Commons? The BL has done quite a bit of work with Wikimedia, and has used it as the primary way of distributing material in the past – see the Picturing Canada project – and so it might seem a natural home for a large release of public domain material.

The immediate answer is that Commons is a repository for, essentially, discoverable images. It’s structured with a discovery mechanism built around knowing that you need a picture of X, and finding it by search or by category browsing, which makes metadata essential. It’s not designed for serendipitous browsing, and not able to cope easily with large amounts of unsorted and unidentified material. (I think I can imagine the response were the community to discover 5% of the content of Commons was made up of undiscoverable, unlabelled content…) We have started looking at bringing it across, but on a small scale.

Putting a dump on archive.org has much the same problem – a lack of functional discoverability. There’s no way to casually browse material here, and it relies very much on metadata to make it accessible. If the metadata doesn’t exist, it’s useless.

And so: flickr. Flickr, unlike the repositories, is designed for casual discoverability, for browsing screenfuls of images, and for users to easily tag and annotate them – things that the others don’t easily offer. It’s by far the best environment of the three for engagement and discoverability, even if probably less useful for long-term storage.

This brings the question: should Commons be able to handle this use case? There’s a lot of work being done just now on the future of multimedia: will Commons in 2018 be able to handle the sort of large-scale donation that it would choke on in 2013? Should we be working to support discovery and description of unknown material, or should we be focusing on content which already has good metadata?