Flickr has recently announced that it will be cutting back storage for its free accounts; as of early 2019, they will be limited to 1000 images, and any files beyond that limit will be progressively deleted.
Personally speaking, this surprised me a little bit, because I’d forgotten they’d removed the 200-image limit a few years ago. I am generally quite comfortable with the idea of them imposing a capacity limit and charging to go beyond that; it’s a fair way to price your service, and ultimately, it has to be paid for. But retroactive deletion is a bit unfortunate (especially if handled as an abrupt guillotine).
A few people raised the reasonable question – how much material is now at risk? A huge chunk of Wikimedia Commons material is sourced from Flickr (imported under free licenses) and, in addition, there is the reasonably successful Flickr Commons program for image hosting from cultural institutions.
Looking at the 115 Flickr Commons accounts shows that there are ~480,000 images from the 54 Pro accounts, and ~6,450,000 from the 61 non-Pro accounts. This seems a very dramatic difference, but on closer examination the British Library and Internet Archive (both non-Pro accounts) make up the vast majority of this, with ~6,350,000 images, mostly extracts from digitized book images. Flickr have since stated that Flickr Commons accounts will not be affected (it will be interesting to see if they now expand the program to include many of the other institutional accounts).
For “normal” users, it’s a bit harder to be sure. Flickr state that “the overwhelming majority of Pros have more than 1,000 photos on Flickr, and more than 97% of Free members have fewer than 1,000”. But from the Commons perspective, what we really want to know is “what proportion of the kind thing we want to import is at risk?” Looking at this type of material is potentially quite interesting – it goes beyond the simple “Flickr as a personal photostore” and into “Flickr as a source of the cultural commons”.
So, analysis time! I pulled a list of all outbound links from Commons. For simplicity, I didn’t try to work out which of these were links from file pages as opposed to navigational/maintenance/user pages, but a quick sanity-check suggests that the vast majority of pages with outbound Flickr links are file descriptions – something like 99.7% – so it seems reasonable to just take the whole lot. I then extracted any flickr userIDs I could find, either in links to author profiles or in image URLs themselves, (eg 12403504@N02), and deduplicated the results so we ended up with a pile of userID-page pairs. The deduplication was necessary because a raw count of links can get quite confusing – some of the Internet Archive imports can have 20-30 links per file description page, and one of the British Library map maintenance pages has 9500…
One critical omission here is that I only took “raw” userIDs, not pretty human-readable ones (like “britishlibrary”); this was for practical reasons because I couldn’t easily link the two together. Many items are only linked with human-readable labels in the URLs, but ~96% of pages with an outbound Flickr link have at least one identifiable userID on them, so hopefully the remaining omissions won’t skew the results too much. (I also threw out any group IDs at this point to avoid confusion.)
I used this to run two analyses. One was the most frequently used userIDs – this was the top 5021 userIDs in our records, any ID that had links from approximately ~80 pages or more. The other was a random sample of userIDs – 5000 randomly selected from the full set of ~79000. With each sample, I used the number of links on Commons as a proxy for the number of images (which seems fair enough).
Among the most frequently used source accounts, I found that 50% of images came from Pro accounts, 35% from “at risk” free accounts (more than 1000 images), 3% from “safe” free accounts (under 1000 images), 11% from Flickr Commons (both pro & non-Pro), and 1% were from accounts that are now deactivated or have no images.
In the random sample, I found a somewhat different spread – 60% of images were from Pro accounts, 32% from “at risk” free accounts, 6% from “safe” free accounts, 2% Flickr Commons, and 0.25% missing.
Update: an extended sample of all accounts with ten or more links (19374 in total) broadly resembles the top 5000 – 49% Pro accounts, 35% “at risk” free accounts, 4.5% “safe” free accounts, 10% Flickr Commons accounts, and 1.5% missing.
So, some quick conclusions –
- Openly-licensed material gathered from Flickr is a significant source for Commons – something like 7.5m file description pages link to Flickr, almost certainly as a source, about 15% of all files
- A substantial amount of material sourced from Flickr comes from a relatively small number of accounts, some institutional and some personal (this was the most common one in my random sample – 58k images)
- A substantial portion of our heavily used Flickr source accounts are potentially at risk (note that it is not possible to tell how many were once Pro, have lapsed because why bother when it’s free, and may resume paying)
- It is not as catastrophic as it might at first appear – the samples all suggest that only about a third of potential source images are at risk, once the Flickr Commons accounts are exempted from the limits – which seems to be the plan.
- Having said that, the figure of 97% of individual free accounts having under a thousand images is no doubt accurate, but probably masks the sheer number of images in many of the larger accounts.
Some things that would potentially still be very interesting to know –
- What proportion of freely-licensed images are from at-risk accounts?
- What proportion of images in at-risk accounts are actually freely-licensed?
- What proportion of freely-licensed images on Flickr have (or could) be transferred over to Commons?
- Are Flickr Commons accounts exempt from the size restriction? (As there are only ~150 of them, this seems plausible as a special case…)