Quality versus age of Wikipedia’s Featured Articles

April 16th, 2010

There’s been a brief flurry of interest on Wikipedia in this article, published last week:

Evaluating quality control of Wikipedia’s feature articles – David Lindsey.

…Out of the Wikipedia articles assessed, only 12 of 22 were found to pass Wikipedia’s own featured article criteria, indicating that Wikipedia’s process is ineffective. This finding suggests both that Wikipedia must take steps to improve its featured article process and that scholars interested in studying Wikipedia should be careful not to naively believe its assertions of quality.

A recurrent objection to this has been that Lindsey didn’t take account of the age of the articles. Age matters for two reasons: article quality can degrade over time, since the average new contribution to an article that began at a high level is likely to fall below the quality of the rest; and the stringency of what constitutes “featured” has itself changed over time.

The interesting thing is, this partly holds and partly doesn’t. The article helpfully “scored” the 22 articles reviewed on a fairly arbitrary ten-point scale; the average was seven, which I’ve taken as the cut-off point for acceptability. If we graph quality against time – time being measured in days since an article last passed through the “featuring” process, either at its first promotion or at a later review – then we get a striking result:

Here, I’ve divided them into two groups; blue dots are those with a rating greater than 7, and thus acceptable; red dots are those with a rating lower than 7, and so insufficient. It’s very apparent that these two cluster separately; if an article is good enough, then there is no relation between the current status and the time since it was featured. If, however, it is not good enough, then there is a very clear linear relationship between quality and time. The trendlines aren’t really needed to point this out, but I’ve included them anyway; note that they share a fairly similar origin point.
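For anyone who wants to redo the split and the trendlines, here is a minimal sketch in Python. It assumes the data is a list of (days since last featured, score) pairs; the function names and the numbers in the demo are invented placeholders, not Lindsey's scores. The text above doesn't say which group a score of exactly seven belongs to, so the sketch leaves those articles out of both groups.

```python
from statistics import mean


def split_by_score(articles, cutoff=7):
    """Split (days, score) pairs into those above and those below the cutoff.

    Assumption: a score exactly equal to the cutoff goes in neither group,
    because the post does not say where such an article belongs.
    """
    above = [(d, s) for d, s in articles if s > cutoff]
    below = [(d, s) for d, s in articles if s < cutoff]
    return above, below


def fit_line(points):
    """Ordinary least-squares line through (x, y) pairs.

    Returns (slope, intercept) for y = slope * x + intercept.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT the scores from the paper.
    placeholder = [(100, 8), (500, 9), (900, 8), (200, 6), (400, 5), (600, 4)]
    above, below = split_by_score(placeholder)
    print("above cutoff:", fit_line(above))
    print("below cutoff:", fit_line(below))
```

The intercept of each line is its value at day zero, so comparing the two intercepts is one way to check the "similar origin point" observation against the real figures.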

Two hypotheses could explain this. Firstly, the quality when first featured varies sharply over time, but most older articles have been brought up to “modern standards”. Secondly, the quality when first featured is broadly consistent over time, and most articles remain that level, but some decay, and that decay is time-linked.

I am inclined towards the second. If it were the first, we would expect to see some older articles which were “partially saved” – say, one passed when the average scoring was three, and then “caught up” when the average scoring was five. This would skew the linearity of the red group, and make it more erratic – but, no, no sign of that. We also see that the low-quality group has no members older than about three years (1100 days); this is consistent with a sweeper review process which steadily goes through old articles looking for bad ones, and weeding out or improving the worst.
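One rough way to put a number on "erratic" is the R² of a straight-line fit through the low-scoring group: close to 1 means the points sit on a line, and a "partially saved" article would pull it down. This is a sketch of that check rather than anything done for the graph above; the function name is hypothetical and the points in the demo are made up for illustration.

```python
def r_squared(points):
    """R-squared of the least-squares line through (x, y) pairs.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take at least two distinct values")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Invented points, NOT the real scores: a clean decay, and the same decay
    # with one older article that has been partly brought back up.
    clean_decay = [(200, 6), (400, 5), (600, 4)]
    with_rescue = clean_decay + [(800, 6)]
    print(r_squared(clean_decay))
    print(r_squared(with_rescue))
```

With only a handful of articles in the group this is a sanity check, not a significance test.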

(The moral of the story? Always graph things. It is amazing what you spot by putting things on a graph.)

So what would this hypothesis tell us? Assuming our 22 are a reasonable sample – which can be disputed, but let’s grant it – the data is entirely consistent with all of them being of approximately the same quality when they first became featured. So we can forget about it being a flaw in the review process; it’s likely to be a flaw in the maintenance process.

Taking our dataset, the population of featured articles falls into two classes.

  • Type A – quality is consistent over time, even up to four years (!), and they comply with the standards we aim for when they’re first passed.

  • Type B – quality decays steadily with time, leaving the article well below FA status before even a year has passed.

For some reason, we are doing a bad job of maintaining the quality of about a third of our featured articles; why, and what distinguishes Type B from Type A? My first guess was user activity, but no – of the seven Type B articles, in only one case has the user who nominated it effectively retired from the project.

Could it be contentiousness? Perhaps. I can see why Belarus and Alzheimer’s Disease may be contentious and fought-over articles – but why Tōru Takemitsu, a well-regarded Japanese composer? We have a decent-quality article on global warming, and you don’t get more contentious than that.

It could be timeliness – an article on a changing topic can be up-to-date in 2006 and horribly dated in 2009 – which would explain the problem with Alzheimer’s, but it doesn’t explain why some low-quality articles are on relatively timeless topics – Takemitsu or the California Gold Rush – and some high-quality ones are on up-to-date material such as climate change or the Indian economy.

There must be something linking this set, but I have to admit I don’t know what it is.

We would be well-served, I think, to take this article as having pointed up a serious problem of decay, and start looking at how we can address that, and how we can help maintain the quality of all these articles. Whilst the process for actually identifying a featured article at a specific point in time seems vindicated – I am actually surprised we’re not seeing more evidence of lower standards in the past – we’re definitely doing our readers a disservice if the articles rapidly drop below the standards we advertise them as holding.
