The Importance of Digital Persistence

Back in November 2019 I had reason to spend a chunk of time writing a blog post about some Web history, a document that for legal reasons was only published recently. I had a lot of fun doing that and enjoyed delving back into some very old RFCs, early versions of HTML, and contacting friends and colleagues to mine their memories for more details.

What’s worth noting is that this was possible.

[Image: pages of a book burning in a small fire. Caption: Deleting digital documents is the same as book burning, an unnecessary deletion of history. Photo: Farenheit 451 by Harry McGregor, CC BY-NC, some rights reserved.]

The article has 38 hyperlinked external references. These include formal standards, academic papers, documents written for the purpose of recording historical events, and digital or digitised books. In the majority of cases, the documents are put online at stable, persistent locations that should outlive all of us. A couple of things stand out from this list, one positive, one negative.

The primary standards body for the Internet is the Internet Engineering Task Force (IETF). All their standards (known as Requests for Comments, RFCs) are online at a set of predictable URLs such as: https://tools.ietf.org/html/rfc{RFC number} and https://tools.ietf.org/rfc/rfc{RFC number}.txt. This is true right back to RFC 1 from 7 April 1969 – that’s a document from 20 years before HTTP was invented.
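As a small illustration of how predictable those locations are, here is a minimal Python sketch that builds both URL forms from an RFC number. The function name `rfc_urls` and the dictionary keys are my own invention for this example; only the two URL patterns come from the text above, and the sketch makes no claim about whether a given URL currently resolves.

```python
def rfc_urls(number: int) -> dict:
    """Build the HTML and plain-text URLs for an RFC from its number.

    Hypothetical helper: it only fills in the two URL patterns quoted
    in the post and does not contact the network.
    """
    if number < 1:
        raise ValueError("RFC numbers start at 1")
    return {
        "html": f"https://tools.ietf.org/html/rfc{number}",
        "text": f"https://tools.ietf.org/rfc/rfc{number}.txt",
    }


if __name__ == "__main__":
    # RFC 1 is the oldest document in the series.
    for kind, url in rfc_urls(1).items():
        print(kind, url)
```

The point is that nothing beyond the number is needed: no lookup table, no search, no knowledge of when the document was published.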

Those standards also point forwards and back in time to documents that the current one makes obsolete or that make it obsolete. That makes it easy to track back to the origin. See, for example, the HTTP 1.1 spec from June 1999.
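To show what tracking back can look like, here is a rough Python sketch that follows "obsoletes" links from one RFC to its predecessors. The table is typed in by hand for illustration, from the document headers as I understand them, so check the RFCs themselves before relying on it. The names `OBSOLETES` and `trace_ancestors` are hypothetical.

```python
from collections import deque

# Hand-typed, illustrative table: RFC number -> RFCs it declares obsolete.
# An assumption for this sketch, not an authoritative index.
OBSOLETES = {
    7230: [2145, 2616],
    2616: [2068],  # the June 1999 HTTP 1.1 spec
    2145: [],
    2068: [],
}


def trace_ancestors(rfc, table=OBSOLETES):
    """Return every RFC reachable by following 'obsoletes' links back in time."""
    seen = []
    queue = deque(table.get(rfc, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(table.get(current, []))
    return seen


if __name__ == "__main__":
    print(trace_ancestors(7230))
```

Because each document names its predecessors explicitly, a simple breadth-first walk like this is enough to reach the origin of a specification.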

Not only finalised standards but all drafts are online at persistent URLs, with links between the different versions, like this one for example. It didn’t get any traction at the time, but if someone thinks up the same idea in future, they can see whether anyone has done similar work before. It's been at the same URL, unchanged, since 8 June 2001. W3C has the same policy: it’s how you can find out where ideas came from and determine a minimum age for a given idea.

There was one document I really wanted to find but couldn’t: the original definition of the Common Gateway Interface, CGI, that first defined the query string for URLs. RFC 3875, which defines version 1.1, includes a link to the doc at http://hoohoo.ncsa.uiuc.edu/cgi/ but it’s a 404. I am grateful to Herbert Van de Sompel for pointing me to the Memento Web, from where copies are available, but it's not the same as the document still being where it has always been, with functioning inbound links.

Publishing digital information for long-term access is not hard but does require a deliberate policy decision to do so. I’ve written about this previously and have highlighted the benefits for future historians and others simply keen to know why things are the way they are. What’s sad, in my view, is that the number of websites that arrange their content for the long term is relatively small. Online newspapers generally do it, blog software does it out of the box, but corporate websites driven by marketing and communications departments generally don’t. “Why would anyone want to look at an old version of X?”, “We don’t want to put out of date information in front of customers” and so on.

If you’re working on any kind of reference document, be it a standard, a policy, a press release, a blog post, a record of an event – anything that people today or in 50 years’ time might point to as a record of your endeavour, I think there is a duty to plan for persistence. You can always edit old documents to say something like “this document no longer applies and is preserved here for archive purposes” – thus showing that the content is no longer in force. Others, including people yet to be born, will thank you for it.

It's been suggested that persistence is difficult to achieve because publishers inevitably update their CMS. I want to push back against that. For documents of long-term importance, use whatever tools you like to generate a standalone document and publish that. It can and should be independent of any underlying toolchain.

In the spirit of eating my own dog food, this website does, of course, have a persistence policy. I'm 57. Come back in 40 years' time and see if this page is still here (I won't be) and, more importantly, how many of the references in that post from November are still valid. My guess is most will be, but, sadly, not all.