Open Data — A Summer Summary

This is the long-form version of a keynote speech I gave at the Samos Summit on Open Data and Interoperability for Governance, Industry and Society. The slides aren't very informative as on this occasion I chose to use images as the main visual aid rather than a succession of bullet points. I have included more detail here than was possible in the talk but have followed the same structure.

The Samos event is the most recent in a series of open data events that I have personally attended and participated in since the end of May:

Utrecht 31 May: Open Overheid Congres
Copenhagen 6-7 June: European Data Forum
Dublin 11 June: Multilingual Web & Linked Data Workshop
Brussels 18 - 22 June
- SEMIC 2012
- Using Open Data for Policy Modelling (PMOD)
- The Digital Agenda Assembly

As Mike Wilson of STFC pointed out in Samos — those are just the ones I went to. There have been many more. Open data is a big topic and a lot of work is going into it. Not bad for a movement that started a very countable number of years ago.

In attending all those events, and participating in the W3C's various eGov activities, I notice a number of themes and challenges (the following also acts as a kind of ToC for this lengthy article):

Interoperability & multilingualism
Fear
Understanding & Empowerment
Context & responsibility
Show me the money
A feel good story to end

Interoperability & multilingualism

The statue of Pythagoras in Samos. Photo by Patrick Comerford source

When I first became involved in W3C activities, before I joined the Team, I remarked to Richard Swetenham at the European Commission how it seemed that every other person concerned with W3C was called Dan. "It's all to do with interoperability, I hear" was his memorable reply. And of course there's nothing new about interoperability: it goes back throughout history, although in her talk Fenareti Lampathaki asserted that the term itself was coined in the 1970s. Technically, interoperability is actually pretty easy. You map one person's terms to another, no doubt with some loss of detail and context along the way and a few compromises here and there, but basically it's really not that hard if you want to do it. My own work in the EU's ISA Programme is all part of making it easier.
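As a minimal sketch of what mapping one person's terms to another can look like, the following Python translates a record from one publisher's field names to another's. The field names, the mapping and the sample record are all invented for illustration and do not come from any real schema. Fields with no counterpart are reported separately rather than silently dropped, which makes the loss of detail visible.

```python
# Hypothetical mapping from one publisher's field names to another's.
# None of these names come from a real vocabulary or register.
TERM_MAP = {
    "company_name": "legalName",
    "reg_no": "registrationNumber",
    "town": "addressLocality",
}


def map_record(record, term_map):
    """Return (mapped, lost): the translated record and the unmapped fields."""
    mapped = {}
    lost = {}
    for key, value in record.items():
        if key in term_map:
            mapped[term_map[key]] = value
        else:
            # No equivalent term on the other side: detail lost in translation.
            lost[key] = value
    return mapped, lost


# Invented sample record, purely for illustration.
source = {
    "company_name": "Example Ltd",
    "reg_no": "0123456",
    "town": "Ipswich",
    "status_note": "provisional classification",
}
mapped, lost = map_record(source, TERM_MAP)
print(mapped)
print(lost)
```

Keeping the unmapped fields in a separate result is a design choice made for this sketch: it lets whoever does the mapping decide what to do about the detail that has no home on the other side.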

But interoperability goes deeper than matching terms. We're talking about national governments: organisations whose practices, laws and cultures go back centuries. And if nation A felt that nation B's way of doing things was actually better — they'd do it! There will always be work to do to achieve data interoperability simply because data is created by humans and humans have histories and national guidelines and "that's what works for us" to contend with. So please don't expect universal interoperability between nation states. And anyway, wouldn't it be boring if we actually were all the same?

And then, of course, there's the issue of language. Multilingualism is surprisingly absent even from the most high-profile vocabularies. This came into focus for me at the Multilingual Web & Linked Data Workshop. In his talk at that event, Jose Emilio Labra Gayo, Associate Professor at the University of Oviedo (who proudly lists Jose Manuel Alonso and Martín Alvarez Espinar among his former students), presented some best practices for multilingual linked data (slides PDF). I am hopeful that this is going to lead to a more formal W3C document in this area (probably a Note from the Internationalisation group) that can be referenced/included in the Government Linked Data WG's Best Practice document. But have you noticed that the key vocabularies we all rely on and re-use continually are monolingual? Ever seen a translation of the Dublin Core Metadata set?

I asked Makx Dekkers about this. He ran DCMI for many years (and is now working as an independent with me and PwC on the ISA Programme). Apparently there are more than twenty translations of Dublin Core (that is, localised labels for the URIs) but none of them are officially recognised, mainly because DCMI could never agree on a given translation. For example, I'm told there are two French translations, but they're not the same. At W3C we handle this a little differently. Many of our Recommendations are translated; for example, the RDF Primer is available in Chinese, French, Hungarian and Japanese. That may seem like a random selection of languages and it is: it's the ones that people have contributed and that W3C recognises as being of sufficient quality. However, in case of a discrepancy, the original English version is normative. That seems like an approach that could be used to increase the number of translations of things like Dublin Core, FOAF and schema.org. WDYT Dan?
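To make the "localised labels, English is normative" idea concrete, here is a small sketch in Python. The two term URIs and their English labels are real Dublin Core terms, but the table structure, the function name and the French label are assumptions made for this example, not an endorsed translation.

```python
# Illustrative label table: term URI -> language tag -> label.
# The French label is an unofficial example, not a recognised translation.
LABELS = {
    "http://purl.org/dc/terms/title": {"en": "Title", "fr": "Titre"},
    "http://purl.org/dc/terms/creator": {"en": "Creator"},
}

# Assumption for this sketch: English is the normative fallback language.
NORMATIVE_LANG = "en"


def label_for(uri, lang):
    """Return (label, lang_used), falling back to the normative language."""
    by_lang = LABELS[uri]
    if lang in by_lang:
        return by_lang[lang], lang
    return by_lang[NORMATIVE_LANG], NORMATIVE_LANG


print(label_for("http://purl.org/dc/terms/title", "fr"))
print(label_for("http://purl.org/dc/terms/creator", "fr"))
```

Returning the language actually used alongside the label lets a caller tell a contributed translation from the normative fallback.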

Fear

Fear of the unknown is a very human emotion and a conversation I had recently over a glass of wine at the Digital Agenda Assembly is typical of the way it applies to open data. I came to the standards world through working in online safety and because of that I know a number of people from film and game classification bodies. The conversation went something like this:

Me: Please make your classification data available in machine readable format, preferably as LOD with a SPARQL endpoint and provide a simple JSON API too.
Him: Anyone that wants our data can send me an e-mail and if the reasoning is good then of course they can have it.
Me: Why do I have to do that? Why can't you just publish it for anyone to use?
Him: Because you might misuse it.
Me: What's the worst I can do? Publish something on my Web site that says "this game/film/DVD is rated suitable for all ages" when in fact it's full of sex and violence and barely suitable for adults let alone children?
Him: Yes.
Me: I can do that now.

You can't stop people doing the wrong thing, but you can make it easier to do the right thing. You can make your data available directly so people can access the original, bona-fide data that you control, and it can be used on the Web. I'm very pleased to say that the conversation is ongoing and that there are signs of convergence.

When you publish data you lose control of what it's going to be used for and how it is going to be used. As Andrew Stott says: data doesn't know what it's for. It is a perfectly legitimate concern that data owners have and there is no easy answer to it. All we can do is to prove, by example, that errors in data are understood to be part of the deal. The Open Knowledge Foundation's CKAN software that is used in many portals around the world encourages data providers to make declarations about data quality. Even so, individuals have an understandable fear of being made to look stupid in public and/or in the press. And let's be honest, newspapers are not exactly shy about creating stories that make good people look bad (and bad people look worse).

If there is an answer then it's in the open data community making a sustained effort to publish a stream of success stories for the data, an idea I will return to later.

Understanding & Empowerment

Image © 2008 Melissa McKenney, cc Attribution. source

This is perhaps the area of most direct concern to the Crossover Project. Some of its research roadmap is available for comment with the rest due imminently.

What is the public saying about different topics? What is the public understanding of the issues? If open government data is one half of the equation then social media is the other half. What do people say in reaction to seeing visualisations of data, or to browsing all sorts of statistics about their area? The topics of information extraction, opinion mining and sentiment analysis are the subject of a huge amount of research. When I'm not wearing my W3C hat, I also represent i-sieve Technologies, which provides custom sentiment analysis services to ad agencies and the like, so it's a topic that I try and watch. It's the Facebook groups (which are open) and the comments on blogs and mainstream media articles that form the main feedback. Twitter isn't a lot of use in this regard as there is too little information and zero context to work with, so you really can't make a lot of sense of Tweets. But you can do named entity recognition for people and places pretty accurately, at least in English, with Wikipedia as the de facto named entity catalogue.

But is open data truly empowering? There were several papers on this topic at the Crossover workshop Using Open Data for Policy Modelling (PMOD). In his paper Including all audiences in the government loop: From transparency to empowerment through open government data (PDF), Martin Murillo of Cape Breton University said: "…transparency is not enough for the reduction of corruption as other necessary conditions must also be present: that the citizen must be able to receive available information, that the different audiences can understand such information, and that there exists a mechanism to hold the government accountable (i.e. free and fair elections and other checks and balances, generally present in a democratic system)." It is not enough to publish data: the public needs to be able to understand the information reflected in the data and have the means to effect a change. If a dictatorship publishes a load of data, that might tell you how corrupt they are, but you may not be able to do anything about it.

Katleen Janssen and Helen Darbishire raised similar issues in Using Open Data: is it really empowering? (PDF):

How can we keep the open data community from indulging in navel-gazing and assuming that the availability of data sets automatically empowers the citizens?
Should we start thinking not only about our right to open data, but also about the possible responsibilities this right brings along?
How do we ensure that governments publish relevant information in re-usable formats as part of their obligations under the right of access to information?

There remains a lot to research here and a swathe of assumptions that will need to be challenged. On which point we take a break and spend two minutes watching some of the most famous names on the Internet facing an interview panel of 10 year olds. It is probably worth noting for non-Brits that the last person interviewed is Prince Andrew, the Queen's second son (who would normally be addressed as Your Royal Highness).

[The video has been removed from YouTube, as have many things that feature Prince Andrew]

My point in showing this video is that everyone, even people as celebrated as those in this video, needs to listen to the questions being asked and to answer them appropriately, not to give an answer that, to the questioner, may be meaningless and deeply unimpressive, however accurate it may be.

This idea of ensuring that releasing the data is only part of a bigger task is captured very well by Tim Davies in his PMOD paper Supporting open data use through active engagement. This came out of a discussion at UKGovCamp back in January. I've included the basic 5 stars below but do check out Tim's paper for the background and detail.

★ Be demand driven
★ Put data in context
★ Support conversation around the data
★ Build capacity, skills & networks
★ Collaborate on data as a common resource

Context & responsibility

Developers see data sets as a black box;
really like APIs, especially if they return JSON;
Footnotes? History? Meaning?
Do developers understand the responsibility they have?
Metadata is for people you don't know (Sharon Dawes, presenting her paper A Realistic Look at Open Data at PMOD).

That tells you a lot doesn't it?

… and that's the point. As Sharon Dawes of the University at Albany pointed out at PMOD, developers often see data in general and APIs in particular as a black box. I send a query, I get back an answer and that's what I use in my application. And boy do developers love APIs that return JSON. One line of code and the browser will parse it — Bob's your uncle. But what about the footnotes in the table behind that API? Where's the meaning? Do developers understand the responsibility they have not to treat data as a black box?
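A small sketch in Python of the difference between the black-box habit and carrying the context along. The response below is entirely made up: the field names, the figure, the notes and the source are assumptions for illustration and do not reflect any real API.

```python
import json

# A made-up API response; its shape and field names are assumptions.
RESPONSE = """
{
  "area": "Exampleshire",
  "indicator": "unemployment_rate",
  "value": 7.2,
  "notes": ["estimate", "boundary changed in 2011"],
  "source": "hypothetical statistics office"
}
"""


def black_box(raw):
    """The tempting one-liner: parse the JSON and keep only the number."""
    return json.loads(raw)["value"]


def with_context(raw):
    """Keep the figure together with the caveats and source that qualify it."""
    data = json.loads(raw)
    return {
        "value": data["value"],
        "notes": data.get("notes", []),
        "source": data.get("source"),
    }


print(black_box(RESPONSE))
print(with_context(RESPONSE))
```

Both functions parse the same response with one call. The only difference is whether the footnotes survive the trip into the application.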

There's a lot more to it than that. In her paper, Sharon Dawes highlights the 637 page manual that goes with the US Census data and just how small a fraction of a Tweet is represented by the 140 characters you see on the screen. The metadata is more than dcterms:title and dcterms:creator, it's context, it's provenance, it's annotations that say "that figure right there — we're not sure about that, it's an estimate" or whatever may be the appropriate note. The Data Cube vocabulary, under further review and development in the GLD WG, provides a framework for including this in linked data but it may need some new input. Both the Austrian and Basque governments have recently developed their own vocabulary for describing data sets — and yes, they know all about DCAT and VoID. There's a tech job to be done and, more importantly, there's an educational job to be done too if the whole picture is to be conveyed. So let's remove the black background…

Developers see data sets as a black box;
really like APIs, especially if they return JSON;
Footnotes? History? Meaning?
Do developers understand the responsibility they have?
Metadata is for people you don't know (Sharon Dawes, presenting her paper A Realistic Look at Open Data at PMOD).

Keith Jeffery of STFC (who is also President of ERCIM, the European host of W3C) pointed me to his work on CERIF, which was designed for a different purpose but could well be very relevant to this work.

Reflecting on this further I suggest we can make a near direct comparison between the [government data] → [developer] → [public] chain and the [government press office] → [journalist] → [public] chain. Professional journalists are well trained and understand the need to verify facts from multiple sources. Court reporting is subject to very stringent rules. What's the equivalent qualification for developers I wonder?

Show me the money

And so to the money. There are two issues here, one of which seems to have very broad consensus: the value is in the interpretation of the data, turning raw facts into useful information. As we've discussed, that's a skilled job and skilled people should be paid to do it. The harder part of the money question is how to reconcile the fact that publishing anything costs money and therefore giving data away for free — as I personally believe must be the case for any data to be truly 'open' — is a burden.

I've had a couple of conversations around this point recently, particularly as it relates to company register information. During the Digital Agenda Assembly there was a session on open data that had Paul Farrell of the European Business Register sat alongside Chris Taggart of Open Corporates. There isn't a gap between those two, there's a chasm. In Samos we heard from Alexander Balthasar, head of the Institute for State Organisation and Administrative Reform in the Federal Chancellery of the Republic of Austria. In both cases the argument is much the same and you hear it from others too. Why should taxpayers be required to fund the publication of government data so that individuals can take it for free and make money out of it?

As I mentioned right at the start, interoperability is as much about culture and history and legal frameworks as it is about whether you record a person's multiple given names in multiple fields in a database or stick them all together in one. In Austria, if you want to get company registration data you can. All you have to do is go to court to request it, or take the shortcut and go through your Notary, pay the relevant fee and the data is yours. Whether you call that open data or not (I don't, Dr Balthasar does), that's the way it is in Austria and to change it would require not just a change in culture but an act of Parliament. It's that kind of background from multiple Member States that is behind Paul Farrell when he explains that the EBR data is not freely available (all you have to do is register and pay and… you decide if that's open data or not).

So how do we tackle this? Well, one way is to do what the Internet does best when it comes across an obstacle — go round it. This is what Open Corporates does when it screen scrapes company data rather than getting it directly from the company register. But the other way is, I think, at the heart of Nigel Shadbolt's keynote that he delivered in Copenhagen and repeated via Skype for the audience in Samos.

We need a constant stream of success stories. Proof that this works, proof that businesses can be built and that a demand can be met through the interpretation, visualisation and delivery of information based on open data. It was really good in this context to see Avi Yaeli of IBM Israel's presentation in Samos, An industry view on Open Data, which was full of such examples. The challenge, I think, is to find a way to get some of the money made back, not just to government, but to the person whose budget has to cover the cost of publishing the data. The prime example of this is the UK government's central funding of the Ordnance Survey's open data (see this article for background). We need to do more though. Imagine you're the person that has to decide whether to publish your data for free or make x number of people redundant. What would you do? Free open data does need to be paid for and yes, it should be seen as part of the cost of running a public service, but the concerns expressed around loss of revenue in that department are legitimate. I remember Roger Cutler of Chevron, a larger than life character at many W3C events until his retirement last year, constantly saying that yes, Chevron is a big company with lots of money, but W3C membership fees come out of a departmental budget and that's very limited.

A feel good story to end

Let me end with a feel good story. I didn't include this in the talk as it was being presented fully the next day but it's a good one and I want to share it.

At the Digital Agenda Assembly, Chris Taggart presented the latest review of access to company register data across Europe. Given the above it will come as no surprise that Austria does not come out well. Nor do Romania, Spain, Slovenia or Greece. But…

In Greece, all public sector contracts must be published online. If there is no online contract, there is no contract. Cool. So what Michalis Vafopoulos and his team at NTUA have done is to process those contracts and create Public Spending.gr, a service (in Greek and English) that tracks all that money (see his PMOD paper Publicspending.gr: interconnecting and visualizing Greek public expenditure following Linked Open Data directives, PDF).

Public Spending.gr offers various breakdowns and visualisations of the data, and a SPARQL endpoint. In creating this service, Michalis has derived a register of every business in Greece that has received any money at all from the public purse. And then published that data. This is very much in line with Chris Taggart's mantra of ask for forgiveness afterwards rather than permission beforehand.

What does the person in charge of the actual Greek business register think of this? Is he ready to sue NTUA from here to the Peloponnese? Oh no… he loves it and wants to add in the data about all the other companies registered in Greece too!

Make a note of that — it's a success story.

14 July David F. Flanders wrote:

A great presentation turned blog post by @philarcher1 on some of the barriers standing in the way of sharing data. Some of my 'liked' quotes include:

Technically, interoperability is actually pretty easy. You map one person's terms to another — no doubt with some loss of detail and context along the way, a few compromises here and there, but basically it's really not that hard if you want to do it. ← blog post coming on the continuum of data sharing transport protocols, e.g. CSV, Key:Value, oData/gData, RDF…

Me: Please make your classification data available in machine readable format, preferably as LOD with a SPARQL endpoint and provide a simple JSON API too.
Him: Anyone that wants our data can send me an e-mail and if the reasoning is good then of course they can have it.
Me: Why do I have to do that? Why can't you just publish it for anyone to use?
Him: Because you might misuse it.
Me: What's the worst I can do? Publish something on my Web site that says "this game/film/DVD is rated suitable for all ages" when in fact it's full of sex and violence and barely suitable for adults let alone children?
Him: Yes.
Me: I can do that now.

Phil also references Nigel Shadbolt's list of 'data hugging' characteristics, for which I decided to throw up this quick survey (more to be done with anon): http://bit.ly/DATA-FUD-SURVEY (PA - not hyperlinked, this leads to a Google doc for which you need to be logged in).

…developers often see data in general and APIs in particular as a black box. I send a query, I get back an answer and that's what I use in my application. And boy do developers love APIs that return JSON. One line of code and the browser will parse it — Bob's your uncle. But what about the footnotes in the table behind that API? Where's the meaning? Do developers understand the responsibility they have not to treat data as a black box?... Reflecting on this further I suggest we can make a near direct comparison between the [government data] → [developer] → [public] chain and the [government press office] → [journalist] → [public] chain. Professional journalists are well trained and understand the need to verify facts from multiple sources. Court reporting is subject to very stringent rules. What's the equivalent qualification for developers I wonder?

This quote is the only one I slightly disagree with, as it is stated as though it is the developer who wants to see things as black boxes. This worries me greatly given that it is developers who have the skills to interpret data programmatically (and yes they want to use these skills!), yet often managers will treat devs as if they shouldn't care about the wider context. Developers are instead seen as some kind of 'brick-layer' whose tasks should be restrained to coding more 'walls' in the manager's architecture, instead of interpreting data?! In short, the analogy goes that newspapers encourage their journalists to follow leads and look down the rabbit hole; managers of developers (on the whole) do not.

Reply

Thanks very much, David. I take your point on the black box issue. I could pass the buck and say I was repeating what I'd heard at PMOD but, OK, I chose to repeat it. I guess the general point — and one that is clear throughout the current debate — is that it is the responsibility of everyone concerned not to treat data as a black box.