The EC Open Data Portal (Beta)

The European Commission's new data portal went live very quietly on Christmas Eve, and people have been tweeting about it ever since. I thought I ought to take a look and see what's what.

The creation of the portal is very good news. There's a load of data there already, the majority of it coming from Eurostat (understandably enough), and the fact that the EC is enthusiastically joining the open data movement is to be warmly welcomed. It's a big step towards greater transparency.

So let's follow a few links from the portal and see what we find. I'll start by simply clicking 'Data' from the main navigation.

Partial screenshot of http://open-data.europa.eu/open-data/data/ (all screenshots taken 4 Jan 2013).

Looking at the Top Publishers list, if I click on Eurostat I see a list of their data sets beginning with "Quarterly cross-trade road freight transport by type of transport (1 000 t, Mio Tkm)"

Partial screenshot of http://open-data.europa.eu/open-data/data/publisher/estat

The description of what that means is "Quarterly cross-trade road freight transport by type of transport (1 000 t, Mio Tkm)" - i.e. the same text again. OK, when you get a load of data together for your launch, you’re bound to be short on metadata so I guess it's not surprising that we have the same text doubling as both title and description, especially on datasets prepared before the portal was in operation.

And the links to the data itself? Well there are three labelled with their MIME types - two of which are the same, i.e. we have:

What we actually have is the data in three formats: TSV, DFT and SDMX-ML. These are made explicit as titles on the links which you see on a desktop browser when you mouse-over them.

The first few data points in the TSV file of Quarterly cross-trade road freight transport by type of transport (1 000 t, Mio Tkm)

The data itself is, well, it's data; you can't expect it to mean a lot on its own. You need to be able to interpret it and present it to the audience, and the less specialist the audience, the more work you have to do to interpret it. The SDMX zip contains two XML files. The first describes the structure of the second, which actually contains the data, but even here there is a distinct paucity of metadata. In order to be able to interpret this data I need to know more.
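
A quick way to see what one of these archives holds without unpacking it by hand is to list its members. This is only a sketch: the archive is built in memory so the example runs on its own, and the two member names are invented for illustration, not the real Eurostat file names.

```python
import io
import zipfile


def list_members(zip_bytes):
    """Return (name, uncompressed size in bytes) for each file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def make_demo_zip():
    """Build a stand-in archive with two XML members (hypothetical names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("example.dsd.xml", "<Structure/>")   # structure file
        archive.writestr("example.sdmx.xml", "<DataSet/>")    # data file
    return buffer.getvalue()


for name, size in list_members(make_demo_zip()):
    print(name, size)
```

With a real download you would pass the bytes of the zip file to list_members instead of the demo archive.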

OK, I've picked on a sample of 1. What about other data sets? It seems that all the Eurostat data is published in this way, i.e. those three formats, each one wrapped in a zip file with precious little metadata.

Let's look at DG CNECT's two data sets.

Partial screenshot of http://open-data.europa.eu/open-data/data/publisher/cnect

Ooh, that looks promising…

Mouse-over the links and…

As previous but with mouse over the first link

The first link (labelled CSV, XLS, SQL, RDF) actually takes you to a Web page with more info and links to the actual data files. The second goes to "A Visualisation tool for selected indicators of the Digital Agenda Scoreboard". So has DG CNECT done loads of work on all its data? Well, not really. The Digital Agenda Scoreboard work has been done by the (large) LOD2 project.

The other data set in the beta portal is data about "ICT research projects under EU-FP7" which is available in XLS (Excel) and PDF. The mouseover text for the Excel file is helpful:

Excel file with two tables: one for data, the other for metadata. A row/record is generated for each partner/organization participating in a project. Projects have multiple partners and an organization can be the partner of multiple projects. This unique table facilitate the consultation of the data but contains redundant information. Four basic tables could be extracted to re-create a relational database: a participations-table linking partners ID to projects ID through the associated EC funding, a projects-table listing their characteristics, a partners-table listing their characteristics, a geo-table with geographical information related to partner legal city.

That's a good attempt at providing metadata that can help you understand what the data means. It's presented as a title on a hyperlink in a Web page so hardly easy to discover but this is a beta - so let's not be too harsh.
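
The four-table split that the mouseover text describes could look roughly like the sketch below. The column names and the three sample rows are invented for illustration; the real spreadsheet's headers may well differ.

```python
def split_flat_rows(rows):
    """Split one-row-per-participation records into four simple tables."""
    projects, partners, geo, participations = {}, {}, {}, []
    for row in rows:
        # Project and partner details repeat on every row; keep one copy of each.
        projects[row["project_id"]] = {"title": row["project_title"]}
        partners[row["partner_id"]] = {"name": row["partner_name"], "city": row["city"]}
        geo[row["city"]] = {"country": row["country"]}
        # The participation links a partner to a project through the EC funding.
        participations.append({
            "project_id": row["project_id"],
            "partner_id": row["partner_id"],
            "ec_funding": row["ec_funding"],
        })
    return projects, partners, geo, participations


# Made-up sample data, not taken from the actual Excel file.
FLAT_ROWS = [
    {"project_id": "P1", "project_title": "Project One", "partner_id": "O1",
     "partner_name": "Org A", "city": "Brussels", "country": "BE", "ec_funding": 100000},
    {"project_id": "P1", "project_title": "Project One", "partner_id": "O2",
     "partner_name": "Org B", "city": "Lyon", "country": "FR", "ec_funding": 50000},
    {"project_id": "P2", "project_title": "Project Two", "partner_id": "O1",
     "partner_name": "Org A", "city": "Brussels", "country": "BE", "ec_funding": 75000},
]

projects, partners, geo, participations = split_flat_rows(FLAT_ROWS)
print(len(projects), len(partners), len(geo), len(participations))
```

The redundancy the description mentions shows up here as repeated project and partner details, which collapse to a single entry each once they are keyed by ID.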

The menu option drawing my attention though is the one that says Linked Data. This I want to explore more…

Linked Data

If I go back to the main data page I see the list of most recently updated sets includes a link to "Enterprises using Internet for interaction with public authorities (NACE Rev. 1.1)". This is interesting to me as it mentions NACE - codes for company activities as used in the Registered Organization Vocabulary.

That page offers data downloads in the same formats we saw before with Eurostat and more metadata set out in a table - and those fields look a lot like the kind of thing we're used to seeing in DCAT, Dublin Core and ADMS. This is good.

On the actual Linked Data page linked from the portal's main menu this is confirmed. There is a SPARQL endpoint about which we read:

The SparQL endpoint can be used to query all dataset metadata available on the EC Open Data Portal.
  • European Commission Open Data Portal metadata vocabulary
This metadata vocabulary (using DCAT and DCT vocabularies) is provided as a worksheet specification and as an ontology. It is aligned in general terms to be compatible with ADMS.

Sounds promising. So given that the date of modification of the particular data set I've lighted upon is 2012-03-26, this SPARQL query should find it:

prefix dcterms: <http://purl.org/dc/terms/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
select distinct ?s
where {
  ?s dcterms:modified "2012-03-26"^^xsd:date
}
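
The datatype matters in a query like that one because a triple pattern matches RDF terms exactly: a plain literal "2012-03-26" and the same string typed as xsd:date are different terms. The sketch below generates both variants, plus a looser one that compares only the lexical form, which would also catch a value stored as xsd:dateTime. The function names are illustrative, and none of this has been checked against the portal's endpoint.

```python
PREFIXES = (
    "prefix dcterms: <http://purl.org/dc/terms/>\n"
    "prefix xsd: <http://www.w3.org/2001/XMLSchema#>\n"
)


def modified_query(date, datatype=None):
    """Exact-match query; datatype is an XSD local name such as 'date', or None."""
    literal = f'"{date}"'
    if datatype:
        literal += f"^^xsd:{datatype}"
    return (
        PREFIXES
        + "select distinct ?s\nwhere {\n"
        + f"  ?s dcterms:modified {literal}\n"
        + "}"
    )


def loose_modified_query(date):
    """Match on the lexical form only, whatever datatype the store uses."""
    return (
        PREFIXES
        + "select distinct ?s\nwhere {\n"
        + "  ?s dcterms:modified ?d\n"
        + f'  filter (regex(str(?d), "^{date}"))\n'
        + "}"
    )


for query in (modified_query("2012-03-26"),
              modified_query("2012-03-26", "date"),
              loose_modified_query("2012-03-26")):
    print(query)
    print()
```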

I tried this with and without datatyping the date. I get no results. Let's try something even more general:

select distinct ?s ?p
where {
  ?s ?p "Enterprises using Internet for interaction with public authorities (NACE Rev. 1.1)"
}

Still no joy. The metadata for this particular item doesn't appear to be in the RDF data store. I tried queries for other datasets with similar results. In my attempts to find data concerning NACE codes I tried this query:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?s ?label
where {
  ?s rdfs:label ?label
  FILTER (regex(?label, "nace","i"))
}

And yay! I got a bunch of results… all concerned with the PANACEA project. Ah well…
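
The false hits are easy to reproduce: a bare "nace" pattern matches anywhere inside a word, PANACEA included. The sketch below compares it with a stricter pattern that requires a non-letter (or the start or end of the string) on either side. It uses a character class rather than \b because SPARQL's regex() follows the XPath regular expression syntax, which does not define a \b word boundary, so this form should carry over to the FILTER unchanged. The first label is a made-up stand-in for the PANACEA results.

```python
import re

# Same pattern as in the SPARQL FILTER: matches "nace" anywhere, ignoring case.
LOOSE = re.compile(r"nace", re.IGNORECASE)

# Stricter: "nace" must not have a letter directly before or after it.
STRICT = re.compile(r"(^|[^A-Za-z])nace([^A-Za-z]|$)", re.IGNORECASE)

LABELS = [
    "PANACEA project",  # hypothetical label standing in for the actual results
    "Enterprises using Internet for interaction with public authorities (NACE Rev. 1.1)",
]


def matching(pattern, labels):
    """Return the labels in which the pattern occurs."""
    return [label for label in labels if pattern.search(label)]


print(matching(LOOSE, LABELS))
print(matching(STRICT, LABELS))
```

The loose pattern returns both labels; the strict one returns only the NACE data set title.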

URIs

A curious thing to end with. The URI that gets passed around for the portal is http://open-data.europa.eu/. Why the 'open-', I wonder, when everyone else doing this uses just 'data' (data.gov, data.gov.uk, data.gouv.fr etc.)? And then it redirects to http://open-data.europa.eu/open-data/. That looks like something that might change next time there's an upgrade.

I also notice that the URI for Enterprises using Internet for interaction with public authorities (NACE Rev. 1.1) is http://open-data.europa.eu/open-data/data/dataset/LnpLjY7EqriCTVOSxDVMQ - which looks like an auto-generated URI. That's fine, but if the portal is re-built at any time, perhaps including a re-ingestion of the data, will that URI persist?

The link to one of the actual zip files containing the data is: http://open-data.europa.eu/open-data/data/dataset/LnpLjY7EqriCTVOSxDVMQ/resource/e816d7b9-5108-452b-85ba-ea27157d1853

Whatever else that URI may be, cool it ain't.
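
Pulling that resource URI apart shows what it is made of. In the sketch below the final path segment parses as a version 4 UUID, the randomly generated kind, which is consistent with (though no proof of) an identifier minted at ingestion rather than designed. The helper name is illustrative.

```python
import uuid
from urllib.parse import urlsplit

RESOURCE_URI = (
    "http://open-data.europa.eu/open-data/data/dataset/"
    "LnpLjY7EqriCTVOSxDVMQ/resource/e816d7b9-5108-452b-85ba-ea27157d1853"
)


def describe(uri):
    """Split a URI into host and path segments and test the last segment as a UUID."""
    parts = urlsplit(uri)
    segments = [segment for segment in parts.path.split("/") if segment]
    version = None
    if segments:
        try:
            version = uuid.UUID(segments[-1]).version
        except ValueError:
            version = None  # last segment is not a UUID
    return {"host": parts.netloc, "segments": segments, "uuid_version": version}


info = describe(RESOURCE_URI)
print(info["host"])
print(info["segments"])
print(info["uuid_version"])
```

Note that 'open-data' appears in both the host name and the first path segment, and that neither identifier says anything about the data set it names.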

I get the feeling that not a great deal of thought has been given to URI design in the portal. I may be wrong but that's the impression I get. And early consideration of URI persistence matters when we're talking about referring to data sets. As things stand, I have less confidence that I can link to things. See the study on this issue that I and others completed recently (broken link removed, see footnote).

Conclusion

That the EC has created its open data portal is very good news and is to be applauded. It's in beta, so there are, of course, things that need sorting out and I very much look forward to that happening. The two main issues for me are:

  1. metadata - there needs to be more of it and it needs to be more accessible to both machines and humans
  2. URI persistence matters.

I hope both issues are being addressed.