On open government data, Tim Berners-Lee is almost right

Tim Berners-Lee gave a great talk at the recent Gov 2.0 Expo in which he described the criteria for creating open and linked government data. At the beginning of his talk he laid out a star-based rating system for putting data up in machine-readable form, in open formats, as a CSV file, and so on. As with many things that Tim does, he almost completely had me until he started describing what “linked data format” is in his mind. His notion of linked data is that the values of attributes in a data table would be URLs to some web page somewhere that points to the “definitive” source of data about that thing. There are several reasons why this is incredibly short-sighted and wrong:

  1. URLs link to a specific HTML page on a specific web server. They are only as permanent as the server owner’s decision to keep that server running. We’ve all encountered “404 errors” when we go to a web page that is no longer where it used to be, and I certainly wouldn’t want vital government information that needs to persist for decades, if not centuries, to be reliant on the HTML link standard.
  2. Where’s the one definitive URL for all of the information about a city, country, or any other place for that matter? Do we really expect government agencies to solve this problem when so many have tried and failed before?
  3. URLs tend not to be good for multilingual content. Where on Wikipedia is the single definitive URL for Paris, France? If you are an English speaker, it is here: http://en.wikipedia.org/wiki/Paris but if you are a French speaker, it is here: http://fr.wikipedia.org/wiki/Paris Those are different URLs that contain different information, and there are dozens of others on just the one website that are about Paris. Which one would Tim link to?
  4. The entire system of URLs relies on the HTML syntax, which Tim invented, so it is understandable that he is partial to it. But those who care about open government data also want to ensure that it is archived so that people a thousand years from now can easily use it. No offense to Tim and his amazing accomplishment of creating the Web as we know it, but there’s no way that a URL is going to be valid in a thousand years.

So what is the right type of linked data? The answer has been in place for a long time, and just needs to be used more consistently (much in the same way that CSV as a data format should be used more consistently): unique identifiers or keys. The US government has been doing this for years. Political entities in the US all have FIPS codes, and named geographic features in the US are assigned a GNIS Feature ID. The US EPA publishes the Facility Registry System that uniquely identifies all EPA regulated facilities in the United States.

None of these identifier systems is perfect, but they do provide a common way to refer to unique entities within the context of a given agency’s data. Yes, adopting them consistently will require a lot of work, but it is without a doubt the easiest path forward that yields the best results. For extra credit, agencies should use Universally Unique Identifiers (UUIDs), which carry absolutely no semantic content, so that everyone, regardless of language or location, can share them.
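To make the UUID idea concrete, here is a minimal Python sketch of minting an identifier that carries no meaning and hanging language-specific names off it. The registry layout and the helper names `mint_identifier` and `label` are my own assumptions for illustration, not any agency’s actual scheme.

```python
import uuid


def mint_identifier():
    """Return a random (version 4) UUID as a string.

    The value encodes no name, place, or language.
    """
    return str(uuid.uuid4())


def label(registry, identifier, lang):
    """Look up the display name for an identifier in a given language."""
    return registry[identifier]["names"][lang]


# Hypothetical registry: the identifier is the key, names are just attributes.
registry = {}
munich_id = mint_identifier()
registry[munich_id] = {"names": {"en": "Munich", "de": "München"}}

print(munich_id)
print(label(registry, munich_id, "en"))
print(label(registry, munich_id, "de"))
```

The point of the sketch is that the English and German names are both attributes of one record, so neither language owns the identifier.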

Once government agencies (and the private sector too) start publishing their data with unique identifiers for common references, we will see a real ecosystem of data begin to emerge. The connectors or links in this ecosystem are the unique identifiers, not server- or location-dependent URLs.
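As a rough illustration of identifiers acting as the links, the Python sketch below joins two CSV extracts on a shared FIPS state code. The two datasets, their column names, and every count in them are invented for the example; only the state codes (06 California, 36 New York, 48 Texas) are real.

```python
import csv
import io

# Hypothetical extracts from two agencies; all counts are made up.
HEALTH_CSV = """fips,clinics
06,120
36,95
48,110
"""

TRANSIT_CSV = """fips,state_name,bus_routes
48,Texas,300
06,California,410
"""


def index_by_key(csv_text, key):
    """Parse CSV text and return a dict mapping each key value to its row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def join_on_key(left_text, right_text, key="fips"):
    """Inner-join two CSV texts on a shared identifier column."""
    left = index_by_key(left_text, key)
    right = index_by_key(right_text, key)
    joined = []
    # Only identifiers present in both datasets are kept.
    for k in sorted(left.keys() & right.keys()):
        merged = dict(left[k])
        merged.update(right[k])
        joined.append(merged)
    return joined


rows = join_on_key(HEALTH_CSV, TRANSIT_CSV)
for row in rows:
    print(row)
```

Note that the codes are kept as strings so that the leading zero in 06 survives; treating identifiers as numbers is a common way to break exactly this kind of join.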

My company, Rhiza Labs, specializes in helping government agencies, non-profits and corporations future-proof and publish their data. We’ve found that once a few simple steps are followed, there’s a huge payoff in more data being used in decision making, planning and collaboration.