All Our Digital Eggs

In December, Google discontinued its Google Research Datasets service. The idea behind the service was great: Google provided scientists who needed to share very large datasets with storage space in the Google cloud of servers. The decision to cut the service is part of a larger belt-tightening effort following an alarming 68% drop in the company’s fourth-quarter profit from the previous year. I don’t blame Google for taking this action, but it is nonetheless a jarring example of how dangerous it can be to put all of your data eggs in one basket.

It’s great to see researchers and others in the public sector sharing more and more of their data. Trouble is, most of the data they’re sharing exist on a single server, housed either on-site or by third parties like Google’s now defunct service or Amazon.com Public Data Sets. The problem with this approach is that when servers crash, companies decide to drop their services, or political winds change, the data disappear forever.

Our ancestors made this same mistake during the third century BC with the creation of the Library of Alexandria. The Library was charged with collecting all of the world’s knowledge, which it accomplished with monetary and other support from the royalty of the time. When the Library was destroyed, most of the source copies of much of the world’s documented knowledge vanished along with it.

Will we repeat this mistake, or is there a better way?

Of course there is. The humble public library models one of the most impressive methods for providing open access to information while also preserving knowledge. Since the early 20th century, the U.S. public library system has distributed thousands of copies of millions of books to libraries throughout the country. These books are accessible for free to any citizen, regardless of education or any other consideration. Even when books leave the collection through theft, vandalism, loss, or even organized book burnings (which still happen, BTW), the far-flung distribution of multiple copies of each book makes it difficult to destroy all instances. Furthermore, the interlibrary loan system means that you can walk into any individual library and access almost any book: even if that particular library does not have it in stock, it will be transferred from another library for your use within a short period of time.

Can we say the same thing for the availability of public data online? Most of the time, the answer is no. It is not enough for a public agency to publish raw data on its server in the hopes that others will download and access them through APIs. While this is an excellent first step, society in general and technologists specifically need to learn from the public library system and create a similar digital infrastructure to preserve and increase the availability of our combined human knowledge base.

Rhiza Labs uses an open architecture called the Information Commons, which provides the means to massively replicate data across a distributed system of servers. This distributed system relies on servers, or peers, that are owned and operated by separate entities, ensuring that no one company, organization or agency controls all of the data. If one server goes down, recovery is simple since the data are replicated elsewhere in the Information Commons. This architecture also addresses security, proper attribution of data, the preservation of intellectual property status, and many other concerns that most organizations have when faced with sharing their data more broadly.
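
For the programmers in the audience, here is a rough sketch of the general idea of replicating data across independently owned peers. It is a toy illustration only, not the actual Information Commons code or API: the `Peer` and `Commons` classes, their methods, and the sample dataset are all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """A hypothetical server run by one independent organization."""
    owner: str
    records: dict = field(default_factory=dict)


class Commons:
    """Toy stand-in for a network of replicating peers (illustrative only)."""

    def __init__(self, peers):
        self.peers = list(peers)

    def publish(self, key, value, source):
        # Store a separate copy, with its attribution, on every peer.
        for peer in self.peers:
            peer.records[key] = {"value": value, "source": source}

    def fetch(self, key):
        # Return the first copy found on any peer.
        for peer in self.peers:
            if key in peer.records:
                return peer.records[key]
        raise KeyError(key)

    def recover(self, failed):
        # Rebuild a wiped peer from the copies held by the other peers.
        for peer in self.peers:
            if peer is not failed:
                for key, record in peer.records.items():
                    failed.records.setdefault(key, dict(record))


peers = [Peer("university"), Peer("city agency"), Peer("nonprofit")]
commons = Commons(peers)
commons.publish("air-quality-2008", [41, 38, 52], source="city agency")

peers[0].records.clear()  # simulate one server losing everything
print(commons.fetch("air-quality-2008")["source"])  # prints "city agency"
commons.recover(peers[0])  # the wiped peer gets its copy back
```

Because every peer holds its own copy along with the record of who contributed it, losing any single server costs nothing permanent. A real system would also need networking, access control, and conflict handling, all of which this sketch leaves out.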

You don’t have to be a programmer to make use of the Information Commons. All of Rhiza’s products use this architecture as their database, which means that all of the organizations using Rhiza’s products are able to reuse and share their data, even though their specific web applications are running on different servers — kind of like the interlibrary loan system.

Let’s stop putting all of our data eggs in one basket. There’s no reason not to: the technology to store our data in the digital equivalent of a modern library, instead of the Library of Alexandria, already exists. We can use this technology to ensure that public data flow freely to everyone who wants to access them, and to preserve them for all time.

We at Rhiza are looking to join with others who share our enthusiasm for continuing to build the Information Commons. Leave a comment here or contact us directly and let us know what you think.