Open Data in HTML

Elias Torres, IBM, eliast@us.ibm.com

Introduction
In these Web 2.0-3.0 days, data publishers are increasingly expected to offer their data through APIs, yet there is still no universally agreed way to encode and query that data. There are many ways of encoding information structure and semantics (XML, RDF, JSON, YAML, etc.), but all of them suffer the same illness: they are unsuitable for human presentation. Because of this, most formats end up being transformed into HTML for human consumption, and in the process they give up their structure and semantics in order to achieve an elegant presentation. But why do we need to bother with data in HTML in the first place? The main reason is so that humans (with the help of machines) can better navigate, locate, and benefit from the volume, diversity, and level of detail of information on the Web today. And so we find ourselves here in 2007, exploring several mechanisms to recharge our HTML pages with semantics, in order to create a meaningful Web for humans to consume in an open fashion. But first, we must agree on the design principles that will help us compare our options.

Design Principles for Open Data in HTML
We don’t expect to find a single technology that succeeds in all of these facets, nor should we. They are simply important principles to keep in mind when choosing a mechanism for embedding semantic data in HTML and for extracting it again.

Authoritative data
Whatever the particulars of a technique, we want to end up with machine-readable data that is meant to be equivalent to what shows up on the corresponding human-friendly Web page. We want the techniques that get us to that point to ensure the fidelity of this relationship; we want to be able to trust that the data really corresponds to what we read on the Web page.
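
A minimal sketch of the difference (the price class name is a made-up vocabulary and the figures are invented for illustration):

  <!-- The price a human reads and the price a machine extracts are the same
       text node, so they cannot silently drift apart. -->
  <p>The book costs <span class="price">$25.00</span>.</p>

  <!-- Here the machine-readable value is hidden, and nothing stops it from
       quietly diverging from what the page actually says. -->
  <p>The book costs $25.00.</p>
  <span class="price" style="display: none">$19.99</span>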

Expressivity and extensibility
If we use one technique to create a Web page with both a human-readable and a machine-readable version of directions to your home, we’d rather not need a different technique to add the weather forecast to the same Web page. We hope that this criterion will help minimize the number of software components involved in any particular application, which in turn increases the robustness and maintainability of the application.
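
As a rough sketch, an attribute-based syntax along the lines of RDFa lets a single technique carry both kinds of data; the geo namespace below is the W3C WGS84 vocabulary, while the weather namespace and its property names are invented purely for illustration:

  <div xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
       xmlns:weather="http://example.org/weather#">
    <p>My house is at
       <span property="geo:lat">42.37</span>,
       <span property="geo:long">-71.11</span>.</p>
    <!-- Adding the forecast later requires no new technique, only a new vocabulary. -->
    <p>Tomorrow will be
       <span property="weather:conditions">sunny</span> with a high of
       <span property="weather:high">72°F</span>.</p>
  </div>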

Don’t repeat yourself (DRY)
If the same datum is referenced twice on a Web page, we don’t want to write it down twice. Repetition leads to inconsistencies when changes are required, and we don’t want to jump through too many hoops to create Web pages that both humans and computers can understand. This does not necessarily apply to repetition within multiple data representations, if those representations are generated from a single data store.
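
A microformats-flavored sketch: the vcard, fn, and org class names come from hCard, while the surrounding sentence is invented. The organization name is written down exactly once and serves both the human reader and the parser:

  <!-- "IBM" appears only once; changing it changes both the human-readable
       and the machine-readable versions at the same time. -->
  <div class="vcard">
    <span class="fn org">IBM</span> employs the author of this article.
  </div>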

Data locality
Multiple representations of the same data within the same document should be kept in as self-contained a unit of the document as possible. For example, if one paragraph within a large Web page is the only part of the page that deals with the ingredients of a recipe, then we want the technique to confine all of the machine-readable data about those ingredients to that paragraph. Data locality allows us to copy both the human- and machine-oriented representations of the same content at once.
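
A sketch of what that might look like, using a hypothetical class-name vocabulary rather than any published format:

  <!-- Everything a machine needs to know about the ingredients lives inside
       this one paragraph, so copying the paragraph copies both the human-
       and the machine-oriented representation at once. -->
  <p class="ingredients">
    You will need <span class="ingredient"><span class="quantity">2</span>
    <span class="item">eggs</span></span> and
    <span class="ingredient"><span class="quantity">1 cup</span> of
    <span class="item">flour</span></span>.
  </p>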

Existing content fidelity
We want the techniques to work without requiring authors to rewrite their Web sites. The more that a technique can make use of existing clues about the intended (machine-consumable) meaning of a Web page, the better. But a caveat: techniques shouldn’t be so liberal with their interpretations that they ascribe incorrect meanings to existing Web pages.
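
A small sketch of both the opportunity and the caveat, using HTML’s own address element (the contact details are taken from the byline above):

  <!-- Existing markup a technique could interpret without the author changing
       anything: the address element marks up contact information for the
       page's author... -->
  <address>Elias Torres, eliast@us.ibm.com</address>
  <!-- ...but it is not necessarily a postal address, so a parser that treated
       every address element as one would be ascribing a meaning the author
       never intended. -->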