Many XML languages are defined in two steps, the first in terms of a mapping from XML documents to an abstract data model, the second by defining the meaning of the constituents of the abstract data model with respect to some domain. One obvious example is (X)HTML+CSS, where the first step is from document to nested boxes with properties, and the second is a set of claims that boxes+properties make on renderings. Another is W3C XML Schema, which explicitly separates the mapping from schema documents to schemas on the one hand from the schema-validation semantics of the schema components which make up schemas on the other. A third is RDF as notated with XML, where the distinction is manifest in the official title of the original RDF Recommendation: “RDF Model and Syntax”.
These three examples exhibit an increasing distance between the structure of the XML document on the one hand and the corresponding constituents of the abstract data model on the other: for HTML the correspondence is very nearly one-to-one; for W3C XML Schema it is often one-to-one, though not in some important cases; and for RDF it is often not one-to-one. The most obvious reason for this kind of difference is that XML documents are essentially tree-structured, but data models need not be. The HTML data model is essentially tree-structured, the W3C XML Schema component structure is quite close to being tree-structured, and RDF’s data model is not tree-structured at all.
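The gap between a tree-structured document and a non-tree data model can be illustrated with a minimal sketch. The document vocabulary below (`pipeline`, `step`, `input`, `from`) and the predicate names (`hasStep`, `readsFrom`) are hypothetical, chosen only for illustration; they are not taken from any specification. The element nesting is a tree, but once by-name references are resolved into edges, the resulting graph is not.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document: 'step' elements nest inside 'pipeline' (a tree),
# but 'input' elements point at other steps by name.
DOC = """
<pipeline name="p">
  <step name="a"/>
  <step name="b"><input from="a"/></step>
  <step name="c"><input from="a"/></step>
</pipeline>
"""

def to_triples(root):
    """Map the element tree to (subject, predicate, object) triples."""
    triples = []
    for step in root.findall("step"):
        triples.append((root.get("name"), "hasStep", step.get("name")))
        for inp in step.findall("input"):
            triples.append((step.get("name"), "readsFrom", inp.get("from")))
    return triples

def could_be_tree(triples):
    """Check a necessary condition for a tree: no node has more than one incoming edge."""
    incoming = Counter(obj for _, _, obj in triples)
    return all(count <= 1 for count in incoming.values())

triples = to_triples(ET.fromstring(DOC))
print(could_be_tree(triples))
```

Here step `a` is reached both from the pipeline and from steps `b` and `c`, so the data model derived from a perfectly tree-shaped document already fails the tree condition.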
This paper describes a novel approach to stating what it calls the “proximate semantics” of an XML language, that is, the mapping from XML information sets to language-specific (abstract) data models. The approach has three parts:
1) A set of conventions for constructing UML models, using the Violet open source graphical UML diagram editor;
2) A pipeline of XSLT stylesheets to convert the XML representation of those diagrams to OWL ontologies;
3) A set of guidelines for writing XSLT stylesheets or other transformations (e.g. pipelines) to implement GRDDL-triggered mapping from language documents to data model instances expressed in RDF.
Once this approach is implemented, an OWL ontology for a language data model and an RDF instance corresponding to an individual language document can be combined and checked for consistency. The combination, if consistent, can then also be compared to (RDF expressions of) concrete data model instances from an implementation. This would enable semi-automatic conformance testing, provided the language specification actually included the three parts listed above.
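The idea of combining an ontology with an instance and checking the result for consistency can be sketched in a few lines. This is a toy stand-in, not an OWL reasoner and not the paper's tooling: the ontology is reduced to property domains, property ranges and disjoint class pairs, and every class and property name is hypothetical. As in OWL, domains and ranges are used to infer types rather than to reject triples; an inconsistency arises only when a node ends up in two disjoint classes.

```python
# Toy ontology (all names hypothetical): property domains and ranges,
# plus pairs of classes declared disjoint.
ONTOLOGY = {
    "domain": {"hasStep": "Pipeline", "readsFrom": "Step"},
    "range": {"hasStep": "Step", "readsFrom": "Step"},
    "disjoint": [("Pipeline", "Step")],
}

def infer_types(triples, ontology):
    """Collect asserted types and the types implied by domains and ranges."""
    types = {}
    for s, p, o in triples:
        if p == "type":
            types.setdefault(s, set()).add(o)
            continue
        if p in ontology["domain"]:
            types.setdefault(s, set()).add(ontology["domain"][p])
        if p in ontology["range"]:
            types.setdefault(o, set()).add(ontology["range"][p])
    return types

def inconsistencies(triples, ontology):
    """Return (node, class_a, class_b) for each node placed in two disjoint classes."""
    types = infer_types(triples, ontology)
    problems = []
    for node, classes in sorted(types.items()):
        for a, b in ontology["disjoint"]:
            if a in classes and b in classes:
                problems.append((node, a, b))
    return problems

# Instance triples such as a document-to-RDF mapping might produce.
good = [("p", "type", "Pipeline"), ("p", "hasStep", "a"), ("b", "readsFrom", "a")]
# Adding a step that "contains" the pipeline makes both nodes Pipeline and Step.
bad = good + [("a", "hasStep", "p")]

print(inconsistencies(good, ONTOLOGY))
print(inconsistencies(bad, ONTOLOGY))
```

Under the same assumptions, comparing a consistent instance with one dumped from an implementation could be as simple as comparing the two sets of triples, which is the step that would support semi-automatic conformance testing.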
Throughout the paper the points under discussion are illustrated with examples taken from the XML Processing Model language, currently under development by the W3C XML Processing Model Working Group.
Connections are also made to earlier work on expressing data-binding information via schema annotations, which suggests the possibility of auto-generating the stylesheets required for part (3) above in some cases.
Henry S. Thompson divides his time between the School of Informatics at the University of Edinburgh, where he is Reader in Artificial Intelligence and Cognitive Science, based in the Language Technology Group of the Human Communication Research Centre, and the World Wide Web Consortium (W3C), where he works in the XML Activity.
He received his Ph.D. in Linguistics from the University of California at Berkeley in 1980. His university education was divided between Linguistics and Computer Science, in which he holds an M.Sc. While still at Berkeley he was affiliated with the Natural Language Research Group at the Xerox Palo Alto Research Center, where he participated in the GUS and KRL projects. His research interests have ranged widely, including natural language parsing, speech recognition, machine translation evaluation, modelling human lexical access mechanisms, the fine structure of human-human dialogue, language resource creation and architectures for linguistic annotation. His current research is focussed on the semantics of markup, XML pipelines and more generally articulating and extending the architectures of XML.
He was a member of the SGML Working Group of the World Wide Web Consortium which designed XML and a major contributor to the core concepts of XSLT and W3C XML Schema, and is currently a member of the XML Core, XML Schema and XML Processing Model Working Groups of the W3C. He has been elected twice to the W3C TAG (Technical Architecture Group). He is lead editor of the Structures part of the XML Schema W3C Recommendation, for which he co-wrote the first publicly available implementation, XSV. He has presented many papers and tutorials on SGML, DSSSL, XML, XSLT, XML Schema and XML Pipelines in both industrial and public settings over the last ten years.