XML assumes that documents can be represented by a single hierarchy of elements, each nesting neatly inside its parent. However, in the real world, documents are more complicated, and often contain overlapping structures: in a book, pages overlap with sections; in a play, speeches overlap with lines; in a collaborative document, comments overlap with each other.
There are many ways to represent such documents, from using empty elements or processing instructions as milestones in XML to adopting a dedicated syntax such as LMNL. Whichever approach is chosen, the next problem is how to check that the document is valid.
This paper presents Creole (Composable Regular Expressions for Overlapping Languages etc.). Creole is an extension to RELAX NG and follows its philosophy of defining a pattern that a valid document must match. Other schema languages for overlapping markup, such as SGML’s CONCUR and Rabbit/Duck grammars, treat a single document as multiple documents, each of which is validated separately. Creole instead validates the document as a whole, and treats markup languages that use overlap as languages in their own right.
As the paper will show, Creole, like RELAX NG, is readily implementable using Brzozowski derivatives: considering a document as a stream of events, the derivative of a pattern with respect to an event is a new pattern that should match the remaining events. Since every syntax for overlap can be mapped onto a stream of events, Creole can be applied whatever representation is used.
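To make the idea of a derivative concrete, here is a minimal sketch in Python. It is not Creole and not the paper's algorithm: it is a generic derivative-based matcher over a stream of event names, using only a handful of RELAX NG-style patterns (choice, group, interleave). The class names, the `deriv` and `matches` functions, and the `start:page` style event names are all assumptions made for illustration.

```python
from dataclasses import dataclass


class Pattern:
    """Base class for the illustrative pattern algebra (hypothetical names)."""


@dataclass(frozen=True)
class NotAllowed(Pattern):   # matches no stream at all
    pass


@dataclass(frozen=True)
class Empty(Pattern):        # matches only the empty stream
    pass


@dataclass(frozen=True)
class Event(Pattern):        # matches exactly one event with this name
    name: str


@dataclass(frozen=True)
class Choice(Pattern):       # matches if either branch matches
    left: Pattern
    right: Pattern


@dataclass(frozen=True)
class Group(Pattern):        # matches first, then second, in sequence
    first: Pattern
    second: Pattern


@dataclass(frozen=True)
class Interleave(Pattern):   # matches both branches, events freely mixed
    left: Pattern
    right: Pattern


def nullable(p):
    """True if the pattern matches the empty stream."""
    if isinstance(p, Empty):
        return True
    if isinstance(p, Choice):
        return nullable(p.left) or nullable(p.right)
    if isinstance(p, Group):
        return nullable(p.first) and nullable(p.second)
    if isinstance(p, Interleave):
        return nullable(p.left) and nullable(p.right)
    return False  # NotAllowed and Event


# Simplifying constructors keep derived patterns small.
def choice(a, b):
    if isinstance(a, NotAllowed):
        return b
    if isinstance(b, NotAllowed):
        return a
    return Choice(a, b)


def _pair(ctor, a, b):
    if isinstance(a, NotAllowed) or isinstance(b, NotAllowed):
        return NotAllowed()
    if isinstance(a, Empty):
        return b
    if isinstance(b, Empty):
        return a
    return ctor(a, b)


def deriv(p, event):
    """Pattern that must match the events remaining after `event`."""
    if isinstance(p, Event):
        return Empty() if p.name == event else NotAllowed()
    if isinstance(p, Choice):
        return choice(deriv(p.left, event), deriv(p.right, event))
    if isinstance(p, Group):
        d = _pair(Group, deriv(p.first, event), p.second)
        if nullable(p.first):
            return choice(d, deriv(p.second, event))
        return d
    if isinstance(p, Interleave):
        return choice(_pair(Interleave, deriv(p.left, event), p.right),
                      _pair(Interleave, p.left, deriv(p.right, event)))
    return NotAllowed()  # Empty and NotAllowed accept no further events


def matches(p, events):
    """Take the derivative for each event; accept if the result is nullable."""
    for e in events:
        p = deriv(p, e)
    return nullable(p)


# A page range and a section range that are allowed to overlap.
page = Group(Event("start:page"), Event("end:page"))
section = Group(Event("start:section"), Event("end:section"))
schema = Interleave(page, section)
```

With this schema, `matches(schema, ["start:page", "start:section", "end:page", "end:section"])` accepts a stream in which the page ends inside the section, while a stream that ends the page before starting it is rejected. Because the matcher only sees event names, the same check applies whether the events came from XML milestones or from another overlap syntax.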
The paper details the syntax of Creole and the algorithm for implementing it, gives examples of Creole schemas, and describes an implementation in XSLT 2.0.
Jeni Tennison is an independent consultant specialising in XSLT and XML schema development, currently contracted to TSO. She trained as a knowledge engineer, gaining a PhD in collaborative ontology development, and since becoming a consultant has worked in a wide variety of areas, including journal publishing, medieval manuscripts, legislation and financial services. She is author of several books including “Beginning XSLT 2.0” (Apress, 2005).
Jeni was an invited expert on the W3C’s XSL Working Group during the development of XSLT 2.0 and was one of the founders of the EXSLT initiative to standardise extensions to XSLT and XPath. She is currently working on the XProc pipeline definition language as an invited expert on the W3C’s XML Processing Working Group, on the Layered Markup and aNnotation Language (LMNL), and on the DataType Library Language (DTLL).