Enhancing XML Preservation and Workflows
With the proliferation of computers and networked resources, the amount of data available on personal computers and on the Web is growing exponentially. Handling data becomes more complex as the scale expands. Moreover, data collections change in place over time, so one of the important challenges is supporting the life cycle of data. In particular, relevant parts of the information have to be preserved and made accessible for future retrieval. XML is a widespread language for encoding documents in a form that is both machine-readable and human-readable. It plays a unique role on the Web and in other areas of digital life: publishers, libraries, warehouses, and technical writers, to name a few, all make extensive use of XML for encoding books, articles, product information, and documentation, for interchanging and transforming data, and for extracting and aggregating relevant pieces of information. Contemporary web services are moving away from traditional relational data models towards semi-structured data representations and NoSQL databases, in particular XML databases, for persisting data. A considerable technology stack has been built around XML, which makes it even stronger as a format. The specification of XML technologies is an ongoing process, and emerging implementations push progress further. However, there are still many open problems and challenges to be addressed in the XML domain. This thesis selects several of these open problems and proposes solutions, concretely: (i) versioning support for XML, (ii) XML database views as counterparts of relational database views, and (iii) support for XML document templating. Solving the XML versioning problem for a subset of use cases enables an XML persistence layer that tracks the history of changes and provides XML database functionality such as querying and indexing of data.
The concept of XML database views makes it possible to abstract away from the notion of XML files and to think in terms of customizable and editable abstract XML entities that originate in XML documents. Finally, support for XML document templating facilitates the separation of responsibilities while authoring XML documents. Being able to draw on the diverse expertise of developers in different phases of document creation, in turn, optimizes the whole authoring workflow. To highlight the practical value of this work, all the concepts described in this thesis are implemented and integrated into the TNTBase system. Over the last 4 years, TNTBase has been in continuous use in a number of projects, mostly research projects, and has matured through feedback from its users. The main goal of the TNTBase project was to provide a versioned repository for Open Mathematical Documents (OMDoc) with a strong focus on XML-related functionality. Since the OMDoc format combines data-like aspects (axioms, theorems, examples, etc.) with document-like aspects (sections, paragraphs, etc.), the range of applications was diverse, and therefore all the approaches have been generalized so that TNTBase can be adapted to other domains and XML languages. As a result, TNTBase has been deployed in multiple real-life scenarios, where it has been used on a daily basis.