A Query Language for XML

1. Introduction

The goal of XML is to provide many of SGML's benefits that are not available in HTML, and to provide them in a language that is easier to learn and use than complete SGML. These benefits include user-defined tags, nested elements, and optional validation of document structure with respect to a Document Type Definition (DTD).

One important application of XML is the interchange of electronic data (EDI) between two or more data sources on the Web. Electronic data is primarily intended for computer, not human, consumption. For example, search robots could automatically integrate information from related sources that publish their data in XML format, e.g., stock quotes from financial sites or sports scores from news sites; businesses could publish data about their products and services, and potential customers could compare and process this information automatically; and business partners could exchange internal operational data between their information systems on secure channels. New opportunities will arise for third parties to add value by integrating, transforming, cleaning, and aggregating XML data. In this paper, we focus on XML's application to EDI. Specifically, we take a database view, as opposed to a document view, of XML: we consider an XML document to be a database and a DTD to be a database schema.

EDI applications require tools that support the following tasks:

  • extraction of data from large XML documents,
  • conversion of data between relational or object-oriented databases and XML data,
  • transformation of data from one DTD to a different DTD, and
  • integration of multiple XML data sources.

Data extraction, conversion, transformation, and integration are all well-understood database problems. Their solutions rely on a query language, either relational (SQL) or object-oriented (OQL). We present a query language for XML, called XML-QL, which we argue is suitable for performing the above tasks. XML-QL has the following features:

  • It is declarative.
  • It is "relationally complete"; in particular, it can express joins.
  • It is simple enough that known database techniques for query optimization, cost estimation, and query rewriting could be extended to XML-QL.
  • It can extract data from existing XML documents and construct new XML documents.
  • It can support both ordered and unordered views on an XML document.

An initial draft of the query language is a W3C note.

One salient question is why we do not simply adapt SQL or OQL to query XML. The answer is that XML data is fundamentally different from relational and object-oriented data, and therefore neither SQL nor OQL is appropriate for XML. The key distinction between data in XML and data in traditional models is that XML is not rigidly structured. In the relational and object-oriented models, every data instance has a schema, which is separate from and independent of the data. In XML, the schema exists with the data as tag names. For example, in the relational model, a schema might define the relation person with attribute names name and address, e.g., person(name, address). An instance of this schema would contain tuples such as ("Smith", "Philadelphia"). The relation and attribute names are separate from the data and are usually stored in a database catalog.
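
The separation between schema and data in the relational model can be seen in a small sketch. The following uses Python's built-in `sqlite3` module purely as a convenient stand-in for a relational system (an assumption of this example, not something the paper uses); the relation person(name, address) and the tuple ("Smith", "Philadelphia") are taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, address TEXT)")
conn.execute("INSERT INTO person VALUES (?, ?)", ("Smith", "Philadelphia"))

# The data: bare tuples that carry no attribute names of their own.
rows = conn.execute("SELECT * FROM person").fetchall()

# The schema: held separately, in the catalog (sqlite_master in SQLite).
column_names = [info[1] for info in conn.execute("PRAGMA table_info(person)")]
catalog_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'person'"
).fetchone()[0]

print(rows)          # the instance
print(column_names)  # the attribute names, read from the catalog
print(catalog_sql)   # the stored schema definition
```

Without the catalog, the tuple alone does not say which value is the name and which is the address.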

In XML, the schema information is stored with the data. Structured values are called elements. What the relational model calls attribute names correspond here to element names, which are called tags; elements may also have attributes of their own, whose values are always atomic. For instance, <person><name>Smith</name><address>Philadelphia</address></person> is well-formed XML. Thus, XML data is self-describing and can naturally model irregularities that cannot be modeled by relational or object-oriented data. For example, data items may have missing elements or multiple occurrences of the same element; elements may have atomic values in some data items and structured values in others; and collections of elements can have heterogeneous structure. Even XML data that has an associated DTD is self-describing (the schema is always stored with the data) and, except for restrictive forms of DTDs, may have all the irregularities described above. Most importantly, this flexibility is crucial for EDI applications.
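
The irregularities listed above can be illustrated with a short Python sketch using the standard `xml.etree.ElementTree` module. Only the first `person` record comes from the text; the other two records (Jones, Lee, and their addresses) and the helper `describe` are hypothetical, added to show a missing element, a repeated element, and an element that is atomic in one place and structured in another.

```python
import xml.etree.ElementTree as ET

DOC = """
<people>
  <person><name>Smith</name><address>Philadelphia</address></person>
  <person><name>Jones</name></person>
  <person>
    <name>Lee</name>
    <address>Boston</address>
    <address><street>1 Main St</street><city>Newark</city></address>
  </person>
</people>
"""


def describe(person):
    """Return (tag, value) pairs; a structured value is shown as a dict."""
    pairs = []
    for child in person:
        if len(child) == 0:  # atomic value: text only, no sub-elements
            pairs.append((child.tag, child.text))
        else:                # structured value: nested elements
            pairs.append((child.tag, {c.tag: c.text for c in child}))
    return pairs


root = ET.fromstring(DOC)
records = [describe(p) for p in root.findall("person")]
for record in records:
    print(record)
```

Each record names its own fields through its tags, so the three differently shaped records can sit in the same collection without a shared, fixed schema.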

Self-describing data has been considered recently in the database research community. Researchers have found this data to be fundamentally different from relational or object-oriented data, and called it semistructured data. Semistructured data is motivated by the problems of integrating heterogeneous data sources and modeling sources such as biological databases, Web data, and structured text documents, such as SGML and XML. Research on semistructured data has addressed data models, query-language design, query processing and optimization, schema languages, and schema extraction. The key observation in this paper is that XML data is an instance of semistructured data.

In designing XML-QL, we drew from other query languages for semistructured data [1, 2, 5]; tutorials describing some of the work on semistructured data can be found in [3] and [4]. XML-QL includes most features found in these languages, but it differs from all of them in several important respects. Specifically, this paper makes the following contributions:

  • We propose a data model for XML data that extends the semistructured-data model with order. This extension is necessary for XML documents, which are ordered.
  • We design a syntax for XML-QL that combines elements of the XML syntax with traditional query-language syntax.
  • We propose a novel semantics for XML-QL to support order in the input and output data.
  • We combine two powerful data-construction mechanisms, nested queries and Skolem functions, in a novel way.
  • We illustrate that XML-QL can be used for the tasks it has been designed for, such as data extraction, transformation, and integration.


 Additional Info
 No. 281
 Posted on 8 June, 2006