Going native: Use cases for native XML databases

Copyright 2005-2007 by Ronald Bourret

When native XML databases appeared on the heels of the XML 1.0 recommendation, most people weren't sure what to make of them. Were they a replacement for relational databases or a return to hierarchical databases? No, said the vendors. They were designed to manage large numbers of XML documents. But since many XML documents could be mapped to relational tables, the only obvious targets were document-centric XML, which lacked structure, and catalogs, which had deep hierarchies. Beyond that, it wasn't clear what they could be used for.

1.0 Got XML?

So what are the use cases for native XML databases? As John Merrells, one of the developers of Sleepycat Software's Berkeley DB XML, waggishly put it, there is only one use case, and that is simply, "Got XML?" In other words, if you have more than a handful of XML documents that you need to store, you should store them in a native XML database. The reasons are the same as for storing data in any other database -- ease of management, enhanced query performance, concurrent access, transactional safety, security, and so on.

Other vendors are more specific: native XML databases are used in a wide number of fields -- from genetics to health care to insurance -- and technologies -- from data integration to messaging to Web sites. This article describes the most common use cases and is based on discussions with roughly half of the native XML database vendors, as well as a handful of customers. It does not attempt to determine the size of the current native XML database market -- a task left to the market analysts -- nor does it attempt to describe the capabilities of individual native XML databases, which vary significantly.

While most real-world uses of native XML databases do not fit cleanly into any single category, it is possible to characterize them in terms of a limited number of use cases. The most popular of these are storing and querying document-centric XML, integrating data, and storing and querying semi-structured data. Native XML databases are used in these cases because the data involved does not easily fit the relational data model, while it does fit the XML data model.

Other common use cases for native XML databases include managing long-running transactions, handling rapidly evolving schemas, working with very large documents, querying hierarchical data, and running Web sites.

2.0 A very quick review of native XML databases

There are two different ways to "store" XML documents in a database. The first is to map the document's schema to a database schema and transfer data according to that mapping. The second is to use a fixed set of structures that can store any XML document. To understand the difference, consider how a sales order document might be stored in a relational database. The first method uses a set of tables designed specifically for storing sales orders: Orders, Items, Customers, Parts, and so on. The second method uses a set of tables designed to hold arbitrary XML documents: Elements, Attributes, Text, and so on.

Databases that support the first method are called XML-enabled databases. (Such functionality is also provided by third-party middleware, which is said to XML-enable a database.) Databases that support the second method are called native XML databases. (While our example shows how to build a native XML database on top of a relational database, virtually all native XML databases are built from scratch.) A more theoretically correct way to say this is that XML-enabled databases have their own data model -- relational, hierarchical, object-oriented -- and map instances of the XML data model to instances of their data model. Native XML databases use the XML data model directly.

XML-enabled databases are useful when publishing existing data as XML or importing data from an XML document into an existing database. However, they are not a good way to store complete XML documents. The reason is that they store data and hierarchy but discard everything else: document identity, sibling order, comments, processing instructions, and so on. In addition, because they require design-time mapping of schemas, they cannot store documents whose schema is not known at design time. Native XML databases, on the other hand, store complete documents and can store any document, regardless of schema.

For a complete discussion of native XML databases, see Introduction to Native XML Databases or XML and Databases. In this paper, we will only define two more terms. In a native XML database, a document is the fundamental unit of storage, equivalent to a row in a relational database. While this might be a document in the traditional sense, such as a chapter in a book, it might simply be a set of related data, such as the sequence of a gene, a list of known mutations to the gene, and a list of papers about that gene. The latter sense is important for things like semi-structured data. A collection is a set of related documents and plays a role similar to that of a table in a relational database or a directory in a file system.

And now, on to the use cases.

3.0 Storing and querying document-centric XML

As might be guessed, the most common use case for native XML databases is storing and querying document-centric XML. Of the vendors surveyed, all but one listed it as a use case, and many noted it was their most common use case.

3.1 Documents in the real world

Let's start with some examples of how native XML databases are used with document-centric XML in the real world:

Native XML databases are also used to store a variety of other types of documents, such as contracts, case law, drug information sheets, insurance claims, e-forms, product support procedures, classified ads, and intelligence documents.

3.2 Inside the applications

Applications use document-centric documents in a variety of ways, but most use falls into four broad categories: managing documents, finding documents, retrieving information, and reusing content.

3.2.1 Managing documents

Many applications need to store and retrieve documents. For example, a content management system might add a new document to its data store or a Web server might retrieve a document for display. At the level of a native XML database, managing documents is quite simple. Applications either submit documents to be stored or request documents to be retrieved; the latter functionality uses a document ID, which is usually assigned by the user. (More complex functionality, such as the versioning, check-in/check-out, and workflow facilities found in content management systems, is usually built on top of the database.)

3.2.2 Finding documents

A wide class of applications needs to find whole documents. For example, a Web portal might allow users to search for all documents about a particular company and a content management system might allow users to find all documents relating to a certain part.

The least complex way to search for documents is with full-text searches. In native XML databases, these are XML-aware. That is, they distinguish between content (which is searched) and markup (which is not).

More complex are structured queries, which can query markup, text, or both. (XPath and XQuery are examples of structured query languages; native XML databases support a number of proprietary languages as well.) For example, consider the following queries:

Although these queries are relatively simple, none can be satisfied by a full-text search: the first two queries restrict the search to certain sections of the document and the third doesn't even query the text. It is also interesting to note that these queries do not require all documents to use the same schema -- they only require documents to contain certain common elements that have roughly the same meaning.

3.2.3 Retrieving information

Although documents contain useful data, they haven't traditionally been used as a source of data. XML and XML query languages make that possible. For example, consider the following queries:

These queries are fundamentally different from those that return whole documents to be read or modified. Instead, they answer questions, create reports, or construct entirely new documents.

3.2.4 Reusing content

Reuse represents an important way for companies to extend the value of their investment in content. For example:

Other examples include building contracts from libraries of boilerplate text, publishing blogs and news releases as RSS feeds, and making documents available through an internal Web portal for arbitrary reuse.

3.3 Why you need a native XML database

Assuming they were built at all, many of the applications described earlier were originally built from full-text search engines, relational databases, and flat files. For example, they stored metadata in the relational database and documents in the file system, or they stored documents as CLOBs and copied them to the file system for use with a full-text search engine.

These systems suffered from two main problems. The first was scalability. According to one vendor, such systems "usually degrade very quickly past a few thousand documents, while the applications typically involve millions of documents." The second problem was the lack of structured queries, since the full-text search engines were not XML aware and queries over metadata were limited to a few fields. Other problems included synchronization between the database and non-database components, the need to write custom code to process results, lack of node-based updates (a problem for large documents), brittleness in the face of evolving schemas, and the usual laundry list of reasons for using a database in the first place: concurrency, security, transactional safety, and so on.

(This is not to say that such systems don't work, especially under controlled conditions. For example, the American Geophysical Union uses one to manage roughly 95,000 scientific papers. Full papers are stored in a search engine repository, which provides full-text searches to Web site users. Metadata is extracted and stored in a relational database, where it is used to generate abstracts, bibliographic entries, and live citations. In the system's favor are a low growth rate (about 25 papers per day), very few updates, XML-awareness in the search engine, and a slowly evolving schema. The latter is particularly important, as schema evolution is the biggest problem in the system, requiring changes to the database schema, the XML-to-database mapping, and the extraction code.)

On the other hand, native XML databases have a number of features that are useful for working with document-centric XML. The most important are the XML data model, which is flexible enough to model documents, XML-aware full-text searches, and structured query languages like XQuery. These allow documents to be stored and queried in a single location, rather than multiple locations held together by glue code. Other useful features include node-level updates (which reduce the cost of updating large documents), links, versioning, and more flexibility in handling schema evolution than is found in relational databases. And while not all native XML databases can scale into the gigabyte or terabyte range, some clearly can.

3.4 A peek into the future

We will finish with a look at some recent developments in the field of content management:

4.0 Data integration

The second major use of native XML databases is data integration. XML is well-suited to data integration because of its flexible data model and machine-neutral text format. In addition, XQuery is a good data integration language because of its ease of use, support for transformations, and ability to join data from different documents (data sources). Finally, there are a large number of tools for converting data from various formats to XML.

4.1 Data integration in the real world

Let's start by looking at some of the ways in which native XML databases are used to integrate data:

4.2 Inside the applications

Data integration applications must solve a number of problems, from data access to security to change management. In this section, we will look at the architectures used to solve two of these problems: queries and mapping schemas.

4.2.1 Query architectures

There are two query architectures for integrating data with a native XML database: local and distributed. In a local query architecture, data is imported into the database as XML and queried locally. In a distributed query architecture, data resides in remote sources and the query engine distributes queries across those data sources. The engine then compiles results and returns them to the application.

The main advantage of local queries is that they are faster, since no remote calls are made. They are also simpler to optimize, and the engine is simpler to implement, as all queries are local. Their main disadvantage is that data may be stale. A secondary problem is access control, as the local store must enforce controls previously handled by each source. Distributed queries have the opposite advantages and disadvantages: data is live, but queries are slower and harder to optimize and the engine is more complex.

Which architecture to use depends on a number of factors:

If a data source cannot be included in a distributed query, its data must be imported and queried locally. About a third of the commercial native XML databases, including most of the popular ones, can import data. Support ranges from a few sources, such as relational databases and Microsoft Word, to hundreds of sources. Some databases also have APIs so you can write your own import modules. And a few databases can refresh data in response to triggers or timeouts. If your database cannot import data, you must do so yourself, such as with the help of a third-party converter.

4.2.2 Handling differences in schemas

The biggest problem in integrating data is handling differences in schemas. With structural differences, the same concept is represented differently, such as a name using one or multiple fields. With semantic differences, slightly different concepts are represented; these can be straightforward (a price is in US dollars or Euros) or subtle (a price includes a discount). Handling schema differences is mostly just hard work, although some differences cannot be completely resolved. Since the actual resolutions depend on the differences, this section looks at where differences can be resolved.

If data is grouped by schema, such as in relational databases or (sometimes) in native XML database collections, three architectures are common:

If data is not grouped by schema, application logic is more complex, since is it no longer possible to write location-based queries. If documents are limited to a set of schemas, it may be possible to base queries on document structure. For example, our species name query could check if the root element is Bsml or MAGE-ML before handing a document off to a function that extracts the species name.

If documents can have any schema, such as when a law office subpoenas all documents pertaining to a particular company and automatically converts them to XML, the best that can usually be done is to explore the documents with a browser to determine if there are common structures or fields. If so, it may be possible to use these in production queries or convert documents so that such queries are possible. As a last resort, users can always perform XML-aware full-text searches.

It may also be possible to ignore differences. For example, suppose an application displays information about a customer from multiple sources. Since a human is reading the information, they can resolve many differences, such as whether a name is stored in one or two fields.

4.3 Why you need a native XML database

Vendors report that most of their data integration customers were not able to solve their problems without a native XML database. The problem was that other solutions, such as federated relational databases and data integration toolkits, either could not model the types of data involved (documents, semi-structured data, hierarchical data, and so on), could not handle data whose schema was unknown at design time, and/or could not handle data whose schema changed frequently.

Native XML databases solve the first two problems with the XML data model, which is considerably more flexible than the relational model and can handle schemaless data. While native XML databases do not provide a complete solution for schema evolution, they can at least store documents with rapidly evolving schemas, as is discussed later.

Another advantage of native XML databases is that many support XQuery which, as was mentioned earlier, is a good data integration language.

4.4 A peek into the future

An interesting use of native XML databases in data integration is as a repository for metadata and semantic information. For example, the XML Business Information Portfolio from Software AG uses Tamino to store metadata and semantic information, as well as how to retrieve data from backend sources. While non-trivial to set up, this allows applications to execute queries against a common schema.

CompuCredit used an early version of this repository to integrate customer data from more than 100 systems and databases, each of which is exposed as a Web Service. In response to a query, the repository constructs an XML document from backend data and ships it across an Enterprise Service Bus to the customer service representative, who receives a single view of the customer's data.

5.0 Working with semi-structured data

Managing semi-structured data is the third major use case for native XML databases. Semi-structured data has some structure, but isn't as rigidly structured as relational data. While there is no formal definition for semi-structured data, some common characteristics are:

5.1 Semi-structured data in the real world

Semi-structured data occurs in many fields. For example, here are some of the types of semi-structured data that are being stored in native XML databases today:

5.2 Inside the applications

Applications that work with semi-structured data that has a known schema are not significantly different from applications that work with other kinds of data. For example, they use queries defined at design time to retrieve and update data. The main difference is that they often must handle data represented in different ways in different parts of the data set. While this may be unpleasant, as long as the number of variations is limited, it is usually possible.

Applications that work with semi-structured data containing fields not known at design time are fundamentally different. As a general rule, such applications pass unknown fields to humans for processing. For example, suppose a catalog has a basic structure defined by a central authority and uses vendor-specific XML to describe individual items. A catalog browser might be hard-coded to navigate the known structure and use XML-aware full-text searches or // searches to search the unknown structure. Product data might be displayed as raw XML or converted to XHTML with a stylesheet that displays data based on nesting level.

Similar applications are found in molecular biology, genetics, health care, and library science. In each case, the data describes something -- a molecule, a gene, a patient, an archive -- and many of the fields are known. The application uses these fields, such as to allow the user to drill into the data, and then displays the unknown fields. The person reading the data can interpret it and take further action, such as reading a scientific paper, making a diagnosis, or adding comments.

Another common solution is for the application to evolve with the data. For example, incoming documents can be examined with a generic browser to decide what kinds of queries are possible. In some cases, it might be possible to write specific queries, such as //address to search for addresses; in other cases, the only choice might be full-text searches. While this kind of development is likely to be repugnant to programmers accustomed to working with well-defined schemas, it is a huge improvement for users whose previous choice was to wade through reams of paper or search files in a variety of formats using a variety of tools.

5.3 Why you need a native XML database

XML is a good way to represent semi-structured data: it does not require a schema; it is self-describing (albeit minimally so); and it represents sparse data efficiently. Thus, native XML databases are a good way to store semi-structured data. They support the XML data model, they can index all fields (even those unknown at design time), they support XML query languages and XML-aware full-text searches, and some support node-based updates.

Relational databases, on the other hand, do not handle semi-structured data well. The main problem is that they require rigidly defined schemas. Thus, fields not known at design time must be stored abstractly, such as with property-value pairs, which are difficult to query. They are also difficult to change as the schema evolves. A secondary problem is that they do not handle sparse data efficiently: the choices are a single table with lots of NULLs, which wastes space, or many sparsely populated tables, which are expensive to join.

According to vendors, many customers couldn't handle their semi-structured data until they used a native XML database. Other customers used a variety of tools, such as grep, full-text search engines, and proprietary applications, or stored some data in a relational database and complete documents as flat files, CLOBs, or even Word documents. As a general rule, these solutions worked in the initial stages, but had limited query capabilities, didn't scale well, and were difficult to maintain as schemas evolved.

(A notable exception occurred in the field of biology. The AceDB database was initially written to store data about the worm C. elegans. It has since evolved into a generic, object-oriented database with its own schema, query languages, and data browsers. Other databases, such as UniProt/Swiss-Prot and GenBank are (apparently) available in relational and flat-file formats, but are generally queried through proprietary tools such as SRS and Entrez.)

5.4 A peek into the future

Semi-structured data is still straddling the boundary between academia and industry, so the near term is most likely to consist of gaining experience -- managing data, writing applications, handling evolution, and so on -- than creating definitive tools.

6.0 Schema evolution

As someone accustomed to the relatively rigid schemas of the relational world, I react to stories of rapid schema evolution with a mixture of horror and a sense that perhaps the people involved aren't as, well, responsible as they should be. In spite of this, almost every vendor and customer I spoke to listed schema evolution as one reason to use a native XML database. Worse yet, most had good reasons for doing so.

6.1 Schema evolution in the real world

Schema evolution is a normal thing. In the relational world it moves slowly, for both technical and political reasons. On the technical side, relational databases do not handle schema changes easily: existing data must be updated to match the new schema and altering tables may require unloading and reloading data. On the political side, database administrators (DBAs) tend to approach change cautiously because they don't want to break existing applications or destabilize database tuning.

(A number of vendors also noted that native XML databases allow developers to do an end run around DBAs, resulting in faster development times. One reason for this might be that native XML databases are often used to cache data on the middle tier, meaning DBAs are not aware of them and have not yet brought them under their control. Another reason might be that native XML databases do not have as many tuning options as relational databases, meaning that DBAs have less reason to exert control.)

In the XML world, change moves faster. This is sometimes due to the newness of XML. For example, FIXML has had four versions in six years and FpML has had four versions in just three years. XML has also exposed users to more sources of change. For example, the schemas used to move data across organizational boundaries are often controlled by other departments or trading partners. And XML is being used in rapidly evolving fields, such as finance and biology, as well as fields with long life spans, such as mortgage and insurance contracts, both of which force users to handle many versions of a schema.

6.2 Inside the applications

Handling schema evolution is rarely easy. The easiest solution is to update data to conform to the new schema and update applications accordingly. Unfortunately, this is not always possible. For example, updating existing data might be too expensive or might be prohibited (such as with contracts), new fields might not have reasonable defaults, or multiple applications might use the data and cannot all be updated.

When data cannot be updated, applications must handle both backwards and forwards compatibility. Since documents conforming to multiple versions of a schema are commonly stored together in native XML databases, applications must determine which version of a schema is being used, such as by checking a version attribute or checking whether a particular field exists.

Handling backwards compatibility usually just means a lot of hard work, such as providing default values for fields added in a new schema or processing each version of a field differently. However, some problems have no definitive solution, such as how to compute the average of a field not found in all documents.

Handling forwards compatibility means protecting applications against an unknown future. A liberal strategy is to ignore all unrecognized fields. Unfortunately, this is risky, as new fields may change the semantics of existing fields. A more conservative strategy is to only process documents with a known version number. This allows applications to continue working until it can be determined whether a schema change breaks existing code.

A strategy that avoids many forwards and backwards compatibility problems is to query only those fields that are unlikely to change. This works particularly well when humans are involved. For example, a customer service representative might search for contracts involving a particular customer or a researcher might search for documents describing a particular organism. In both cases, searches are done on stable fields (customer or species name) and the reader can resolve any differences in schemas.

6.3 Why you need a native XML database

The main advantage of native XML databases with respect to schema evolution is the ability to store documents conforming to several different versions of a schema. This has several advantages over relational databases, which require data to conform to a single schema:

A secondary advantage of native XML databases is support for XQuery. The conditional expressions and user-defined functions in this language are very useful in querying documents conforming to multiple versions of a schema.

This is not to say that native XML databases solve all schema evolution problems. Far from it -- schema evolution remains a painful problem that requires both foresight and hard work. However, the consensus among vendors and customers is that the flexibility of native XML databases makes solutions possible where they weren't before.

6.4 A peek into the future

The only happy news about schema evolution is that more people are becoming aware of the problem, as evidenced by the number of emails and conference presentations on the subject. Personally, I hold little hope for any silver bullets, as the problem pre-dates XML and has not been solved yet.

7.0 Long-running transactions

Long-running transactions are real-world transactions such as processing insurance claims, approving mortgages, or fulfilling orders. They generally require a mixture of human and machine processing and take anywhere from hours to weeks. They differ from traditional transactions in that they do not lock resources for the duration of the transaction and they use compensating transactions, such as refunds, instead of rollbacks.

How data flows through a long-running transaction depends on the application. It might be stored in a database and modified by a succession of applications, or it might be passed from application to application in one or more XML documents, as in a Service Oriented Architecture (SOA).

Native XML databases can be used in long-running transactions in a number of capacities:

It is interesting to note that several Enterprise Service Buses (ESBs), which are used to implement SOAs, include native XML databases: Sonic ESB includes Sonic XML Server, Software AG's Enterprise Service Integrator includes Tamino, and OpenLink's Virtuoso includes a BPEL engine. These systems use native XML databases for all of the reasons described above.

8.0 Handling large documents

Large documents are difficult to query due to the time it takes to parse them. Native XML databases solve this problem by parsing and indexing documents when they are inserted. This allows documents to be queried without further parsing and may even allow queries to be resolved only by searching indexes.

Large documents are also difficult to process with XSLT and DOM, as these require the entire document to be in memory. Since sufficiently large documents exceed available memory, some native XML databases solve this problem by implementing XSLT and DOM directly on top of the database. These implementations populate in-memory nodes as necessary and swap nodes back to disk as needed, allowing DOM and XSLT to be used with documents of almost arbitrary size. In addition, changes made to DOM trees are reflected back to the database, either immediately or in response to a special call.

The main use of such DOM implementations is in browsers and editors for document-centric documents, such as catalogs and technical manuals. While most of these are custom applications built on top of native XML databases, Infonyte has built a customizable browser (the Infonyte Reader) on top of its native XML database (Infonyte DB), which features both query and XSLT engines.

9.0 Hierarchical data

Hierarchical data is a use case that overlaps all other use cases, since virtually all XML is hierarchical. Hierarchical data is either heterogenous, like sales orders, in which parents and children have different types, or homogenous, like catalogs or bills of material, in which parents and children have the same type. In a relational database, heterogenous hierarchies are stored in multiple tables, which must be joined during queries, and homogenous hierarchies are stored in a single table, for which there are a variety of query strategies, including nested sets and recursive queries.

While there is little public data available for the relative performance of native XML databases and relational databases in querying hierarchical data, it is interesting to note that Xyleme Zone Server outperforms Oracle 9i by a factor of 19.5 when using Oracle's test data and Oracle's object-relational XML storage. Similarly, another native XML database vendor asserted that even "three to four levels [in a heterogenous hierarchy] presents a problem for relational [databases], once the [number] of documents is in the hundreds of thousands." (Of interest, vendors report that most of their customer's hierarchies are five to ten levels deep, although up to thirty levels are not uncommon.)

Unfortunately, similar data is not available for relative query performance in homogenous hierarchies. However, there is persistent confusion among SQL programmers about how to store and query hierarchical data. While this may be alleviated with the introduction of recursive queries in relational databases (available for several releases in Oracle, one release in DB2, and the next release in SQL Server), perhaps the most important thing that native XML databases bring to the table with respect to hierarchical data is a set of tools -- notably query languages -- that are explicitly designed for working with hierarchies.

10.0 Other uses

This article has described the most common use cases for native XML databases. Some other use cases are:

11.0 A final peek into the future

Our final peek into the future looks at relational databases. In the strongest endorsement of native XML databases to date, the major relational databases are adding native XML storage. This is used to implement a first-class XML data type and data stored as this type can be queried with XPath or XQuery. It can also be mixed with relational data.

The implementation strategies used by relational databases are as varied as those found in commercial native XML databases: Oracle indexes documents and stores them as CLOBs; Sybase also indexes documents (it is not known how they store them); SQL Server stores pre-parsed documents as BLOBs, as well as in node-level storage built on relational tables (the query engine decides which to use at run time); and DB2 uses node-level storage built from the ground up.

In addition, Oracle is working on XML Data Synthesis (XDS), an XQuery-based data integration engine.

12.0 Conclusion

This article has looked at how native XML databases are used in the real world -- most commonly for managing documents, integrating data, and managing semi-structured data. What is important about these uses is that most represent cases where people have tried to use relational or other types of databases and have either failed or written less sophisticated applications than they would like. Native XML databases have succeeded because of their query languages (most notably XQuery, but also XML-aware full-text queries), the flexibility of the XML data model, and their ability to handle schemaless data.

So is a native XML database in your future? That question is best answered by quoting Arun Gaikwad. In an article about Xindice, a native XML database from Apache, he wrote: "A [native XML database] is something which you may think is unnecessary but once you start using it, you wonder how you would survive without it."

13.0 Thanks

I would like to thank representatives of the following companies and organizations for contributing their time and thoughts to this article: American Geophysical Union, Bluestream Database Software, Cincom, data ex machina, IBM, Ipedo, ISPRAS modis, IXIASOFT, M/Gateway Developments, Mark Logic, Ontonet, OpenLink Software, QuiLogic, RainingData, Snapbridge Software, Software AG, X-Hive, Xpriori, and Xyleme. Thanks also to developers and users who chose to remain anonymous.

14.0 Resources

Use cases for native XML databases:

Selected case studies:

Native XML databases: