XML-DBMS: Frequently asked questions

Versions 1.x and 2.0: How do I handle large documents in XML-DBMS?

Most large documents seem to fall into one of three categories:
1. Repeating data. For example, a single document contains thousands or
   millions of sales orders, or a large set of astronomical data. In this
   case, there is generally a wrapper element around what is really a set
   of separate documents or "rows" of data. For example, the following
   document consists of multiple separate sales orders, each of which can
   be inserted separately:

   <SalesOrders>
      <SalesOrder>
         ...
      </SalesOrder>
      <SalesOrder>
         ...
      </SalesOrder>
      <SalesOrder>
         ...
      </SalesOrder>
   </SalesOrders>

2. Related data. In this case, you simply have a huge amount of related
   data. I have heard of financial transactions that require multiple MB
   of XML because of all the contextual information that must be
   transmitted to process the actual transaction.

3. Documents. It is possible for documents (such as books) to be as
   large as 5 MB. However, for this to happen, the documents would probably
   need to include graphics encoded as Base64 -- 5 MB is a *lot* of text.

Because XML-DBMS uses DOM trees to represent XML documents, it has size
limitations. DOM trees are kept in memory and are larger than the
original document, so large documents can easily exceed available
memory. I'm not sure that 5 MB documents would cause problems on a
modern machine, though. Even if the DOM tree is 10 times larger than the
original document, this is still only 50 MB.

Large documents with repeating data (case 1) can be easily
processed by "cutting" them into separate documents, each of which is
processed separately. A "cutter" uses SAX to read the document and creates a DOM tree for each occurrence of a particular element, such as the <SalesOrder> element in the example above. As long as the sub-documents are not too large, it can process documents of any size.

One way to do this is to write an application that uses SAXDOMIX to split the document into smaller documents, then make a separate call to DOMToDBMS.storeDocument for each smaller document. Your application needs to implement the SDXController interface in SAXDOMIX, which consists of two methods: wantDOM and handleDOM. wantDOM returns true if you want a DOM tree returned for the element. In the above case, wantDOM would return true for <SalesOrder> elements and false for all other elements. When wantDOM returns true, SAXDOMIX passes the DOM tree to handleDOM. In the case of an XML-DBMS application, handleDOM would pass the DOM tree to DOMToDBMS.storeDocument.

SAXDOMIX is Open Source and is available from:

http://www.devsphere.com/xml/saxdomix/index.html

You can also find a simple, SAX-based "cutter" in section 7.1 of the IBM Redbook "XML for DB2 Information Integration":

http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246994.html?Open

Whether it is possible to process other types of large documents (cases
2 or 3) depends on whether it is possible to cut such documents into
separate pieces, each of which can be processed separately. (For
example, is it acceptable to insert sections of a book separately?) In
any case, you would probably need custom code to pre-process the
documents and break them into manageable pieces.
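
As a rough illustration of the "cutting" technique described above for case 1, the following sketch uses only the JDK's SAX and DOM APIs rather than SAXDOMIX itself. The class name SaxCutter, the cut method, and the Consumer<Document> callback are hypothetical names invented for this example. The callback plays roughly the role that handleDOM plays in SAXDOMIX; in an XML-DBMS application it is where the call to DOMToDBMS.storeDocument would go (not shown here, since its exact signature depends on the XML-DBMS version). The sketch ignores namespaces, comments, and processing instructions, so treat it as a starting point, not a finished cutter.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams a large document with SAX and builds one
// small DOM tree per "cut" element, so only one sub-document is in memory.
class SaxCutter {

    static void cut(InputStream in, String cutElement, Consumer<Document> handler)
            throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            private Document doc;  // tree under construction; null outside a cut element
            private Node current;  // node that new children are appended to

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                if (doc == null) {
                    if (!qName.equals(cutElement)) return;  // e.g. the wrapper element
                    doc = builder.newDocument();
                    current = doc;
                }
                Element e = doc.createElement(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    e.setAttribute(atts.getQName(i), atts.getValue(i));
                }
                current.appendChild(e);
                current = e;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (doc != null) {
                    current.appendChild(doc.createTextNode(new String(ch, start, length)));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (doc == null) return;
                current = current.getParentNode();
                if (current == doc) {      // the cut element itself has just closed
                    handler.accept(doc);   // hand the finished sub-document to the caller
                    doc = null;            // release the tree before reading the next one
                    current = null;
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        String xml = "<SalesOrders>"
                + "<SalesOrder id=\"1\"><Item>Widget</Item></SalesOrder>"
                + "<SalesOrder id=\"2\"><Item>Gadget</Item></SalesOrder>"
                + "</SalesOrders>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // In an XML-DBMS application, the callback would store each sub-document.
        cut(in, "SalesOrder", d ->
                System.out.println("Got order " + d.getDocumentElement().getAttribute("id")));
    }
}
```

Because each DOM tree is discarded as soon as the callback returns, memory use depends on the size of the largest sub-document, not on the size of the whole file.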