Mapping DTDs to Databases

Copyright (c) 2000-2005 by Ronald Bourret

(Last updated: September, 2005)

1.0 Overview

A common question in the XML community is how to map XML to databases. This article discusses two mappings: a table-based mapping and an object-relational (object-based) mapping. Both mappings model the data in XML documents rather than the documents themselves. This makes them a good choice for data-centric documents but a poor choice for document-centric documents. (The table-based mapping can't handle mixed content at all, and the object-relational mapping of mixed content is extremely inefficient.)

Both mappings are commonly used as the basis for software that transfers data between XML documents and databases, especially relational databases. An important characteristic in this respect is that they are bi-directional. That is, they can be used to transfer data both from XML documents to the database and from the database to XML documents. One consequence of this is that they are likely to be used as canonical mappings on top of which XML query languages can be built over non-XML databases. That is, the canonical mappings will define virtual XML documents that can be queried by a language such as XQuery.

In addition to being used to transfer data between XML documents and databases, the first part of the object-relational mapping is used in "data binding" -- that is, the marshalling and unmarshalling of data between XML documents and objects.

2.0 Table-based Mapping

There is an obvious mapping between the following XML document and table:

   <A>                     
      <B>
         <C>ccc</C>             Table A
         <D>ddd</D>             -------
         <E>eee</E>          C     D     E
      </B>                  ---   ---   ---
      <B>             <=>   ...   ...   ...
         <C>fff</C>         ccc   ddd   eee
         <D>ggg</D>         fff   ggg   hhh
         <E>hhh</E>         ...   ...   ...
      </B>
   </A>

This mapping is called a table-based mapping. It views the document as a single table or a set of tables. That is, the structure of the document must be either:

   <Table>
      <Row>
         <Column_1>...</Column_1>
         ...
         <Column_n>...</Column_n>
      </Row>
      ...
      <Row>
         <Column_1>...</Column_1>
         ...
         <Column_n>...</Column_n>
      </Row>
   </Table>

or:

   <Tables>
      <Table_1>
         <Row>
            <Column_1>...</Column_1>
            ...
            <Column_n>...</Column_n>
         </Row>
         ...
      </Table_1>
      ...
      <Table_n>
         <Row>
            <Column_1>...</Column_1>
            ...
            <Column_m>...</Column_m>
         </Row>
         ...
      </Table_n>
   </Tables>

with the caveat that column data can be represented either as PCDATA-only elements (shown) or attributes.

The obvious advantage of this mapping is that it is simple and easy to understand. Because it matches the structure of tables and result sets in a relational database, it is easy to write code based on this mapping. This code is fast and scales well, and is quite useful for certain applications, such as transferring data between databases one table at a time.

The mapping has several disadvantages, the most notable of which is that it only works with a very small subset of XML documents. In addition, it does not preserve physical structure (such as character and entity references, CDATA sections, character encodings, or the standalone declaration), document information (such as the document type or DTD), comments, or processing instructions.

Despite its limitations, the table-based mapping is widely implemented, such as in middleware used to transfer data between XML documents and relational databases, in Web application servers and XML-enabled databases to return result sets as XML, and in many home-grown applications. An important use is in implementing XQuery over relational databases, in which each table is treated as a virtual XML document (according to the table-based mapping) and then queried through XQuery.

3.0 Object-Relational Mapping

Because table-based mappings only work with a limited subset of XML documents, some middleware, most XML-enabled relational databases, and most XML-enabled object servers use a more sophisticated mapping, called an object-relational mapping. This models the XML document as a tree of objects that are specific to the data in the document, then maps these objects to the database.

(The name "object-relational" is actually a misnomer -- a better name would be object-based mapping. This is because the objects can be mapped to non-relational databases, such as object-oriented or hierarchical databases, or simply left alone, as is done in data binding systems. However, because object-relational is a well-known term and this mapping is most commonly used with relational databases, that term will be used here. In addition, all examples use relational tables.)

To understand the object-relational mapping, it is easiest to first look at some simple examples. To start with, notice that there is an obvious mapping between the following XML document, object, and row in a table:

        XML                  Objects                  Tables
   =============           ============           ===============

                                                     Table A
   <A>                     object A {                 -------
      <B>bbb</B>              B = "bbb"            B     C     D
      <C>ccc</C>    <=>       C = "ccc"    <=>    ---   ---   ---
      <D>ddd</D>              D = "ddd"           ...   ...   ...
   </A>                    }                      bbb   ccc   ddd
                                                  ...   ...   ...

Similarly, there is an obvious mapping between the following element type definition, class, and table schema:

            DTD                        Classes                   Table Schema
   ======================           ============           ==========================

                                    class A {              CREATE TABLE A
                                       String B;              B VARCHAR(10) NOT NULL,
   <!ELEMENT A (B, C, D)>    <=>       String C;    <=>           C VARCHAR(10) NOT NULL,
                                       String D;              D VARCHAR(10) NOT NULL
                                    }                      )

As a more complex example, consider the following XML document:

   <SalesOrder>
      <Number>1234</Number>
      <Customer>Gallagher Industries</Customer>
      <Date>29.10.00</Date>
      <Item Number="1">
         <Part>A-10</Part>
         <Quantity>12</Quantity>
         <Price>10.95</Price>
      </Item>
      <Item Number="2">
         <Part>B-43</Part>
         <Quantity>600</Quantity>
         <Price>3.99</Price>
      </Item>
   </SalesOrder>

which maps to the following objects:

      object SalesOrder {
         number = 1234;
         customer = "Gallagher Industries";
         date = 29.10.00;
         items = {ptrs to Item objects};
      }           /   \
                 /     \
                /       \
   object Item {       object Item {
      number = 1;         number = 2;
      part = "A-10";      part = "B-43";
      quantity = 12;      quantity = 600;
      price = 10.95;      price = 3.95;
   }                   }

and then to rows in the following tables:

   SaleOrders
   ----------
   Number   Customer                Date
   ------   --------------------    --------
   1234     Gallagher Industries    29.10.00
   ...      ...                     ...
   ...      ...                     ...
   
   
   Items
   -----
   SONumber   Item   Part   Quantity   Price
   --------   ----   ----   --------   -----
   1234       1      A-10   12         10.95
   1234       2      B-43   600        3.99
   ...        ...    ...    ...        ...

All of these are examples of object-relational mappings from DTDs to relational databases.

3.1 The Basic Mapping

An object-relational mapping is performed in two steps. First, an XML schema (a DTD in this case) is mapped to an object schema, then the object schema is mapped to the database schema. The two mappings can optionally be combined for a direct DTD-to-database mapping, as is done in most software today.

In considering this mapping, it is important to understand that the objects involved are specific to each DTD and not objects from the DOM. In particular, these objects model the data in the XML document, while the DOM models the structure of the XML document. For example, the following shows the object trees for data-specific objects and the DOM that would be created for the XML document in the example above:

    SalesOrder                  Document
     /      \                       |
   Item    Item                  Element_______
                          ______/ /  \ \______ \_________
                         /       /    \       \          \
                    Element  Element  Element  Element   Element
                       |        |        |    /  |  \ \    etc.
                     Text     Text     Text  /   |   \ \_______
                                            /    |    \        \
                                      Element Element Element  Attr
                                         |       |       |       |
                                       Text    Text    Text    Text

This distinction is particularly important when you consider how the data is stored in the database. To store the data-specific objects, you will need SalesOrders and Items tables; to store the DOM objects, you will need Document, Element, Text, and Attr tables. The most important difference is that non-XML applications can use the data-specific tables but can't use the DOM-specific tables.

3.1.1 Mapping DTDs to Object Schemas

The mapping starts by recognizing that element types are data types. Element types that have PCDATA-only content are called simple element types; the term is borrowed from XML Schemas. These hold a single data value and are equivalent to scalar data types in an object-oriented programming language. (Note that the word "scalar" is used here to mean "consisting of a single data value". In some languages, "scalar" data types -- in this sense of the word -- are represented by objects. The most notable example of this is the String data type in Java.) Attribute types are also simple types.

Element types that have element or mixed content, or that have attributes, are called complex data types; the term is again borrowed from XML Schemas. These hold a structured value and are equivalent to classes in an object-oriented programming language or structs in C. Note that an element type that has empty content but has attributes is still "complex". The reason for this is that the attributes also provide structure and are roughly equivalent to child PCDATA-only elements.

The object-relational mapping first maps simple types to scalar data types. For example, the element type Title might be mapped to a String, and the element type Price might be mapped to a float. It then maps complex types to classes, with each element type in the content model of the complex type mapped to a property of that class. The data type of each property is the data type to which the referenced element type is mapped. For example, a reference to a Title element would be mapped to a String property and a reference to a Price element would be mapped to a float property. References to complex element types are mapped to pointers/references to an object of the class to which the complex element type is mapped.

The last part of the mapping maps attributes to properties, with the data type of the property determined from the data type of the attribute. Note that attributes are equivalent to references to element types in a content model. This is because, like references in a content model, they are local to a given element type. The only conceptual difference is that the attribute type is defined locally, rather than at a global (DTD-wide) level, as is the case with element types.

For example, in the following, the simple element types B, D, and E and the attribute F are all mapped to Strings and the complex element types A and C are mapped to classes A and C. The content models and attributes of A and C are mapped to properties of classes A and C. That is, the references to B, D, and E in the content models of A and C are mapped to String properties (because the types are mapped to Strings) and the attribute F is also mapped to a String property. The reference to C in the content model of A is mapped to a property with the type pointer/reference to an object of class C because element type C is mapped to class C.

              DTD                         Classes
   =========================           =============

   <!ELEMENT A (B, C)>                 class A {
   <!ELEMENT B (#PCDATA)>                 String b;
   <!ATTLIST A                  ==>       C      c;
          F CDATA #REQUIRED>              String f;
                                       }


   <!ELEMENT C (D, E)>                 class C {
   <!ELEMENT D (#PCDATA)>       ==>       String d;
   <!ELEMENT E (#PCDATA)>                 String e;
                                       }

One important point to reiterate here is that mapping references to element types in a content model is different than mapping the element types themselves. Element types map to data types, while references to element types map to properties in structured data types (classes). The difference is clear when you consider an element type that is referenced in two different content models. In this case, each reference must be mapped separately, with the data types of the resulting properties determined by the data types to which the element types themselves (not the references) are mapped.

For example, consider the Title and Section element types in the following. Each of these element types is referenced in the content models of both Chapter and Appendix. Each reference is mapped separately for each parent element type. The data type of the properties to which the references to Title are mapped is String, since Title contains only PCDATA and is mapped to a String. The data type of the properties to which the references to Section are mapped is a pointer/reference to a Section object, since the Section element type is complex and is mapped to a Section class.

                     DTD                                      Classes
   =========================================               =============

   <!ELEMENT Chapter (Title, Section+)>                    class Chapter {
   <!ELEMENT Appendix (Title, Section+)>          ==>         String    title;
   <!ELEMENT Title (#PCDATA)>                                 Section[] sections;
   <!ELEMENT Section (#PCDATA | p | b | i)*>               }

                                                           class Appendix {
                                                              String    title;
                                                              Section[] sections;
                                                           }

The other important point to reiterate is that simple element types and attributes can be mapped to data types other than String. For example, an element type named Quantity might be mapped to an integer. When mapping from a DTD, this requires human intervention, since there is no way to predict the target data type from a PCDATA-only element type. However, when mapping from an XML Schema, the target type is known since XML Schemas have data types.

3.1.2 Mapping Object Schemas to Database Schemas

In the second part of the object-relational mapping, classes are mapped to tables (known as class tables), scalar properties are mapped to columns, and pointer/reference properties are mapped to primary key/foreign key relationships. For example:

     Classes                     Tables
   ============             =================
   class A {                Table A:
      String b;                Column b
      C      c;     ==>        Column c_fk
      String f;                Column f
   }

   class C {                Table C:
      String d;     ==>        Column d
      String e;                Column e
   }                           Column c_pk

Note that the tables are joined by a primary key (C.c_pk) and a foreign key (A.c_fk). Because the relationship between the parent and child elements is one-to-one, the primary key can be in either table. If the relationship is one-to-many, the primary key must be on the "one" side of the relationship, regardless of whether this is the parent or child. For example, if a SalesOrder element contains multiple Item elements, then the primary key must be in the SalesOrder table (the parent). But if each Item element contains a Part element, the primary key must be in the Part table (the child), since one part can occur in many items.

One consequence of using primary / foreign key relationships is that the mapping can be used for elements that occur recursively. For example, suppose you have the following element type definition:

   <!ELEMENT A (A?, B, C)>
   <!ELEMENT B (#PCDATA)>
   <!ELEMENT C (#PCDATA)>

In this case, A can occur as a child of itself. This is mapped to the database in the same way that any other reference to a complex element type is mapped -- with a primary key / foreign key relationship. The only difference in the recursive case is that the foreign key points to the same table as the primary key. For example, element type A could be mapped to the following class:

   class A
   {
      A      a; // Nullable
      String b;
      String c;
   }

and then to the following table, where a_fk is a foreign key pointing to the row containing a child A element:

   Table A
   =======
   Column a_pk
   Column a_fk
   Column b
   Column c

There are two possible sources of the primary key column. First, it can be created as part of the mapping, as is the case for the examples above. Second, an existing column or columns can be used as the primary key. For example, if a SalesOrder element type has a Number child, this might be mapped to a column that is used as the primary key column.

If a primary key column is created as part of the mapping, its value must be generated by the database or the data transfer software. While this is generally considered better database design than using data columns as primary keys, it has a disadvantage when used with XML, and this is that the generated key is meaningless outside the source database. Thus, when data with a generated key is transferred to an XML document, it will either contain a meaningless primary key (if the primary key value is transferred) or no primary key at all (if the key is not transferred). In the latter case, it may be impossible to re-identify the source of the data, which is a problem if the data is modified and returned to the database as an XML document.

3.1.3 Miscellany

Before we continue with the more complicated parts of the mapping, we need to mention three things. First, names can be changed during the mapping. That is, the DTD, object schema, and relational schema can all use different names. For example, the following DTD uses different names than the following class:

                DTD                                   Classes
   ===============================             =====================

   <!ELEMENT Part (Number, Price)>             class PartClass {    
   <!ELEMENT Number (#PCDATA)>         ==>        String numberProp;
   <!ELEMENT Price (#PCDATA)>                     float  priceProp;
                                               }

which uses different names than the following table:

         Classes                           Tables
   =====================             ==================

   class PartClass {                 Table PRT
      String numberProp;     ==>        Column PRTNUM
      float  priceProp;                 Column PRTPRICE
   }

Second, data types can be changed during the mapping. That is, the DTD, object schema, and database schema can all use different data types. For example, the following DTD uses a different data type for Price than the following class:

                DTD                                   Classes
   ===============================             =====================

   <!ELEMENT Part (Number, Price)>             class Part {    
   <!ELEMENT Number (#PCDATA)>         ==>        String number;
   <!ELEMENT Price (#PCDATA)>                     float  price;
                                               }

which uses a different data type than the following table:

         Classes                           Tables
   =====================             ==================

   class Part {                 Table Parts
      String number;     ==>        Column number VARCHAR(50)
      float  price;                 Column price DECIMAL(10,2)
   }

All that is required is that there be a defined conversion between the data types. Although lossy conversions are possible, they are certainly not encouraged.

Third, the objects involved in the mapping are conceptual. That is, there is no need to actually instantiate them when transferring data between an XML document and a relational database. (This is not to say that the objects can't be instantiated, either. Whether it is useful to instantiate the objects depends on the actual application.)

3.2 Mapping Complex Content Models

The content models we have seen so far are relatively simple. What happens with more complex content models, such as the following?

   <!ELEMENT A (B?, (C | ((D | E | F | G)*, (H | I)+, J?)))>

In this section, we will consider the various parts of the content model. Mapping the above example will be left as an exercise to the reader. (I've always wanted to say that :)

3.2.1 Mapping Sequences

As has already been seen, each element type referenced in a sequence is mapped to a property, which is then mapped either to a column or to a primary key / foreign key relationship. For example:

            DTD                          Classes                    Tables
   ======================              ============             ==============

   <!ELEMENT A (B, C)>                 class A {                Table A
   <!ELEMENT B (#PCDATA)>     ==>         String b;     ==>        Column b
                                          C      c;                Column c_fk
                                       }

   <!ELEMENT C (D, E)>                 class C {                Table C
   <!ELEMENT D (#PCDATA)>     ==>         String d;     ==>        Column d
   <!ELEMENT E (#PCDATA)>                 String e;                Column e
                                       }                           Column c_pk

3.2.2 Mapping Choices

As with sequences, each element type referenced in a choice is also mapped to a property, then either to a column or a primary key / foreign key relationship. The only difference from the way sequences are mapped is that the properties and columns are nullable. For example, suppose we changed the sequence in the content model of A to a choice. The mapping from DTD to object schema would then be as follows:

            DTD                                Classes
   ======================              ========================

   <!ELEMENT A (B | C)>                class A {
   <!ELEMENT B (#PCDATA)>     ==>         String b; // Nullable
                                          C      c; // Nullable
                                       }
   <!ELEMENT C (D, E)>                 class C {
   <!ELEMENT D (#PCDATA)>     ==>         String d;
   <!ELEMENT E (#PCDATA)>                 String e;
                                       }

and the mapping from object schema to database schema would be:

           Classes                               Tables
   ========================             ===========================

   class A {                            Table A (
      String b; // Nullable     ==>        Column b     // Nullable
      C      c; // Nullable                Column c_fk  // Nullable
   }
   class C {                            Table C (
      String d;                 ==>        Column d     // Not nullable
      String e;                            Column e     // Not nullable
   }                                       Column c_pk  // Not nullable

To see why this is true, consider the following XML document, which conforms to the above DTD. Because the choice in the content model of A requires that either B or C (but not both) be present as a child, one of the two corresponding properties (and columns) will always be null.

       XML                       Objects                    Tables
   =============               ============               ===========

                                                            Table A
                                                           ---------
   <A>                         object a {                  b    c_fk
      <B>bbb</B>      ==>         b = "bbb"      ==>      ---   ----
   </A>                           c = null                bbb   NULL
                               }                          ...   ...

Note that if the primary key used to join the tables was in table A, the corresponding foreign key column in C would not be nullable. This is because, if A did have a C child, the column would have to have a value to join it to the correct row in table A. If A did not have a C child, there would simply be no row in table C.

3.2.3 Mapping Repeated Children

Children that can occur multiple times in their parent, which are known as repeated children, are mapped to multi-valued properties and then either to multiple columns in a table or to a separate table, known as a property table.

If a content model contains repeated references to an element type, the references are mapped to a single property, which is an array of known size. This can be mapped either to multiple columns in a table or to a property table. For example, the following shows how to map a repeated reference to multiple columns in a table:

              DTD                              Classes                      Tables
   =========================                 ==============               ===========

   <!ELEMENT A (B, B, B, C)>                 class A {                    Table A
   <!ELEMENT B (#PCDATA)>         ==>           String[] b;      ==>         Column b1
   <!ELEMENT C (#PCDATA)>                       String   c;                  Column b2
                                             }                               Column b3
                                                                             Column c

If the + or * operator is applied to a reference, the reference is again mapped to a single property, which this time is an array of unknown size. Since the number of values can be arbitrarily large, the property must be mapped to a property table, which will contain one row for each value. The property table is linked to the class table by a primary key / foreign key relationship, where the primary key is in the class table. For example:

            DTD                            Classes                       Tables
   ======================               ==============               ==============

   <!ELEMENT A (B+, C)>                 class A {                    Table A
   <!ELEMENT B (#PCDATA)>      ==>         String[] b;      ==>         Column a_pk
   <!ELEMENT C (#PCDATA)>                  String   c;                  Column c
                                        }
                                                                     Table B
                                                                        Column a_fk
                                                                        Column b

3.2.4 Mapping Optional Children

Children that are optional in their parent are mapped to nullable properties, then to nullable columns. This has already seen for children that appear in a choice group, as in the following mapping from DTD to object schema:

              DTD                               Classes
   =========================            ========================

   <!ELEMENT A (B | C | D)>             class A {
   <!ELEMENT B (#PCDATA)>       ==>        String b; // Nullable
   <!ELEMENT C (#PCDATA)>                  String c; // Nullable
                                           D      d; // Nullable
                                        }
   <!ELEMENT D (E, F)>                  class D {
   <!ELEMENT E (#PCDATA)>       ==>        String e;
   <!ELEMENT F (#PCDATA)>                  String f;
                                        }

and object schema to database schema:

            Classes                               Tables
   ========================               ========================

   class A {                              Table A
      String b; // Nullable      ==>         Column b     // Nullable
      String c; // Nullable                  Column c     // Nullable
      D      d; // Nullable                  Column d_fk  // Nullable
   }
   class D {                              Table D
      String e;                  ==>         Column e     // Not nullable
      String f;                              Column f     // Not nullable
   }                                         Column d_pk  // Not nullable

It also applies when the ? or * operator is applied to a reference, as in the following mapping from DTD to object schema:

            DTD                                  Classes
   ======================               ==========================

   <!ELEMENT A (B?, C*)>                class A {
   <!ELEMENT B (#PCDATA)>      ==>         String b;   // Nullable
   <!ELEMENT C (#PCDATA)>                  String c[]; // Nullable
                                        }

and object schema to database schema:

            Classes                                    Tables
   ==========================               ===============================

   class A {                                Table A
      String b;   // Nullable      ==>         Column b     // Nullable
      String c[]; // Nullable                  Column a_pk  // Not nullable
   }
                                            Table C
                                               Column a_fk  // Not nullable
                                               Column c     // Not nullable

Note that the column used to store C (in property table C) is not nullable. This is because, if no C children are present in A, there are simply no rows in table C.

3.2.5 Mapping Subgroups

References in subgroups are mapped to properties of the parent class, then to columns in the class table, as in the following mapping from DTD to object schema:

            DTD                                     Classes
   =========================              ============================

   <!ELEMENT A (B, (C | D))>              class A {
   <!ELEMENT B (#PCDATA)>        ==>         String b; // Not nullable
   <!ELEMENT C (#PCDATA)>                    String c; // Nullable
   <!ELEMENT D (E, F)>                       D      d; // Nullable
                                          }
   <!ELEMENT D (E, F)>                    class D {
   <!ELEMENT E (#PCDATA)>        ==>         String e;
   <!ELEMENT F (#PCDATA)>                    String f;
                                          }

and object schema to database schema:

           Classes                                     Tables
   ============================               ==========================

   class A {                                  Table A
      String b; // Not nullable      ==>         Column b    // Not nullable
      String c; // Nullable                      Column c    // Nullable
      D      d; // Nullable                      Column d_fk // Nullable
   }
   class D {                                  Table D (
      String e;                       ==>        Column e
      String f;                                  Column f
   }                                             Column d_pk

You might be wondering why this is possible -- that is, what happened to the structure imposed by the subgroup? In fact, this structure appears only in the content model, not in instance documents. For example, both of the following documents conform to the above content model:

   <A>
      <B>bbbbbb</B>
      <C>cccccc</C>
   </A>

   <A>
      <B>bbbbbb</B>
      <D>
         <E>eee</E>
         <F>fff</F>
      </D>
   </A>

There is no way to determine the presence of the subgroup from these documents. Structurally, C and D are indistinguishable from B -- they are just children of A. Thus, they can be mapped like children of A that are not in a subgroup.

One consequence of mapping references in subgroups directly to properties of the parent class is that repeatability and optionality can be indirect. For example, in the following content model, C, D, and E are both optional and repeatable. They are repeatable because the + operator applies to them indirectly. C is optional because it is directly in a choice group, but D and E are optional because they are indirectly in a choice group.

               DTD                                       Classes
   ===============================            =============================

   <!ELEMENT A (B, (C | (D, E))+)>            class A {
   <!ELEMENT B (#PCDATA)>                        String   b;
   <!ELEMENT C (#PCDATA)>            ==>         String[] c; // May be null  
   <!ELEMENT D (E, F)>                           D[]      d; // May be null
   <!ELEMENT E (#PCDATA)>                        String[] e; // May be null
                                              }

3.3 Mapping Mixed Content

Mixed content is just a choice group to which the * operator applies indirectly, except that it can contain PCDATA mixed between child elements. Thus, the element type references in mixed content can be mapped first to properties that are nullable arrays of unknown size, then to property tables.

To see how to map mixed content, consider the following XML document:

   <A>
   This text <c>cc</c> makes
   <b>bbbb</b> no sense
   <c>cccc</c> except as
   <b>bb</b> an example.
   </A>

and then notice that it is essentially the same as the following document, in which PCDATA has been wrapped in <pcdata> elements:

   <A>
   <pcdata>This text </pcdata><c>cc</c><pcdata> makes
   </pcdata><b>bbbb</b><pcdata> no sense
   </pcdata><c>cccc</c><pcdata> except as
   </pcdata><b>bb</b><pcdata> an example.</pcdata>
   </A>

From this, it is easy to see that PCDATA can be treated like any other child element. Thus, PCDATA in mixed content is mapped to a nullable array of unknown size, then to a property table. The following shows how to map mixed content from a DTD to an object schema:

               DTD                                Classes
   ===============================           ===================

                                             class A {
   <!ELEMENT A (#PCDATA | B | C)*>              String[] pcdata;
   <!ELEMENT B (#PCDATA)>            ==>        String[] b;
   <!ELEMENT C (#PCDATA)>                       String[] c;
                                             }

and an object schema to a database schema:

         Classes                                 Tables
   ===================              ===================================

                                                       Table PCDATA
                                                    ------Column a_fk
   class A {                                       /      Column pcdata
      String[] pcdata;              Table A       /    Table B
      String[] b;           ==>        Column a_pk--------Column a_fk
      String[] c;                                 \       Column b
   }                                               \   Table C
                                                    \-----Column a_fk
                                                          Column c

To see what is actually stored in the database, consider the document shown at the start of this section. This is mapped to the following object, then rows in the following tables. (We assume that the system generates a primary key of value 1 for the row in the table for A. This is used to link the row in table A to the rows in the other tables.)

            Objects                                       Tables
   ============================               ===============================

                                                            Table PCDATA
                                                            a_fk  pcdata
                                                            ----  -----------
                                                             1    This text 
                                                             1    makes
   object a {                                                1    no sense
      pcdata = {"This text ",                                1    except as
                " makes ",                    Table A        1    an example.
                " no sense ",                   a_pk       
                " except as",        ==>        ----        Table B
                " an example."}                  1          a_fk   b
      b      = {"bbbb", "bb"}                               ----  ----
      c      = {"cc", "cccc"}                                1    bbbb
   }                                                         1    bb

                                                            Table C
                                                            a_fk   c
                                                            ----  ----
                                                             1    cc
                                                             1    cccc

One of the things that should be readily obvious from this example is that the object-relational mapping is not very efficient at storing mixed content. Because of this, it is more commonly used in data-centric applications, which tend to have little mixed content.

There are two ways to solve this problem. The first is to use a mapping other than the object-relational mapping. For example, if the document is modeled using the DOM or a similar structure and this is mapped to the database with an object-relational mapping, there are far fewer tables in the database -- Document, Element, Attr, Text, etc. -- although a similar number of joins are required to retrieve a document. The second strategy is to not break documents into their smallest possible components but instead to break them into larger pieces, such as chapters or sections. This strategy can be used with the object-relational mapping; for more information, see section 3.6.1, "Mapping Complex Element Types to Scalar Types".

3.4 Mapping Order

This section discusses how the object-relational mapping handles order.

3.4.1 Sibling Order, Hierarchical Order, and Document Order

Sibling means "brother or sister". Thus, sibling elements or PCDATA are elements or PCDATA that have the same parent. In other words, they appear in the same content model. For example, if the document from the previous section is represented as a tree, it is readily apparent which elements are siblings -- those elements at the second level of the hierarchy, which all have A as their parent.

                                   A
        ___________________________|______________________
       |      |    |    |     |      |      |      |      |
   This text  C  makes  B  no sense  C  except as  B  an example
              |         |            |             |
              cc       bbbb         cccc           bb 

Note that the elements at the third level of the hierarchy are not siblings. The reason for this is that they do not share the same parent. This also points out the difference between sibling order, which is the order in which children occur in their parent, and hierarchical order, which is the level at which children appear in a tree representing the document. Different still is document order, which is the order in which elements and text appear in an XML document. For example:

Sibling order (order not shown where there is only one sibling):

                                   A
        ___________________________|______________________
       |      |    |    |     |      |      |      |      |
   This text  C  makes  B  no sense  C  except as  B  an example
       1      2    3    4     5      6      7      8      9
              |         |            |             |
              cc       bbbb         cccc           bb 

Hierarchical order:

   1                                  A
           ___________________________|______________________
          |      |    |    |     |      |      |      |      |
   2  This text  C  makes  B  no sense  C  except as  B  an example
                 |         |            |             |
   3             cc       bbbb         cccc           bb 

Document order:

                                   A
                                   1
        ___________________________|______________________
       |      |    |    |     |      |      |      |      |
   This text  C  makes  B  no sense  C  except as  B  an example
       2      3    5    6     8      9      11     12     14
              |         |            |             |
              cc       bbbb         cccc           bb
              4         7            10            13

According to the XML specification, sibling order is significant. In practice, this depends on the application. For example, in a data-centric application, where an XML document is used to populate an object or a table, sibling order usually does not matter. The reason for this is that object-oriented programming systems have no concept of order among their properties. Similarly, relational databases have no concept of order among their columns. Thus, the sibling order is not significant in either of the following documents:

   <Part>
      <Number>123</Number>
      <Desc>Turkey wrench</Desc>
      <Price>10.95</Price>
   </Part>

   <Part>
      <Price>10.95</Price>
      <Desc>Turkey wrench</Desc>
      <Number>123</Number>
   </Part>

both of which can be mapped to the following object and row in a table:

         Objects                                         Tables
   =========================               ===================================
                                                     Table Parts
   object part {                           -------------------------------
      number = 123                ==>      Number  Desc           Price
      desc = "Turkey wrench"               ------  -------------  -----
      price = 10.95                         123    Turkey wrench  10.95

(A major exception to this is when a data-centric document must match a specific DTD. This occurs when an application must validate documents, such as when they come from an unknown or untrusted source. Although "all groups" in XML Schemas help in this situation by allowing a set of children to appear in any order, they are far from a complete solution since they do not support repeated children.)

On the other hand, in document-centric applications, in which documents are generally designed for human consumption, sibling order is very important. For example, I am likely to like the first review and not the second:

   <Review>
      <p>Ronald Bourret is an
      <b>excellent writer</b>.
      Only an <b>idiot</b>
      wouldn't read his work.</p>
   </Review>

   <Review>
      <p>Ronald Bourret is an
      <b>idiot</b>. Only an
      <b>excellent writer</b>
      wouldn't read his work.</p>
   </Review>

The object-relational mapping can preserve sibling order, as will be seen below, although in practice few products support this. It inherently preserves hierarchical order by mapping references to simple element types to columns in a table, and by mapping references to complex element types to primary key / foreign key relationships. It preserves document order when both hierarchical and sibling order are preserved.

3.4.2 Mapping Sibling Order

Because object-oriented programming systems have no concept of order among their properties, and relational databases have no concept of order among their columns, it is necessary to store sibling order values separately from data values. One way to do this is to introduce separate properties and columns in which to store order values. Another way to do this is to store the order values in the mapping itself.

3.4.2.1 Order Properties and Columns

Order properties and order columns are used to store order values. They are separate from data properties and data columns. One property or column is needed for each referenced element type or PCDATA for which order is deemed important. For example, consider the above mixed content example. The following maps the sibling order in a DTD to order properties:

                DTD                                      Classes
   ===============================               ========================

                                                 class A {
                                                    String[] pcdata;
                                                    int[]    pcdataOrder;
   <!ELEMENT A (#PCDATA | B | C)*>                  String[] b;
   <!ELEMENT B (#PCDATA)>               ==>         int[]    bOrder;
   <!ELEMENT C (#PCDATA)>                           String[] c;
                                                    int[]    cOrder;
                                                 }

and then to order columns:

           Classes                                         Tables
   ========================               ========================================

                                                             Table PCDATA
   class A {                                               -----Column a_fk
      String[] pcdata;                                    /     Column pcdata
      int[]    pcdataOrder;                              /      Column pcdataOrder
      String[] b;                         Table A       /    Table B
      int[]    bOrder;           ==>         Column a_pk--------Column a_fk
      String[] c;                                       \       Column b
      int[]    cOrder;                                   \      Column bOrder
   }                                                      \  Table C
                                                           \----Column a_fk
                                                                Column c
                                                                Column cOrder

Notice that the order properties are stored in tables parallel to the properties that they order.

The following example shows order properties being used to preserve sibling order in the "makes-no-sense" example. One important thing to notice here is that all the order properties share the same order space. That is, an order value that appears in one order property won't appear in another order property.

           Classes                                                Tables
   =================================               =====================================

                                                            Table PCDATA
                                                            a_fk pcdata      pcdataOrder
                                                            ---- ----------- -----------
                                                             1   This text   1
   object a {                                                1   makes       3
      pcdata      = {"This text ",                           1   no sense    5
                     " makes ",                              1   except as   7
                     " no sense ",                 Table A   1   an example. 9
                     " except as",                 a_pk       
                     " an example."}      ==>      ----     Table B
      pcdataOrder = {1, 3, 5, 7, 9}                 1       a_fk  b   bOrder
      b           = {"bbbb", "bb"}                          ---- ---- ------
      bOrder      = {4, 8}                                   1   bbbb 4
      c           = {"cc", "cccc"}                           1   bb   8
      cOrder      = {2, 6}          
   }                                                        Table C
                                                            a_fk  c   cOrder
                                                            ---- ---- ------
                                                             1   cc   2
                                                             1   cccc 6

Although order properties are most commonly used to maintain order in mixed content, they can be used with element content as well. For example, consider the following element type definition. Because B can appear an arbitrary number of times in A, it is stored in a separate property table. Without order properties, there would be no way to determine how to order the B children. (Note that row order cannot be used here, as relational databases are not guaranteed to return rows in any particular order.)

   <!ELEMENT A (B*, C)>
3.4.2.2 Storing Order in the Mapping

In many cases, sibling order is important only because of validation. That is, the application itself does not care about sibling order except to be able to validate a document. This is especially true of element content in data-centric documents. In such cases, it may be sufficient to store order information in the mapping itself.

For example, given the following content model, the mapping could store the information that the children of A are ordered B, then C, then D:

   <!ELEMENT A (B, C, D)>

In practice, there are limitations to storing order information in a mapping. For example, consider the following content model:

   <!ELEMENT A (B?, C, B)>

Constructing a document that matches this content model requires software to first decide how much data is available for constructing B elements. If there is only enough data to construct one B element, it won't construct the first B element, since the second B element is required.

It is unlikely that most software will go to the trouble of doing this. Instead, a reasonable limitation is to support only those content models that group all siblings of the same element type together. This is sufficient for many data-centric content models and can be implemented by storing the position of each element in the content model in the mapping.

For example, order of siblings in the following content models can be mapped this way. Note that in the third content model, Author and Editor can both be assigned the same order value or different values; if they are assigned different values, all elements of one type one will occur before any elements of the other type.

   <!ELEMENT Part (Number, Description, Price)>
   <!ELEMENT Order (Number, CustNum, Date, Item*)>
   <!ELEMENT Book (Title, (Author | Editor)+, Price, Review*)>

When order information is stored only in the mapping, round-tripping of documents is not possible whenever the content model contains more than one element of the same type. For example, consider the following content model:

   <!ELEMENT A (B+, C)>

Although the mapping can tell the software that all B elements must occur before the C element, it cannot specify the order of the B elements. Thus, if data is transferred from a document containing this content model to the database and back again, there is no guarantee that the B elements will occur in the same order as in the original document. Fortunately, this is not often a problem for data-centric documents.

3.5 Mapping Attributes

As was seen earlier, attributes are mapped to scalar properties. This section will discuss the details of that mapping, along with other issues.

3.5.1 Mapping Single- and Multi-Valued Attributes

There are two different classes of attributes: single-valued (CDATA, ID, IDREF, NMTOKEN, ENTITY, NOTATION, and enumerated) and multi-valued (IDREFS, NMTOKENS, and ENTITIES). As might be expected, these map to single-valued properties (and then to columns) and multi-valued properties (and then to property tables). For example:

               DTD                               Classes                    Tables
   ============================                ============               ===========

   <!ELEMENT A (B, C)>                         class A {                  Table A
   <!ATTLIST A                                    String b;                  Column B
             D CDATA #REQUIRED>      ==>          String c;      ==>         Column C
   <!ELEMENT B (#PCDATA)>                         String d;                  Column D
   <!ELEMENT C (#PCDATA)>                      }

and:

               DTD                            Classes                    Tables
   ========================               ==============               ==============

   <!ELEMENT A (B, C)>                    class A {                    Table A
   <!ATTLIST A                               String   b;                  Column a_pk
         D IDREFS #IMPLIED>      ==>         String   c;      ==>         Column b
   <!ELEMENT B (#PCDATA)>                    String[] d;                  Column c
   <!ELEMENT C (#PCDATA)>                 }                            Table D
                                                                          Column a_fk
                                                                          Column d

The order in which attributes occur is not significant according to the XML Information Set. For example, the following two XML documents are considered identical. Because of this, there is no need to use order properties to maintain the order in which attributes occur, although it would certainly be possible to do so.

   <A B="bbb"
      C="ccc"
      D="ddd"/>

   <A C="ccc"
      B="bbb"
      D="ddd"/>

On the other hand, the order in which values occur in multi-valued attributes is considered significant. As was the case with sibling elements and PCDATA, order properties can be used to maintain information about the order in which values in a multi-valued attribute occur. However, there is one important difference between the order properties used for sibling elements and PCDATA and those used for multi-valued attributes, and this is that each order property used for a multi-valued attribute has its own order space. This can be seen in the following example:

        XML                       Objects
   ===============             =========================

                               object a {
   <A B="dd ee ff"                b = {"dd", "ee", "ff"}
      C="gg hh"/>      ==>        bOrder = {1, 2, 3}
                                  c = {"gg", "hh"}
                                  cOrder = {1, 2}
                               }

Alert readers will note that order properties are not strictly necessary at the object level -- array order could be used instead. However, they are needed in relational databases, since there is no concept of row order there.

3.5.2 Mapping ID/IDREF(S) Attributes

ID attributes are used to uniquely identify elements in an XML document. IDREF and IDREFS attributes are used to associate one element with another by referring to the latter element's ID. This is usually done when it is not possible to form this association by nesting one element inside another. For example, consider the following directed graph:

          A
         / \
        B   C
         \ /
          D

This can be represented in an XML document as:

   <A>
      <B ref_d="1">
         ...
      </B>
      <C ref_d="1">
         ...
      </C>
      <D id="1">
         ...
      </D>
   </A>

ID/IDREF(S) attributes map to primary key / foreign key relationships. For example, the above document could be stored in the database with the following tables and columns:

               Table A
                  Column a_pk
                  ...
                /     \
               /       \
   Table B              Table C
      Column a_fk          Column a_fk
      Column ref_d         Column ref_d
      ...                  ...
              \         /
               \       /
               Table D
                  Column a_fk
                  Column id
                  ...

One thing that data transfer software needs to be careful of when storing ID/IDREF(S) attributes in a database is that IDs are only guaranteed to be unique inside a given XML document. Thus, if the data from more than one document is stored in the same table, there is no guarantee that the IDs will be unique. The solution to this problem is to somehow "decorate" the IDs. This could be done by mapping the attributes to two columns, one of which contains a value that is unique to each document and the other of which contains the ID, or by decorating the ID itself, such as prefixing it with a unique value.

A similar problem exists when transferring data from the database to a XML document. If the retrieved data originates from more than one document, the data transfer software needs to ensure that the ID values are unique. This might involve changing one or more values, along with the values of any IDREF(S) attributes that reference them.

Currently, most products do not support ID/IDREF attributes as distinct from other attributes.

3.5.3 Mapping Notation Attributes

Notations are used in XML to alert the application of how an element or unparsed entity is to be treated. For example, the following "xhtml" notation might tell the application that the element contains XHTML and should be displayed by a browser:

   <Desc type="xhtml">
      <html>
      <head><title>Turkey wrench</title></title></head>
      <body>
      <p>A very <b>big</b> turkey wrench.</p>
      </body>
      </html>
   </Desc>

Notation attributes and their values are generally of no interest to the object-relational mapping. That is, they are treated as simply another attribute.

The only exception to this occurs when the notation indicates the data type of the contained text. For example, the notation "base64" might tell the application that an element contains binary data encoded as Base64 (a MIME encoding that maps binary data to a subset of US-ASCII). In most cases, this information will only be of interest to software that generates mappings. For example, it could use the information to map the element type to a binary-valued property and then to a BLOB (Binary Large OBject). In such cases, the mapping itself does not use this information. That is, the mapping from element to BLOB is independent of the fact that a notation contains data type information.

The only exception to this is when the data transfer software is sophisticated enough to switch mappings at run time based on the notation value. In this case, each possible notation is mapped to a data type, which is then used to convert the data.

3.5.4 Mapping ENTITY/ENTITIES Attributes

ENTITY and ENTITIES attributes are used to associate unparsed, external data (such as a binary file) with an XML document. These are mapped the same as any other attribute, except that, when the data is transferred, the entity may be substituted for the attribute value (when transferring data from XML to the database) or a new entity may be created and its identifier stored as the attribute value (when transferring data from the database to XML). Because unparsed entity values can be generated dynamically, it is a good idea for the mapping to specify whether the value or the entity URI is to be stored in the database.

Because unparsed entities always have associated notations, it is possible to use those notations when determining the data type of the entity (either at map time or run time).

3.6 Alternate Mappings

In the previous sections, we have described how to map DTDs to databases. In fact, this description has not been complete, as there are a number of other ways we could have done the mapping. In this section, we will discuss two of the most significant alternatives.

3.6.1 Mapping Complex Element Types to Scalar Types

Although complex element types are normally mapped to classes and then to tables, it is possible to map them to scalar types. That is, references to complex element types can be mapped to scalar properties. The value of such properties is generally the element content, serialized as XML. This is useful when the element's value makes sense only as a whole and shouldn't be broken into smaller parts.

For example, consider an XML document that gives information about a part. If one of the child elements is a part description in XHTML, it probably makes no sense to fragment this any further. As we have already seen, this will result in the data being scattered across numerous tables -- one for italic words, one for bold words, one for words used in hyperlinks, etc. Therefore, it is better to store this description in a single column:

             DTD                               Classes                      Tables
   ===========================              ===============             ==============

   <!ELEMENT Part (Num, Desc)>              class Part {                Table Part
   <!ELEMENT Number (#PCDATA)>      ==>        String num;      ==>        Column num
   <!-- Use Inline entity                      String desc;                Column desc
        from XHTML -->                      }
   <!ELEMENT Desc (%Inline;)>

For example, the following description is stored as follows:

   <Part>
      <Number>127</Number>      Table Part
      <Desc>                    Num  Desc
         A very <b>big</b>  =>  ---  ----------------
         turkey wrench.         127  A very <b>big</>
      </Desc>                        turkey wrench.
   </Part>

Note that storing data as XML does cause problems for data transfer software. In particular, the software cannot distinguish between markup and data. For example, how does the application determine if the <b> in the following text is a <b> element or text?

   An example of the <b> element is <b>this element</b>.

One solution to this is to store actual elements in the database using tags and character data using entity references:

   An example of the &lt;b&gt; element is <b>this element</b>.

The problem with this is that non-XML applications can't then search the database as they might expect.

3.6.2 Mapping Complex Element Types to XML Columns

The problems discussed at the end of the previous section can be solved by mapping complex element types to columns whose data type is XML. The XML data type is a first class data type that is backed by some sort of native XML storage: indexed text, relational tables designed to store XML, or proprietary (non-relational) storage designed specifically to store XML. Because the XML data type is designed specifically to store XML, it distinguishes between markup and character data. Furthermore, data in XML columns can be queried as XML, usually through XQuery or XPath statements embedded in SQL statements.

The XML data type is supported by some relational databases, especially the major commercial relational databases.

3.6.3 Mapping Scalar Properties to Property Tables

Although single, scalar-valued properties are usually mapped to columns, they can be mapped to property tables. This is useful, for example, for storing BLOBs or infrequently used properties in a separate table from the main table. For example:

      Classes                        Tables
   ===============             ==================

   class Part {                Table Parts
      String num;      ==>        Column num
      String desc;
                               Table Descriptions
   }                              Column num
                                  Column desc

3.6 Conclusions

The object-relational mapping works for all XML documents, maps well to objects, and allows non-XML applications to use the data in the database. Because of this, it is a good choice for storing data-centric documents and (not surprisingly) is used as the underlying model in some middleware, most XML-enabled relational databases, and most XML-enabled object servers.

It should be noted that all of these products implement slightly different versions of the object-relational mapping and none implements all possibilities in the mapping. Of the more common parts of the mapping, all map complex element types to classes and references to element types to properties, as well as using primary key / foreign key pairs to join tables. However, some map columns only to PCDATA-only elements, others map columns only to attributes, and still others allow both. Similarly, most of these products do not support sibling order or mixed content, and many do not allow users to change names during the mapping.

The object-relational mapping is a poor choice for document-centric documents. First, it is an inefficient when used with mixed content. Second, like the table-based mapping, it does not preserve physical structure, comments, or processing instructions.

4.0 Generating Schema

We now consider how to generate relational database schema from a DTD according to the object-relational mapping and vice versa. Since there are a number of possible paths through the object-relational mapping, the algorithms here simply choose the most commonly used branch each time there is a choice. For example, single references to PCDATA-only elements can be mapped to a column or a separate property table. Since the most common choice is to map them to a column, the following algorithm generates a column from such references.

For object-oriented databases, the generation process is similar.

4.1 Generating Relational Database Schema from DTDs

Relational schemas are generated by reading through the DTD and processing each element type:

To process a content model:

To process attributes:

The following example shows how this process works. Consider the following DTD:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>
   <!ELEMENT OrderNum (#PCDATA)>
   <!ELEMENT Date (#PCDATA)>
   <!ELEMENT CustNum (#PCDATA)>
                                                          
   <!ELEMENT Item (ItemNum, Quantity, Part)>
   <!ELEMENT ItemNum (#PCDATA)>
   <!ELEMENT Quantity (#PCDATA)>
                                                          
   <!ELEMENT Part (PartNum, Price)>
   <!ELEMENT PartNum (#PCDATA)>
   <!ELEMENT Price (#PCDATA)>

In the first step, we generate tables for complex element types and primary keys for these tables:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>      ==>      Table Order
   <!ELEMENT OrderNum (#PCDATA)>                                      Column OrderPK
   <!ELEMENT Date (#PCDATA)>
   <!ELEMENT CustNum (#PCDATA)>
                                                         
   <!ELEMENT Item (ItemNum, Quantity, Part)>              ==>      Table Item
   <!ELEMENT ItemNum (#PCDATA)>                                       Column ItemPK
   <!ELEMENT Quantity (#PCDATA)>
                                                          
   <!ELEMENT Part (PartNum, Price)>                       ==>      Table Part
   <!ELEMENT PartNum (#PCDATA)>                                       Column PartPK
   <!ELEMENT Price (#PCDATA)>

In the second step, we generate columns for references to simple element types:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>      ==>      Table Order
   <!ELEMENT OrderNum (#PCDATA)>                                      Column OrderPK
   <!ELEMENT Date (#PCDATA)>                                          Column OrderNum
   <!ELEMENT CustNum (#PCDATA)>                                       Column Date
                                                                      Column CustNum
                                                          
   <!ELEMENT Item (ItemNum, Quantity, Part)>              ==>      Table Item
   <!ELEMENT ItemNum (#PCDATA)>                                       Column ItemPK
   <!ELEMENT Quantity (#PCDATA)>                                      Column ItemNum
                                                                      Column Quantity

   <!ELEMENT Part (PartNum, Price)>                       ==>      Table Part
   <!ELEMENT PartNum (#PCDATA)>                                       Column PartPK
   <!ELEMENT Price (#PCDATA)>                                         Column PartNum
                                                                      Column Price

In the final step, we generate foreign keys for references to complex element types:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>      ==>      Table Order
   <!ELEMENT OrderNum (#PCDATA)>                                      Column OrderPK
   <!ELEMENT Date (#PCDATA)>                                          Column OrderNum
   <!ELEMENT CustNum (#PCDATA)>                                       Column Date
                                                                      Column CustNum
                                                          
   <!ELEMENT Item (ItemNum, Quantity, Part)>              ==>      Table Item
   <!ELEMENT ItemNum (#PCDATA)>                                       Column ItemPK
   <!ELEMENT Quantity (#PCDATA)>                                      Column ItemNum
                                                                      Column Quantity
                                                                      Column OrderFK
                                                          
   <!ELEMENT Part (PartNum, Price)>                       ==>      Table Part
   <!ELEMENT PartNum (#PCDATA)>                                       Column PartPK
   <!ELEMENT Price (#PCDATA)>                                         Column PartNum
                                                                      Column Price
                                                                      Column ItemFK

It is important to note that the generated schema is not what a human would have generated. In addition to naming problems (for example, a person might have called the tables Orders, Items, and Parts), the generation algorithm was unable to determine that the relationship between Items and Parts was many-to-one. The algorithm was also unable to recognize that OrderNum and PartNum could be used as primary keys and it could not determine data types and column lengths, although XML Schemas will solve the latter problem. Although no naming collisions occurred or illegal names were generated, both are a possibility.

4.2 Generating DTDs from Database Schema

DTDs are generated by starting from a single "root" table or set of root tables and processing each:

Primary and foreign keys are generated as follows:

During this process, the key used to reach the table (if any) is not processed. This saves the algorithm from generating element types that duplicate those created in the parent table.

The following example shows how this process works. Consider the following database schema:

   Table Orders
      Column OrderNum
      Column Date
      Column CustNum
 
   Table Items
      Column OrderNum
      Column ItemNum
      Column Quantity
      Column PartNum

   Table Parts
      Column PartNum
      Column Price

In our first step, we generate an element type for the root table (Orders):

        Tables                                            DTD
   ==================               ===================================================

   Table Orders            ==>      <!ELEMENT Orders ()>
      Column OrderNum
      Column Date
      Column CustNum
 
   Table Items
      Column OrderNum
      Column ItemNum
      Column Quantity
      Column PartNum

   Table Parts
      Column PartNum
      Column Price

Next, we generate PCDATA-only elements for the data columns (Date and CustNum) and add references to these elements to the content model of the Orders element:

        Tables                                            DTD
   ==================               ===================================================

   Table Orders            ==>      <!ELEMENT Orders (Date, CustNum)>
      Column OrderNum
      Column Date                   <!ELEMENT Date (#PCDATA)>
      Column CustNum                <!ELEMENT CustNum (#PCDATA)>
 
   Table Items
      Column OrderNum
      Column ItemNum
      Column Quantity
      Column PartNum

   Table Parts
      Column PartNum
      Column Price

Now we generate a PCDATA-only element for the primary key (OrderNum) and add a reference to it to the content model ...

        Tables                                            DTD
   ==================               ===================================================

   Table Orders            ==>      <!ELEMENT Orders (Date, CustNum, OrderNum)>
      Column OrderNum               <!ELEMENT OrderNum (#PCDATA)>
      Column Date                   <!ELEMENT Date (#PCDATA)>
      Column CustNum                <!ELEMENT CustNum (#PCDATA)>
 
   Table Items
      Column OrderNum
      Column ItemNum
      Column Quantity
      Column PartNum

   Table Parts
      Column PartNum
      Column Price

... then add an element for the table (Items) to which the primary key is exported, as well as a reference to it in the content model:

        Tables                                            DTD
   ==================               ===================================================

   Table Orders                     <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)>
      Column OrderNum               <!ELEMENT OrderNum (#PCDATA)>
      Column Date                   <!ELEMENT Date (#PCDATA)>
      Column CustNum                <!ELEMENT CustNum (#PCDATA)>
 
   Table Items             ==>      <!ELEMENT Items()>
      Column OrderNum
      Column ItemNum
      Column Quantity
      Column PartNum

   Table Parts
      Column PartNum
      Column Price

We process the data and primary key columns in the remote (Items) table in the same way ...

        Tables                                            DTD
   ==================               ===================================================

   Table Orders                     <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)>
      Column OrderNum               <!ELEMENT OrderNum (#PCDATA)>
      Column Date                   <!ELEMENT Date (#PCDATA)>
      Column CustNum                <!ELEMENT CustNum (#PCDATA)>
 
   Table Items             ==>      <!ELEMENT Items(ItemNum, Quantity)>
      Column OrderNum
      Column ItemNum                <!ELEMENT ItemNum (#PCDATA)>
      Column Quantity               <!ELEMENT Quantity (#PCDATA)>
      Column PartNum

   Table Parts
      Column PartNum
      Column Price

... then add an element for the table (Parts) to which the foreign key corresponds:

        Tables                                            DTD
   ==================               ===================================================

   Table Orders                     <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)>
      Column OrderNum               <!ELEMENT OrderNum (#PCDATA)>
      Column Date                   <!ELEMENT Date (#PCDATA)>
      Column CustNum                <!ELEMENT CustNum (#PCDATA)>
 
   Table Items                      <!ELEMENT Items(ItemNum, Quantity, Parts)>
      Column OrderNum
      Column ItemNum                <!ELEMENT ItemNum (#PCDATA)>
      Column Quantity               <!ELEMENT Quantity (#PCDATA)>
      Column PartNum

   Table Parts             ==>      <!ELEMENT Parts()>
      Column PartNum
      Column Price

Finally, we process the foreign key table (Parts):

        Tables                                            DTD
   ==================               ===================================================

   Table Orders                     <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)>
      Column OrderNum               <!ELEMENT OrderNum (#PCDATA)>
      Column Date                   <!ELEMENT Date (#PCDATA)>
      Column CustNum                <!ELEMENT CustNum (#PCDATA)>
 
   Table Items                      <!ELEMENT Items (ItemNum, Quantity, Parts)>
      Column OrderNum
      Column ItemNum                <!ELEMENT ItemNum (#PCDATA)>
      Column Quantity               <!ELEMENT Quantity (#PCDATA)>
      Column PartNum

   Table Parts             ==>      <!ELEMENT Parts(PartNum, Price)>
      Column PartNum                <!ELEMENT PartNum (#PCDATA)>
      Column Price                  <!ELEMENT Price (#PCDATA)>

As was the case in the previous section, the generated DTD is not what a human would have created. Although the only problems here are with names, the algorithm cannot recognize order columns or property tables.

5.0 Mapping XML Schemas to Databases

Most XML schema languages can be mapped to databases with an object-relational mapping. The exact mappings depend on the language. DDML, DCD, and XML Data Reduced schemas can be mapped in a manner almost identical to DTDs. The mappings for W3C Schemas, Relax, TREX, and SOX appear to be somewhat more complex. It is not clear to me whether Schematron can be mapped.

In the case of W3C Schemas, a complete mapping to object schemas and then to database schemas is available. Briefly, this maps complex types to classes (with complex type extension mapped to inheritance) and maps simple types to scalar data types (although many facets are lost). "All" groups are treated like unordered sequences and substitution groups are treated like choices. Finally, most identity constraints are mapped to keys. For complete details, see Mapping W3C Schemas to Object Schemas to Relational Schemas.

For a wider discussion of XML and databases, see:

For a now out-of-date list of XML database products, see: