Ivor Horton's Beginning Java 2, JDK 5 Edition
by Ivor Horton
December 2004, Paperback


Excerpt from Ivor Horton's Beginning Java 2, JDK 5 Edition

Programming with XML Documents

An XML processor is a software module that an application uses to read an XML document and gain access to the data and its structure. An XML processor parses the contents of a document and makes the elements, together with their attributes and content, available to the application, so it is also referred to as an XML parser. In case you haven't met the term before, a parser is just a program module that breaks down text in a given language into its component parts. A natural language processor would have a parser that identifies the grammatical segments in each sentence. A compiler has a parser that identifies variables, constants, operators, and so on in a program statement. An application accesses the content of a document through an API provided by an XML parser, and the parser does the job of figuring out what the document consists of.

Java supports two complementary APIs for processing an XML document:

  • SAX, which is the Simple API for XML parsing
  • DOM, which is the Document Object Model for XML

The support in JDK 5.0 is for DOM level 3 and for SAX version 2.0.2. JDK 5.0 also supports XSLT version 1.0, where XSL is the Extensible Stylesheet Language and T is Transformations, a language for transforming one XML document into another, or into some other textual representation such as HTML. However, this excerpt concentrates on the basic application of DOM and SAX. XSLT is such an extensive topic that there are several books devoted entirely to it.

Let's look at the broad differences between SAX and DOM, and get an idea of the circumstances in which you might choose to use one rather than the other.

SAX Processing

SAX uses an event-based process for reading an XML document that is implemented through a callback mechanism. This is very similar to the way in which you handle GUI events in Java. As the parser reads a document, each parsing event, such as recognizing the start or end of an element, results in a call to a particular method associated with that event. Such a method is often referred to as a handler. It is up to you to implement these methods to respond appropriately to the event. Each of your methods then has the opportunity to react, in any way that you wish, to the event that caused it to be called. In Figure 1 you can see the events that would arise from a previous XML document example.

Figure 1

Each type of event results in a different method in your program being called. There are, for example, different events for registering the beginning and end of a document. You can also see that the start and end of each element result in two further kinds of events, and another type of event occurs for each segment of document data. Thus, this particular document will result in five different methods in your program being called, one for each type of event, and some of them more than once.
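To make the five kinds of events concrete, here is a minimal sketch that records each callback as it occurs. The class name SaxEventLog, the helper method eventsFor(), and the sample document are assumptions made up for illustration; they are not the document shown in Figure 1. The sketch extends DefaultHandler from the org.xml.sax.helpers package, which supplies do-nothing versions of the callbacks so that you override only the ones you need.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records each SAX callback in the order it occurs
class SaxEventLog extends DefaultHandler {
  final List<String> events = new ArrayList<>();

  @Override
  public void startDocument() { events.add("startDocument"); }

  @Override
  public void endDocument() { events.add("endDocument"); }

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes attrs) {
    events.add("startElement " + qName);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    events.add("endElement " + qName);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // A parser may deliver text in more than one chunk; whitespace-only
    // chunks are skipped here to keep the log short
    String text = new String(ch, start, length).trim();
    if (!text.isEmpty()) {
      events.add("characters " + text);
    }
  }

  // Parses the XML held in a string and returns the recorded events
  static List<String> eventsFor(String xml) {
    SaxEventLog log = new SaxEventLog();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), log);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return log.events;
  }

  public static void main(String[] args) {
    for (String event : eventsFor("<circle><radius>15</radius></circle>")) {
      System.out.println(event);
    }
  }
}
```

Running this should list a startDocument event first and an endDocument event last, with the element and character events nested between them in document order.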

Because of the way SAX works, your application inevitably receives the document a piece at a time, with no representation of the whole document. This means that if you need to have the whole document available to your program with its elements and content properly structured, you have to assemble it yourself from the information supplied piecemeal to your callback methods.

Of course, it also means that you don't have to keep the entire document in memory if you don't need it, so if you are just looking for particular information from a document, all <phonenumber> elements, for example, you can just save those as you receive them through the callback mechanism, and discard the rest. As a consequence, SAX is a particularly fast and memory efficient way of selectively processing the contents of an XML document.
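The selective approach might look something like the following sketch, which keeps only the text of phonenumber elements and ignores everything else. The class name PhoneNumberCollector, the collect() helper, and the address book document are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: saves <phonenumber> content and discards the rest
class PhoneNumberCollector extends DefaultHandler {
  private final List<String> numbers = new ArrayList<>();
  private StringBuilder current;    // Non-null only inside <phonenumber>

  @Override
  public void startElement(String uri, String localName,
                           String qName, Attributes attrs) {
    if ("phonenumber".equals(qName)) {
      current = new StringBuilder();
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (current != null) {          // Text outside <phonenumber> is ignored
      current.append(ch, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("phonenumber".equals(qName)) {
      numbers.add(current.toString().trim());
      current = null;
    }
  }

  // Parses the XML held in a string and returns the phone numbers found
  static List<String> collect(String xml) {
    PhoneNumberCollector handler = new PhoneNumberCollector();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return handler.numbers;
  }

  public static void main(String[] args) {
    String xml = "<addressbook>"
        + "<entry><name>Ann</name><phonenumber>555-1234</phonenumber></entry>"
        + "<entry><name>Bob</name><phonenumber>555-9876</phonenumber></entry>"
        + "</addressbook>";
    System.out.println(collect(xml));
  }
}
```

At no point does the handler hold more than the text of one element plus the list of results, which is why this style of processing stays economical even for a very large document.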

First of all, SAX itself is not an XML document parser; it is a public domain definition of an interface to an XML parser, where the parser is an external program. The public domain part of the SAX API is in three packages that are shipped as part of the JDK:

  • org.xml.sax — This defines the Java interfaces specifying the SAX API and the InputSource class that encapsulates a source of an XML document to be parsed.
  • org.xml.sax.helpers — This defines a number of helper classes for interfacing to a SAX parser.
  • org.xml.sax.ext — This defines interfaces representing optional extensions to SAX2 to obtain information about a DTD, or to obtain information about comments and CDATA sections in a document.

In addition to these, the javax.xml.parsers package provides factory classes that you use to gain access to a parser, and the javax.xml.transform package defines interfaces and classes for XSLT 1.0 processing of an XML document.

In Java terms there are several interfaces involved. The XMLReader interface defined in the org.xml.sax package is the interface that a SAX parser implements. The ContentHandler interface in the same package specifies the methods that the SAX parser will call as it recognizes elements, attributes, and other components of an XML document. You must provide a class that implements these methods and responds to the method calls in the way that you want.

DOM Processing

DOM works quite differently than SAX. When an XML document is parsed, the whole document tree is assembled in memory and returned to your application as an object of type Document that encapsulates it, as Figure 2 illustrates.

Figure 2

Once you have the Document object available, you can call the Document object's methods to navigate through the elements in the document tree starting with the root element. With DOM, the entire document is available for you to process as often and in as many ways as you want. This is a major advantage over SAX processing. The downside to this is the amount of memory occupied by the document. There is no choice: you get it all, no matter how big it is. With some documents the amount of memory required may be prohibitively large.
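As a rough sketch of what navigating the tree involves, the following code parses a small document and then visits it twice: once by walking the children of the root element, and once by searching for elements by name. The class name DomWalk, its helper methods, and the sample document are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: builds a DOM tree and navigates it
class DomWalk {
  // Parses the XML held in a string into an in-memory Document
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns the names of the elements that are direct children of the root
  static List<String> childElementNames(Document doc) {
    List<String> names = new ArrayList<>();
    NodeList children = doc.getDocumentElement().getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        names.add(child.getNodeName());
      }
    }
    return names;
  }

  public static void main(String[] args) {
    Document doc = parse("<sketch><circle/><line/><circle/></sketch>");
    System.out.println("Root: " + doc.getDocumentElement().getNodeName());
    System.out.println("Children: " + childElementNames(doc));
    // The same tree can be queried again without reparsing
    System.out.println("Circles: "
        + doc.getElementsByTagName("circle").getLength());
  }
}
```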

DOM has one other unique advantage over SAX. It allows you to modify existing documents or create new ones. If you want to create an XML document programmatically and then transfer it to an external destination such as a file or another computer, DOM is the API for this since SAX has no direct provision for creating or modifying XML documents.
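Here is a minimal sketch of creating a document from scratch with DOM. The class name DomCreate, the buildAddress() helper, and the element names are invented for the example. Writing the result out as text is done here with an identity transformation from the javax.xml.transform package, which is one common way to serialize a Document object rather than the only one.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: creates a document in memory and serializes it
class DomCreate {
  static String buildAddress() {
    try {
      // Start with an empty document and add elements to it
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().newDocument();
      Element address = doc.createElement("address");
      doc.appendChild(address);

      Element phone = doc.createElement("phonenumber");
      phone.setAttribute("type", "home");
      phone.appendChild(doc.createTextNode("555-1234"));
      address.appendChild(phone);

      // An identity transformation copies the tree to the output as text
      Transformer identity = TransformerFactory.newInstance().newTransformer();
      identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      identity.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(buildAddress());
  }
}
```

To send the document to a file instead, you could pass a File object to the StreamResult constructor in place of the StringWriter.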

Accessing Parsers

The javax.xml.parsers package defines four classes supporting the processing of XML documents:

  • SAXParserFactory: Enables you to create a configurable factory object that you can use to create a SAXParser object encapsulating a SAX-based parser
  • SAXParser: Defines an object that wraps a SAX-based parser
  • DocumentBuilderFactory: Enables you to create a configurable factory object that you can use to create a DocumentBuilder object encapsulating a DOM-based parser
  • DocumentBuilder: Defines an object that wraps a DOM-based parser

All four classes are abstract. This is because JAXP is designed to allow different parsers and their factory classes to be plugged in. Both DOM and SAX parsers are developed independently of the JDK, so it is important to be able to integrate new parsers as they come along. The Xerces parser that is currently distributed with the JDK is controlled and developed by the Apache Project, and it provides a very comprehensive range of capabilities. However, you may want to take advantage of the features provided by other parsers from other organizations, and JAXP allows for that.

These abstract classes act as wrappers for the specific factory and parser objects that you need to use for a particular parser and insulate your code from a particular parser implementation. An instance of a factory object that can create an instance of a parser is created at run time, so your program can use a different parser without changing or even recompiling your code. Now that you have a rough idea of the general principles, let's get down to specifics and practicalities, starting with SAX.
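You can see this insulation at work on the DOM side with a short sketch like the one below, which prints the names of the concrete classes hiding behind the abstract DocumentBuilderFactory and DocumentBuilder types. The class name ShowDomImplementation and the implementationOf() helper are hypothetical. The names printed depend on which parser is plugged in; one of the ways JAXP chooses a factory is by consulting the javax.xml.parsers.DocumentBuilderFactory system property.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical demo class: reveals the concrete classes behind JAXP types
class ShowDomImplementation {
  // Returns the name of the run-time class of a JAXP object
  static String implementationOf(Object jaxpObject) {
    return jaxpObject.getClass().getName();
  }

  public static void main(String[] args) {
    // The code refers only to the abstract types
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    System.out.println("Factory class: " + implementationOf(dbf));
    try {
      DocumentBuilder builder = dbf.newDocumentBuilder();
      System.out.println("Builder class: " + implementationOf(builder));
    } catch (ParserConfigurationException e) {
      e.printStackTrace(System.err);
      System.exit(1);
    }
  }
}
```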

Using SAX

To process an XML document with SAX, you first have to establish contact with the parser that you want to use. The first step toward this is to create a SAXParserFactory object like this:

SAXParserFactory spf = SAXParserFactory.newInstance();

The SAXParserFactory class is defined in the javax.xml.parsers package along with the SAXParser class that encapsulates a parser. The SAXParserFactory class is abstract but the static newInstance() method will return a reference to an object of a class type that is a concrete implementation of SAXParserFactory. This will be the factory object for creating an object encapsulating a SAX parser.

Before you create a parser object, you can condition the capabilities of the parser object that the SAXParserFactory object will create. For example, the SAXParserFactory object has methods for determining whether the parser that it will attempt to create will be namespace aware or will validate the XML as it is parsed:

  • isNamespaceAware(): Returns true if the parser to be created is namespace aware, and false otherwise
  • isValidating(): Returns true if the parser to be created will validate the XML during parsing, and false otherwise

You can set the factory object to produce namespace aware parsers by calling its setNamespaceAware() method with an argument value of true. An argument value of false sets the factory object to produce parsers that are not namespace aware. A parser that is namespace aware can recognize the structure of names in a namespace, with a colon separating the namespace prefix from the name. A namespace aware parser can report the URI and local name separately for each element and attribute. A parser that is not namespace aware will report only an element or attribute name as a single name even when it contains a colon. In other words, a parser that is not namespace aware will treat a colon as just another character that is part of a name.
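The difference is easiest to see by parsing the same document both ways. The sketch below reports the three name arguments that the parser passes to the startElement() callback for the root element. The class name NamespaceDemo, the describeRoot() helper, and the namespace URI http://example.com/sketch are assumptions made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: compares namespace aware and unaware parsing
class NamespaceDemo {
  // Parses xml and returns how the root element's name was reported
  static String describeRoot(String xml, boolean namespaceAware) {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setNamespaceAware(namespaceAware);
    final StringBuilder report = new StringBuilder();
    try {
      spf.newSAXParser().parse(new InputSource(new StringReader(xml)),
          new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
              if (report.length() == 0) {      // Record the root element only
                report.append("uri=[").append(uri)
                      .append("] localName=[").append(localName)
                      .append("] qName=[").append(qName).append("]");
              }
            }
          });
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return report.toString();
  }

  public static void main(String[] args) {
    String xml = "<sk:sketch xmlns:sk=\"http://example.com/sketch\"/>";
    System.out.println("Aware:     " + describeRoot(xml, true));
    System.out.println("Not aware: " + describeRoot(xml, false));
  }
}
```

With namespace awareness switched on, the URI and the local name sketch arrive as separate values. With it switched off, the only dependable value is the qualified name sk:sketch, colon included.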

Similarly, calling the setValidating() method with an argument value of true will cause the factory object to produce parsers that can validate the XML as a document is parsed. A validating parser can verify that the document body has a DTD or a schema, and that the document content is consistent with the DTD or schema identified within the document.
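As a sketch of validation in action, the following code parses two documents that carry the same internal DTD, one that conforms to it and one that does not. A validating parser reports each validity problem by calling the error() method of your handler, so the sketch simply collects the messages. The class name ValidationDemo, the validationErrors() helper, and the circle DTD are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects validation errors from a validating parser
class ValidationDemo {
  // Parses xml with validation on and returns the error messages reported
  static List<String> validationErrors(String xml) {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(true);
    final List<String> errors = new ArrayList<>();
    try {
      spf.newSAXParser().parse(new InputSource(new StringReader(xml)),
          new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
              errors.add(e.getMessage());   // Validity errors arrive here
            }
          });
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return errors;
  }

  public static void main(String[] args) {
    String dtd = "<!DOCTYPE circle ["
        + "<!ELEMENT circle (radius)>"
        + "<!ELEMENT radius (#PCDATA)>"
        + "]>";
    String valid = dtd + "<circle><radius>15</radius></circle>";
    String invalid = dtd + "<circle><diameter>30</diameter></circle>";
    System.out.println("Valid document errors:   " + validationErrors(valid));
    System.out.println("Invalid document errors: " + validationErrors(invalid));
  }
}
```

Note that the default error() method inherited from DefaultHandler does nothing, so if you do not override it, validity errors pass unnoticed even with a validating parser.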

You can now use the SAXParserFactory object to create a SAXParser object as follows:

SAXParser parser = null;
try {
  parser = spf.newSAXParser();
} catch(SAXException e) {
  e.printStackTrace(System.err);
  System.exit(1);
} catch(ParserConfigurationException e) {
  e.printStackTrace(System.err);
  System.exit(1);
}

The SAXParser object that you create here will encapsulate the parser supplied with the JDK. The newSAXParser() method for the factory object can throw the two exceptions you are catching here. A ParserConfigurationException will be thrown if a parser cannot be created consistent with the configuration determined by the SAXParserFactory object, and a SAXException will be thrown if any other error occurs. For example, if you call setValidating(true) and the parser does not have the capability for validating documents, a ParserConfigurationException would be thrown. This should not arise with the parser supplied with the JDK though, because it supports both namespace awareness and validation.

The ParserConfigurationException class is defined in the javax.xml.parsers package and the SAXException class is in the org.xml.sax package. Now let's see what the default parser is by putting the code fragments you have looked at so far together in a working example.

Try It Out — Accessing a SAX Parser

Here's the code to create a SAXParser object and output some details about it to the command line:

import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
public class TrySAX {
  public static void main(String args[]) {
    // Create factory object
    SAXParserFactory spf = SAXParserFactory.newInstance(); 
    System.out.println("Parser will "+(spf.isNamespaceAware()?"":"not ") + 
                       "be namespace aware");
    System.out.println("Parser will "+(spf.isValidating()?"":"not ") +
                       "validate XML");

    SAXParser parser = null;               // Stores parser reference
    try {
     parser = spf.newSAXParser();          // Create parser object
    } catch(ParserConfigurationException e) {
      // Thrown if a parser cannot be created that is
      // consistent with the configuration in spf
      e.printStackTrace(System.err);
      System.exit(1);    
    } catch(SAXException e) {             // Thrown for any other error
      e.printStackTrace(System.err);
      System.exit(1);    
    } 

    System.out.println("Parser object is: "+ parser);
  }
}

When you run this, you get the following output:

Parser will not be namespace aware
Parser will not validate XML
Parser object is: com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl@118f375

How It Works

The output shows that the default configuration for the SAX parser produced by the SAXParserFactory object spf will be neither namespace aware nor validating. The parser supplied with the JDK is the Xerces parser from the XML Apache Project. This parser implements the W3C standard for XML, the de facto SAX2 standard, and the W3C DOM standard. It also provides support for the W3C standard for XML Schema. You can find detailed information on the advantages of this particular parser on the http://xml.apache.org web site.

The code to create the parser works as already discussed. Once you have an instance of the factory object, you use it to create an object encapsulating the parser. Although the reference is returned as type SAXParser, the object is of type SAXParserImpl, which is a concrete implementation of the abstract SAXParser class for a particular parser.

The Xerces parser is capable of validating XML and can be namespace aware. All you need to do is specify which of these options you require by calling the appropriate method for the factory object. You can set the parser configuration for the factory object spf so that you get a validating and namespace aware parser by adding two lines to the program:

// Create factory object
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(true);

If you compile and run the code again, you should get output something like:

Parser will be namespace aware
Parser will validate XML
Parser object is: com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl@867e89

You arrive at a SAXParser instance without tripping any exceptions, and you clearly now have a namespace aware and validating parser. By default the Xerces parser will validate an XML document with a DTD. To get it to validate a document with an XML Schema, you need to set another option for the parser.