Introduction to StAX

Processing XML documents has become a critical and integral part of most applications being developed today. Depending on the environment, there are various ways to process an XML document within a program. All of these can be broadly categorized into two approaches: tree-based parsing (such as DOM) and event-based parsing (such as SAX).
Both of these models give little or no control to the user in this parsing process. Once started, whether tree-based or event-based, both parsing approaches consume the whole data stream at once. JSR 173 defines a pull streaming model, StAX (short for "Streaming API for XML"), for processing XML documents. In this model, unlike in SAX, the client can start, proceed, pause, and resume the parsing process. The client has complete control.
Where will this come in handy? Think about an XML processing engine, like Apache Axis2, where memory and performance are critical parameters. Upon the receipt of a SOAP message, one may need to parse the message depending on some criteria and constrain the parse to various levels. This can easily be done with a parser that gives control to the client.
Now let's look at how this is achieved and how the API accommodates such a model.
The client first creates a parser, supplying the XML to be parsed as a java.io.InputStream, java.io.Reader, or javax.xml.transform.Source. The client then asks the parser to proceed by calling the next() method. Each call to this method returns one of the event codes listed below.
XMLStreamConstants.START_ELEMENT
XMLStreamConstants.END_ELEMENT
XMLStreamConstants.PROCESSING_INSTRUCTION
XMLStreamConstants.CHARACTERS
XMLStreamConstants.COMMENT
XMLStreamConstants.SPACE
XMLStreamConstants.START_DOCUMENT
XMLStreamConstants.END_DOCUMENT
XMLStreamConstants.ENTITY_REFERENCE
XMLStreamConstants.ATTRIBUTE
XMLStreamConstants.DTD
XMLStreamConstants.CDATA
XMLStreamConstants.NAMESPACE
XMLStreamConstants.NOTATION_DECLARATION
XMLStreamConstants.ENTITY_DECLARATION
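As a minimal sketch of this create-then-pull cycle, the class below builds a reader over an in-memory string and records the event returned by each next() call. The class name EventTrace and its helper methods are hypothetical, and the sketch assumes the StAX implementation bundled with the JDK (Java 6 or later) rather than a separately downloaded one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class EventTrace {

    // Hypothetical helper: pulls every event from the parser and records its name.
    static List<String> trace(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> events = new ArrayList<>();
        try {
            // The parser only advances when the client calls next().
            while (reader.hasNext()) {
                events.add(name(reader.next()));
            }
        } finally {
            reader.close();
        }
        return events;
    }

    // Translates a few common event codes into readable names.
    static String name(int eventCode) {
        switch (eventCode) {
            case XMLStreamConstants.START_ELEMENT: return "START_ELEMENT";
            case XMLStreamConstants.END_ELEMENT:   return "END_ELEMENT";
            case XMLStreamConstants.CHARACTERS:    return "CHARACTERS";
            case XMLStreamConstants.COMMENT:       return "COMMENT";
            case XMLStreamConstants.END_DOCUMENT:  return "END_DOCUMENT";
            default:                               return "OTHER(" + eventCode + ")";
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(trace("<a>hi</a>"));
    }
}
```

A freshly created reader is already positioned on START_DOCUMENT, so that event does not show up among the values returned by next().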
Depending on the event, one can get more information by calling the methods appropriate to that event. For example, if next() returns the START_ELEMENT event, then calling getLocalName() will return the local name of the element. Here's a list of the methods that can be called for a given event.
| Event | Valid Methods |
|---|---|
| All states | getProperty(), hasNext(), require(), close(), getNamespaceURI(), isStartElement(), isEndElement(), isCharacters(), isWhiteSpace(), getNamespaceContext(), getEventType(), getLocation(), hasText(), hasName() |
| START_ELEMENT | next(), getName(), getLocalName(), hasName(), getPrefix(), getAttributeXXX(), isAttributeSpecified(), getNamespaceXXX(), getElementText(), nextTag() |
| ATTRIBUTE | next(), nextTag(), getAttributeXXX(), isAttributeSpecified() |
| NAMESPACE | next(), nextTag(), getNamespaceXXX() |
| END_ELEMENT | next(), getName(), getLocalName(), hasName(), getPrefix(), getNamespaceXXX(), nextTag() |
| CHARACTERS | next(), getTextXXX(), nextTag() |
| CDATA | next(), getTextXXX(), nextTag() |
| COMMENT | next(), getTextXXX(), nextTag() |
| SPACE | next(), getTextXXX(), nextTag() |
| START_DOCUMENT | next(), getEncoding(), getVersion(), isStandalone(), standaloneSet(), getCharacterEncodingScheme(), nextTag() |
| END_DOCUMENT | close() |
| PROCESSING_INSTRUCTION | next(), getPITarget(), getPIData(), nextTag() |
| ENTITY_REFERENCE | next(), getLocalName(), getText(), nextTag() |
| DTD | next(), getText(), nextTag() |
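To illustrate how the valid methods depend on the current event, here is a small sketch. The class StateMethods and both helper methods are hypothetical names, and the behavior shown assumes the StAX implementation bundled with the JDK. The first helper calls methods that are valid on START_ELEMENT; the second calls getLocalName() while positioned on CHARACTERS, which the XMLStreamReader documentation says should raise an IllegalStateException.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StateMethods {

    // Hypothetical helper: uses methods that are valid while on START_ELEMENT.
    static String describeRoot(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag();                                  // move to the root START_ELEMENT
            String name = reader.getLocalName();
            String id = reader.getAttributeValue(null, "id");  // null: do not check the namespace
            String text = reader.getElementText();             // reads up to the matching END_ELEMENT
            return name + "[id=" + id + "]: " + text;
        } finally {
            reader.close();
        }
    }

    // Hypothetical helper: reports whether getLocalName() is rejected on CHARACTERS.
    static boolean localNameRejectedOnText(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    try {
                        reader.getLocalName();                 // not valid for this event
                        return false;
                    } catch (IllegalStateException expected) {
                        return true;
                    }
                }
            }
            return false;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(describeRoot("<book id=\"42\">StAX</book>"));
        System.out.println(localNameRejectedOnText("<a>hi</a>"));
    }
}
```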
Now let's see how we can play around a bit with the StAX API. We need to download the StAX API .jar and an implementation of the StAX API. Both are available from Ibiblio; the StAX API .jar is stax-api-1.0.jar. As for the implementation, there are several available. Let's use the Woodstox implementation, available as wstx-asl-2.9.2.jar.
Let's first print events from sample1.xml, which can be found in the sample code folder in the Resources section.
<article:Article xmlns:article="http://www.article.org"
xmlns:author="http://author.org">
<!-- This sample1.xml is used for samples in
"Introducing StAX" article -->
<Name>Introducing StAX</Name>
<author:Author>Eran Chinthaka</author:Author>
<?This_is_some_processing_instruction?>
</article:Article>
First you need to create an instance of XMLStreamReader. The StAX API provides XMLInputFactory to create an instance of XMLStreamReader.
FileInputStream fileInputStream =
new FileInputStream(fileLocation);
XMLStreamReader xmlStreamReader =
XMLInputFactory.newInstance().
createXMLStreamReader(fileInputStream);
Then we need to ask the parser to proceed through each event. XMLStreamReader provides an iterator-like API to check the existence of a next event.
while (xmlStreamReader.hasNext()) {
printEventInfo(xmlStreamReader);
}
xmlStreamReader.close();
This code will iterate until xmlStreamReader has no further events to return. Note that closing the xmlStreamReader instance is not required, but is considered good programming practice.
Now we need to get the events from the parser and call appropriate methods to extract information about the XML.
// Inside printEventInfo(XMLStreamReader reader): pull the next event and inspect it.
int eventCode = reader.next();
switch (eventCode) {
    case XMLStreamConstants.START_ELEMENT:
        System.out.println("event = START_ELEMENT");
        System.out.println("Localname = " + reader.getLocalName());
        break;
    case XMLStreamConstants.END_ELEMENT:
        System.out.println("event = END_ELEMENT");
        System.out.println("Localname = " + reader.getLocalName());
        break;
    case XMLStreamConstants.PROCESSING_INSTRUCTION:
        System.out.println("event = PROCESSING_INSTRUCTION");
        System.out.println("PIData = " + reader.getPIData());
        break;
..............................
..............................
..............................
The interesting thing to note here is that the client must ask the parser to proceed by calling reader.next(). The parser moves to the next event only after that call. This is the main difference between pull and push parsing. In push parsing, as with SAX, once the parser starts sending events, the client application has no control over it. In pull parsing, as seen here, the client application sets the pace of parsing at its own discretion.
Say you want to process only one element of the XML, if present. With this approach you put a simple if statement in the START_ELEMENT handling code and you are done. If you do not want to process any XML after that, you can simply close the stream and forget about it, rather than having to parse all of the XML.
One typical example of this kind of processing is when you relay pieces of XML. Most of the time the intermediary node will look for a particular XML element and will then forward it to the proper destination, without requiring parsing of the whole XML chunk.
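A rough sketch of that "find one element and stop" pattern follows. The class FirstMatch and the method firstText are hypothetical names, and the sketch assumes the JDK's bundled StAX implementation. It pulls events until it meets a START_ELEMENT with the requested local name, returns that element's text, and closes the reader without asking for anything further.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FirstMatch {

    // Hypothetical helper: returns the text of the first element with the given
    // local name, or an empty Optional if the document has no such element.
    static Optional<String> firstText(String xml, String localName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    // Found it: read the text and stop pulling events.
                    return Optional.of(reader.getElementText());
                }
            }
            return Optional.empty();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<article:Article xmlns:article=\"http://www.article.org\""
                + " xmlns:author=\"http://author.org\">"
                + "<Name>Introducing StAX</Name>"
                + "<author:Author>Eran Chinthaka</author:Author>"
                + "</article:Article>";
        System.out.println(firstText(xml, "Author").orElse("(not found)"));
    }
}
```

The match here is on the local name only; a stricter version would also compare getNamespaceURI().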
When you run the above piece of code against sample1.xml, the output will be as follows:
event = START_ELEMENT
Localname = Article
========================
event = COMMENT
Comment = This sample1.xml is used for samples in "Introducing StAX" article
========================
event = START_ELEMENT
Localname = Name
========================
event = CHARACTERS
Characters = Introducing StAX
========================
event = END_ELEMENT
Localname = Name
========================
event = START_ELEMENT
Localname = Author
========================
event = CHARACTERS
Characters = Eran Chinthaka
========================
event = END_ELEMENT
Localname = Author
========================
event = PROCESSING_INSTRUCTION
PIData =
========================
event = END_ELEMENT
Localname = Article
========================
event = END_DOCUMENT
Document Ended
========================
Now let's try to write the same XML to the output using the XMLStreamWriter interface.
In just the same way as we create XMLStreamReader to read XML using the XMLInputFactory, we need to create an instance of XMLStreamWriter using the XMLOutputFactory.
XMLStreamWriter writer = XMLOutputFactory.newInstance().
createXMLStreamWriter(outStream);
Then this writer can be used to write events. For example:
writer.writeStartElement("Name")
writer.writeEndElement()
writer.writeComment("This sample1.xml is used for samples in \"Introducing StAX\" article")
writer.writeNamespace("author", "http://author.org")
writer.writeCharacters("Introducing StAX")
writer.writeProcessingInstruction("This_is_a_processing_instruction")
Having written these events to the XMLStreamWriter, you must flush and close the writer.
writer.flush();
writer.close();
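Putting those calls in order, the following sketch writes a document shaped like sample1.xml into a StringWriter. The class WriteSample and the method writeArticle are hypothetical names, and the exact serialization (for instance attribute quoting) is assumed to be that of the JDK's bundled XMLStreamWriter with default, non-repairing settings, so the namespace declarations are written explicitly.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class WriteSample {

    static final String ARTICLE_NS = "http://www.article.org";
    static final String AUTHOR_NS = "http://author.org";

    // Hypothetical helper: writes a document shaped like sample1.xml to a string.
    static String writeArticle() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        // Root element with its two namespace declarations.
        writer.writeStartElement("article", "Article", ARTICLE_NS);
        writer.writeNamespace("article", ARTICLE_NS);
        writer.writeNamespace("author", AUTHOR_NS);
        writer.writeComment(" This sample1.xml is used for samples in \"Introducing StAX\" article ");

        // <Name>Introducing StAX</Name>
        writer.writeStartElement("Name");
        writer.writeCharacters("Introducing StAX");
        writer.writeEndElement();

        // <author:Author>Eran Chinthaka</author:Author>
        writer.writeStartElement("author", "Author", AUTHOR_NS);
        writer.writeCharacters("Eran Chinthaka");
        writer.writeEndElement();

        writer.writeProcessingInstruction("This_is_some_processing_instruction");
        writer.writeEndElement();   // closes article:Article

        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeArticle());
    }
}
```

Each writeEndElement() closes the most recently opened element, so the writer, like the reader, only ever moves forward.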
StAX contains two distinct APIs to work with XML. One is the cursor API and the other is the iterator API. What we have discussed so far is the cursor API. As you can see, the cursor API always points to one thing at a time, and it always moves forward, never backward. The iterator API, on the other hand, presents the XML stream as a set of event objects. The base interface of the iterator API is XMLEvent, and there are subinterfaces for each event type. The XMLEventReader interface has the following methods to interact with the XML info set.
public XMLEvent nextEvent() throws XMLStreamException;
public boolean hasNext();
public XMLEvent peek() throws XMLStreamException;
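As an illustrative sketch of these three methods, the class below lists the elements whose first child is text. After consuming a start element with nextEvent(), it uses peek() to look at the following event without consuming it. The class TextElements and the method elementsWithText are hypothetical names, and the sketch assumes the JDK's bundled StAX implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class TextElements {

    // Hypothetical helper: local names of elements that are immediately followed by text.
    static List<String> elementsWithText(String xml) throws XMLStreamException {
        XMLEventReader events = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();      // consume the current event object
                if (event.isStartElement()) {
                    XMLEvent upcoming = events.peek();    // look ahead without consuming
                    if (upcoming != null && upcoming.isCharacters()) {
                        names.add(event.asStartElement().getName().getLocalPart());
                    }
                }
            }
        } finally {
            events.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(elementsWithText("<a><b>x</b><c/></a>"));
    }
}
```

Because each event is a self-contained object, it can be stored or handed to other code after the reader has moved on, which the cursor API does not allow.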
More information on the iterator API can be found in Sun's online tutorial.
Most of the applications that process XML benefit from stream parsing and most of the time do not require the entire DOM model in memory. Having mentioned that as the main advantage we have in pull parsing, let's look at the other aspects.
In the end, how does StAX compare with some of the existing XML parsing technologies available today? A table in Jeff Ryan's "Does StAX Belong in Your XML Toolbox?" does a good job of assessing the various approaches, in terms of API style, ease of use, CPU/memory use, and more. Check it out.
This approach of XML processing gives more control to the client application than to the parser, enabling much faster and more memory-efficient processing. This is becoming a standard across different domains of XML processing. For example, Apache Axis2, one of the prominent SOAP processing engines, improved its performance four times, on average, over its predecessor by using a StAX-based XML processing model called Axiom. Axiom is much more memory-efficient and performant than the existing object models available today, due to the usage of StAX as its XML parsing technology.
S. W. Eran Chinthaka is a pioneering member of the Apache Axis2, AXIOM, and Synapse projects, working full time with WSO2 Inc.