
Getting Groovy with XML

August 12, 2004

Contents
Java Straight to the DOM
Simplifying with XPath
Groovy Take One
Groovy and the DOM
Wrapping the XML DOM
Groovy and Object Orientation
Conclusions
Resources

XML sucks. Oh, wait, XML rocks. Well, it actually does a lot of both. It rocks because of all of the editors, validators, and tools written for it. XML has all but replaced any notion of a new custom text-based data language. But it also sucks because it's hard to use. Using a DOM to read and manipulate XML is a pain, and SAX is even worse. XPath helps a little, but even XSLT, the ultimate XML processing tool, is hard to learn, follows an uncommon functional programming paradigm, and is overkill for small problems.

Is there something that we can do to take the pain out of XML? The E4X committee, which is made up of representatives from a bunch of big companies (Mozilla.org, Microsoft, Macromedia, etc.), seems to think so. They have an extension layer proposal for JavaScript (ECMAScript) that will make building and accessing an XML DOM as easy as working with objects using the "dot" notation.

Here is the example XML data file that I will use for all of the examples in this article:

<transactions> 
    <account id="a">
        <transaction amount="500" />
        <transaction amount="1200" />
    </account>
    <account id="b">
        <transaction amount="600" />
        <transaction amount="800" />
        <transaction amount="2000" />
    </account>
</transactions>

Wouldn't it be great if you had this XML document attached to a variable named doc? You could say this:

var id = doc.transactions.account[0].id

And have the id variable set to a. That is what E4X is all about. It's about making XML access simple and easy to understand. The only problem is that E4X hasn't been approved yet, and isn't shipping. So can we make XML simpler today?

When I asked myself that question, I thought of applying my new favorite embedded language, Groovy, to the task. You can judge for yourself how far I have simplified the task, but I hope in the meantime that you will learn something about XML and a lot about Groovy.

What would simpler XML access look like? First, we will use dot notation to traverse the DOM tree instead of accessor methods. Second, node access will default to the first matching child, so you won't have to say which child you want when there is only one. And finally, we will make XPath access simpler through a native method on every node.

In twelve-step programs they have you admit your addiction in order to begin to deal with it. In order to understand how bad using the DOM is, we need to start with a hand-coded example.

Java Straight to the DOM

The example I will use throughout this article is to take the original XML data file and add up all of the transaction amounts by account. We will call this function calculateAmounts, and it should return a hash or a map that has an entry for each account with the correct total. In this case, that means 1700 for account a and 3400 for account b.
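Stripped of all XML handling, the computation itself is tiny. As a rough sketch, here it is with the sample amounts hard-coded in memory; the class name, method names, and the in-memory data layout are illustrative assumptions and not part of the code developed in this article:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BareAlgorithm {
    // Hypothetical in-memory stand-in for the sample XML: account id -> amounts.
    public static Map<String, List<Integer>> sampleData() {
        Map<String, List<Integer>> accounts = new LinkedHashMap<>();
        accounts.put("a", Arrays.asList(500, 1200));
        accounts.put("b", Arrays.asList(600, 800, 2000));
        return accounts;
    }

    // Add up the amounts for each account.
    public static Map<String, Integer> totals(Map<String, List<Integer>> accounts) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> account : accounts.entrySet()) {
            int amount = 0;
            for (int value : account.getValue()) {
                amount += value;   // the actual algorithm
            }
            result.put(account.getKey(), amount);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(totals(sampleData()));   // prints {a=1700, b=3400}
    }
}
```

Everything the XML versions add on top of these two loops is infrastructure for getting at the data.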

The most direct way to get these totals from the XML file is to use the DOM from Java:


import java.util.Hashtable;
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class GroovyXML1
{
    public static Hashtable calculateAmounts( String fileName )
        throws Exception
    {

We first read in the XML file:

        // Read the XML 
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        DocumentBuilder builder =
            factory.newDocumentBuilder();
        Document doc = builder.parse( new File( fileName ) );

        // Initialize the list of account values
        Hashtable accountValues = new Hashtable();

Next we iterate through the account nodes:

        // Get the initial account nodes 
        NodeList accountNodes =
            doc.getChildNodes().item(0).getChildNodes();
        for(int accountNodeIndex = 0;
            accountNodeIndex < accountNodes.getLength();
            accountNodeIndex++ )
        {

One of the problems with DOM access is that we have to account for the white-space nodes that are in the tree. This conditional ensures that we only look at the element nodes.

            // Go only through the account Element nodes 
            if ( accountNodes.item(accountNodeIndex).getNodeType() ==
                                    Node.ELEMENT_NODE )
            {

Because Node doesn't have a convenient accessor to get attributes, we need to cast the Node to an Element before we can get the account id.

              Element accountElement =
                  (Element)accountNodes.item( accountNodeIndex);
              // Get the account ID
              String accountID =
                  accountElement.getAttribute( "id" );

Now we need to iterate through the transaction nodes to add up the amounts.

              // Go through the transaction nodes
              // within the account node
              int amount = 0;
              NodeList transactionNodes =
                  accountElement.getChildNodes();

              for( int transIndex = 0;
                   transIndex < transactionNodes.getLength();
                   transIndex++ )
              {
                 // Go through just the elements
                 if ( transactionNodes.item( transIndex ).getNodeType() ==
                            Node.ELEMENT_NODE )
                 {
                    // Add the amount to the amount counter
                    Element transaction =
                        (Element)transactionNodes.item( transIndex );
                    Integer value =
                        new Integer( transaction.getAttribute( "amount" ) );
                    amount += value.intValue();
                 }
              }

And the final step in the processing is to add the amount to the hash table. Because hash tables only take objects, we need to wrap the total in an Integer object before we can add it to the output.

              // Add the account total to the hash table 
              accountValues.put( accountID, new Integer( amount ) );
           }
        }

        return accountValues;
    }

With the totals in hand, we can print them to see whether we did our math correctly.

    public static void main( String[] args)
        throws Exception
    {
      System.out.println( "Using XML DOM" );
      Hashtable out = calculateAmounts( "test_data.xml" );
      System.out.println( "a = " + out.get( "a") );
      System.out.println( "b = " + out.get( "b") );
    }
}

Of the twenty-odd lines that were involved in getting the results, only two of those were the actual algorithm itself. So it's no surprise when we can't actually see the algorithm forest for all of the infrastructure trees.
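To see why the element-type checks were needed, it can help to count what the parser actually puts in the tree. The following is a small sketch, assuming a default (non-validating) JAXP parser and a cut-down copy of the sample document; the class and method names are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WhitespaceNodes {
    // Cut-down version of the sample file, indented the same way.
    public static final String SAMPLE =
        "<transactions>\n" +
        "    <account id=\"a\" />\n" +
        "    <account id=\"b\" />\n" +
        "</transactions>";

    // Returns { all child nodes of the root, element child nodes of the root }.
    public static int[] countRootChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            int elements = 0;
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                    elements++;
                }
            }
            return new int[] { children.getLength(), elements };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = countRootChildren(SAMPLE);
        // The first number is larger: the indentation shows up as text nodes.
        System.out.println("all children: " + counts[0] + ", elements: " + counts[1]);
    }
}
```

There are only two account elements, but the root has more children than that, because the line breaks and indentation between the tags are text nodes in their own right.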

Perhaps things would be better if we used XPath.







Simplifying with XPath

We need to bring in the XPath API:

import org.apache.xpath.XPathAPI;

Then we can make some changes to the calculateAmounts method:

public static Hashtable calculateAmounts( String fileName )
  throws Exception
  {
  // Read the XML
  DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
  DocumentBuilder builder =
    factory.newDocumentBuilder();
  Document doc = builder.parse( new File( fileName ) );
 
  // Initialize the list of account values
  Hashtable accountValues = new Hashtable();

We make some improvements in the robustness of the code, because XPath now handles finding the account elements:

// Get the initial account nodes 
NodeList accountNodes =
  XPathAPI.selectNodeList( doc, "/transactions/account");

for( int accountNodeIndex = 0;
  accountNodeIndex < accountNodes.getLength();
  accountNodeIndex++ )
  {
  Element accountElement =
    (Element)accountNodes.item( accountNodeIndex );
  // Get the account ID
  String accountID =
    accountElement.getAttribute( "id" );

Note that we didn't have to use the conditional to check whether we were looking at elements, because XPath guarantees that we are only looking at elements.

We can also replace the amount fetcher with an XPath search from the account node down to get the amounts:

// Go through the transaction nodes within the account node 
int amount = 0;
NodeList amountNodes =
  XPathAPI.selectNodeList( accountElement,
  "transaction/@amount" );


  for( int amountIndex = 0;
    amountIndex < amountNodes.getLength();
    amountIndex++ )
     {
     // Add the amount to the amount counter
     String nodeValue = amountNodes.item( amountIndex ).getNodeValue();
     amount += Integer.valueOf( nodeValue ).intValue();
   }

Notice that we are getting just the amount attributes by using the @amount specifier in the XPath. The XPath code is more robust than the original DOM code because it ensures that we are only looking at transaction elements within the account node. The original code would look at any type of node to find an amount attribute. If we added new types of nodes to the account node, we would be in trouble.

// Add the account total to the hash table 
accountValues.put( accountID, new Integer( amount ) );
}
return accountValues;
}
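If you don't have Xalan's org.apache.xpath.XPathAPI on your classpath, the same two queries can be tried with the javax.xml.xpath package that ships with later JDKs. This is a sketch under that substitution, not the code from this article: the class name is hypothetical, the XML is read from a string instead of a file, and checked exceptions are wrapped so the method is easy to call.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTotals {
    public static final String SAMPLE =
        "<transactions>\n" +
        "    <account id=\"a\">\n" +
        "        <transaction amount=\"500\" />\n" +
        "        <transaction amount=\"1200\" />\n" +
        "    </account>\n" +
        "    <account id=\"b\">\n" +
        "        <transaction amount=\"600\" />\n" +
        "        <transaction amount=\"800\" />\n" +
        "        <transaction amount=\"2000\" />\n" +
        "    </account>\n" +
        "</transactions>";

    public static Map<String, Integer> calculateAmounts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Same absolute query as above: only account elements come back.
            NodeList accounts = (NodeList) xpath.evaluate(
                "/transactions/account", doc, XPathConstants.NODESET);

            Map<String, Integer> totals = new LinkedHashMap<>();
            for (int i = 0; i < accounts.getLength(); i++) {
                Element account = (Element) accounts.item(i);
                // Relative query, evaluated with the account element as context.
                NodeList amounts = (NodeList) xpath.evaluate(
                    "transaction/@amount", account, XPathConstants.NODESET);
                int amount = 0;
                for (int j = 0; j < amounts.getLength(); j++) {
                    amount += Integer.parseInt(amounts.item(j).getNodeValue());
                }
                totals.put(account.getAttribute("id"), amount);
            }
            return totals;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(calculateAmounts(SAMPLE));   // prints {a=1700, b=3400}
    }
}
```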

We have taken about five lines out of the code with XPath and made the whole processing system more robust, but the algorithm is still pretty obscured. Can we make it any cleaner using Groovy? I thought you'd never ask.

Groovy Take One

In the spirit of XP and refactoring, I'll keep changing the code until I get the right blend of algorithm and infrastructure. To use Groovy in this process, the first step is to move the calculateAmounts method into the Groovy engine, as illustrated in Figure 1:

Figure 1. Moving the logic into Groovy

The application will create a Groovy scripting shell instance, load it up with our script, and then run a calculateAmounts closure. The Java for this starts here:

import groovy.lang.GroovyShell;
import groovy.lang.Binding;
import groovy.lang.Closure;
import java.io.File;
import java.util.Map;

public class GroovyXML3
{
    public static Map calculateAmounts( String fileName ) throws Exception
    {
        GroovyShell shell =
            new GroovyShell( new Binding() );
        shell.evaluate( new File( "GroovyXML3.groovy" ) );

Here we create the Groovy shell and load the script.

        // Get the calculateAmounts closure 
        Closure calc =
            (Closure)shell.getVariable( "calculateAmounts" );

We then get the calculateAmounts closure.

        // Run the closure on the file name 
        return (Map)calc.call( fileName );
    }

We call that closure with the file name string, and coerce the return value back to a Map. Why the change from a Hashtable to a Map? Because Groovy's native data type for an associative array (hash table) is a Map and not a java.util.Hashtable.

    public static void main( String[] args) throws Exception
    {
        System.out.println( "Using Groovy with a file name" );
        Map out = calculateAmounts( "test_data.xml" );
        System.out.println( "a = " + out.get( "a") );
        System.out.println( "b = " + out.get( "b") );
    }
}

The corresponding Groovy is:

import java.io.File; 
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.xpath.XPathAPI;

Yep, Groovy can directly import Java classes.

calculateAmounts = { fileName | 

Here is where we define the calculateAmounts closure. A closure is a lot like a function. In fact, for the purposes of this article, you can think of it as a function that takes arguments. In this case, the argument is a file name. If you are curious as to what closures are, you might want to read up on functional programming languages such as Haskell, or languages such as LISP or Scheme.

The rest of the Groovy code looks like our original Java algorithm, with the exception that there are no types and that for loops are different:

    factory = DocumentBuilderFactory.newInstance(); 
    builder = factory.newDocumentBuilder();
    doc = builder.parse( new File( fileName ) );
   
    accountValues = [:];
    accountNodes =
        XPathAPI.selectNodeList( doc, "/transactions/account");
    for( accountNodeIndex in 0..(accountNodes.getLength()-1) )
    {

The for loop in Groovy has several different variations. I'm using the variation that iterates over a range of values. In this case, from zero to the number of nodes in the list minus one.

        accountID =
            accountNodes.item( accountNodeIndex ).getAttribute( "id" );
        amount = 0;
        amountNodes = XPathAPI.selectNodeList(
            accountNodes.item( accountNodeIndex ),
            "transaction/@amount" );
        for( amountIndex in 0..(amountNodes.getLength()-1) )
        {
            nodeValue = amountNodes.item( amountIndex ).getNodeValue();
            amount += Integer.valueOf( nodeValue ).intValue();
        }
        accountValues.put( accountID, amount );

Another nice thing about Groovy is that I don't have to handle the int to Integer conversions. I can just concentrate on the math.

    } 
       
    return accountValues;
};

We're down to about 13 lines, with two being the algorithm. Better, but still not very good.
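If the idea of a closure held in a variable is unfamiliar, a loose analogy from later versions of Java (8 and up) is a lambda stored in a field. This is only a sketch of the concept, with made-up names; it is not how the Groovy shell represents closures:

```java
import java.util.function.Function;

public class ClosureAnalogy {
    // A function stored in a variable, loosely like the Groovy script's
    // calculateAmounts = { fileName | ... } assignment.
    public static final Function<int[], Integer> SUM_AMOUNTS = amounts -> {
        int total = 0;
        for (int amount : amounts) {
            total += amount;
        }
        return total;
    };

    public static void main(String[] args) {
        // apply() plays the role that Closure.call() plays in the Java host code.
        System.out.println(SUM_AMOUNTS.apply(new int[] { 500, 1200 }));   // prints 1700
    }
}
```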

One problem that I have had with my approach since the beginning is that the calculateAmounts function handles the reading of the XML data. That doesn't make a lot of sense.







Groovy and the DOM

Can we pass a DOM object into Groovy, as Figure 2 shows?

Figure 2. Injecting the DOM into Groovy

Let's make some changes to the Java file to have it do the DOM reading and then pass it to Groovy:

    public static Map calculateAmounts( Document doc )
    throws Exception
    {
        GroovyShell shell = new GroovyShell( new Binding() );
        shell.evaluate( new File( "GroovyXML4.groovy" ) );

        // Get the calculateAmounts closure
        Closure calc =
            (Closure)shell.getVariable( "calculateAmounts" );

        // Run the closure with the document node
        return (Map)calc.call( doc );
    }

We can call the closure with the Document, just as we did with the file name.

    public static void main( String[] args) throws Exception
    {
        System.out.println( "Using Groovy with a DOM" );

        // Read the XML
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse( new File( "test_data.xml" ) );
        Map out = calculateAmounts( doc );
        System.out.println( "a = " + out.get( "a") );
        System.out.println( "b = " + out.get( "b") );
    }
}

Here is the updated Groovy code that no longer does the DOM-reading work.

import org.apache.xpath.XPathAPI;

calculateAmounts = { doc |
    accountValues = [:];
    accountNodes =
        XPathAPI.selectNodeList( doc, "/transactions/account");
    for( accountNodeIndex in 0..(accountNodes.getLength()-1) )
    {
        accountID =
            accountNodes.item( accountNodeIndex ).getAttribute( "id" );
        amount = 0;
        amountNodes =
            XPathAPI.selectNodeList( accountNodes.item( accountNodeIndex ),
                                     "transaction/@amount" );
        for( amountIndex in 0..(amountNodes.getLength()-1) )
        {
            nodeValue = amountNodes.item( amountIndex ).getNodeValue();
            amount += Integer.valueOf( nodeValue ).intValue();
        }
        accountValues.put( accountID, amount );
    }

    return accountValues;
};

This small change gets us down to 10 lines of code and two lines of algorithm. That's starting to get into the reasonable range. But at the beginning of the article, I talked about creating an easier non-DOM syntax for reading XML.

Wrapping the XML DOM

I wonder if we could wrap the DOM nodes in something that would make them a little easier to use, as in Figure 3:

Figure 3. Wrapping the DOM with our own proxy object


It would be great if we could access the node attributes just as we do properties, like this:

accountNode.id

Then we could execute an XPath search on a node by just calling an XPath method on that node. Let's start by creating a class that implements GroovyObject and wraps a DOM node:

import groovy.lang.GroovyObject; 
import groovy.lang.MetaClass;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.apache.xpath.XPathAPI;
import java.util.Vector;

public class DOMNodeGroovyObject implements GroovyObject
{
    private Node _node;

Our constructor will take a reference to the node:

    DOMNodeGroovyObject( Node node ) { _node = node; } 

Then we will listen for method invocations such as getValue, which will return the text of the node, and xpath, which will run an XPath query from this node down.

    public Object invokeMethod( String arg0, Object arg1) 
    {
        // Compare method names with equals(); == only compares references
        if ( "getValue".equals( arg0 ) )
            return _node.getNodeValue();
       
        if ( "xpath".equals( arg0 ) )
            return xpath( (( Object[])arg1)[0].toString() );
               
        return null;
    }

We will also override getProperty to return the value of any attributes on the node we are wrapping.

    public Object getProperty( String arg0) 
    {
        if ( _node.getNodeType() == Node.ELEMENT_NODE )
        {
            Element elem = (Element)_node;
            if ( elem.hasAttribute( arg0 ) )
                return elem.getAttribute( arg0 );
        }
        return null;
    }
   
    public void setProperty( String arg0, Object arg1) { }
   
    public MetaClass getMetaClass() { return null; }
   
    public void setMetaClass(MetaClass arg0) { }

The xpath method will return a Vector of resulting nodes, each of which will be wrapped with our class instead of being just basic Nodes.

    private Vector xpath( String path ) 
    {
        Vector children = new Vector();
        try {
            NodeList nodes = XPathAPI.selectNodeList(_node,path );
            for( int child = 0; child < nodes.getLength(); child++ )
            {
                Node node = nodes.item( child );
                children.add( new DOMNodeGroovyObject( node ) );
            }
        } catch( Exception e ) {
            // A failed query is ignored; the caller gets whatever was collected
        }
        return children;
    }
}
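The wrapping idea can also be tried without Groovy on the classpath. The sketch below is a plain-Java approximation with hypothetical names: it drops the GroovyObject interface (so there is no property-style access such as accountNode.id, only an attr method), returns a List instead of a Vector, and assumes the JDK's javax.xml.xpath package in place of XPathAPI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeWrapperSketch {
    public static final String SAMPLE =
        "<transactions>"
        + "<account id=\"a\"><transaction amount=\"500\"/><transaction amount=\"1200\"/></account>"
        + "<account id=\"b\"><transaction amount=\"600\"/><transaction amount=\"800\"/>"
        + "<transaction amount=\"2000\"/></account>"
        + "</transactions>";

    private final Node node;

    public NodeWrapperSketch(Node node) { this.node = node; }

    // Attribute lookup, standing in for Groovy property access.
    public String attr(String name) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && ((Element) node).hasAttribute(name)) {
            return ((Element) node).getAttribute(name);
        }
        return null;
    }

    // Run an XPath query from this node down and wrap every result.
    public List<NodeWrapperSketch> xpath(String path) {
        List<NodeWrapperSketch> wrapped = new ArrayList<>();
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(path, node, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                wrapped.add(new NodeWrapperSketch(nodes.item(i)));
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return wrapped;
    }

    public static NodeWrapperSketch parse(String xml) {
        try {
            return new NodeWrapperSketch(DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // The calculation, written against the wrapper only.
    public static Map<String, Integer> calculateAmounts(NodeWrapperSketch doc) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (NodeWrapperSketch account : doc.xpath("/transactions/account")) {
            int amount = 0;
            for (NodeWrapperSketch transaction : account.xpath("transaction")) {
                amount += Integer.parseInt(transaction.attr("amount"));
            }
            totals.put(account.attr("id"), amount);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(calculateAmounts(parse(SAMPLE)));   // prints {a=1700, b=3400}
    }
}
```

Unlike the wrapper above, this sketch rethrows a failed query instead of swallowing it, which makes a mistyped XPath expression easier to spot.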

In order to hook it all up, we need to make a small change to the original Java file to pass the closure a Groovy DOM node and not just a DOM node:

        // Run the closure with the dom node wrapper
        return (Map)calc.call( new DOMNodeGroovyObject( doc ) );
    }

Now the Groovy looks like this:

calculateAmounts = { doc |
    accountValues = [:];
    accountNodes = doc.xpath( "/transactions/account" );
    for( accountNode in accountNodes )
    {
        amount = 0;
        amountNodes = accountNode.xpath( "transaction/@amount" );
        for( amountNode in amountNodes )
        {
            amount +=
                Integer.valueOf( amountNode.getValue() ).intValue();
        }
        accountValues.put( accountNode.id, amount );
    }
       
    return accountValues;
};

Wow! This code is almost becoming readable. We are down to nine lines, two of which are the algorithm. We don't have any Java imports. We are using an xpath method on the node, which makes more sense than calling XPath directly. We can also use the very safe version of for that iterates over a vector. So no more "minus one" stuff on our for loops.

But this procedural stuff -- it doesn't feel right. We are doing the Java thing, right? We should be object-oriented!







Groovy and Object Orientation

Can we make a Groovy object and access it with Java, as in Figure 4?

Figure 4. Referencing a Groovy class directly

How cool would that be? The first step would be to make an interface to which the Groovy code must conform:

public interface GroovyNodeIterator
{
    public Object process( DOMNodeGroovyObject doc );
}

Our simple interface has one method, which takes a DOM wrapper object and returns some sort of object. With this in hand, we can make some final changes to the Java code:

import groovy.lang.GroovyClassLoader; 
...
    public static Map calculateAmounts( Document doc ) throws Exception
    {
        // Create a class loader
        GroovyClassLoader groovyLoader =
            new GroovyClassLoader();

        // Create the shell
        GroovyShell shell =
            new GroovyShell( groovyLoader, new Binding() );

We need to build a GroovyClassLoader, because we will be asking it for one of our Groovy classes. We pass this new class loader to the shell.

        // Load the AmountAdder Groovy class 
        Class adderClass =
            groovyLoader.loadClass( "AmountAdder" );

We then ask the class loader to load up our AmountAdder class.

        // Create an instance of it 
        GroovyNodeIterator obj =
            (GroovyNodeIterator)adderClass.newInstance();

And we create an instance of that class. No kidding. Seriously. We are creating an object that looks and feels just like a Java object, but it's really been written in Groovy.

        // Run the process method 
        return (Map)obj.process( new DOMNodeGroovyObject( doc ) );
    }

We then call the process method and get the return value. We can call this Groovy object just as we would any Java object.
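The load-by-name, instantiate, and cast-to-an-interface pattern is not specific to Groovy. As a plain-Java analogue, the sketch below uses Class.forName where the code above uses GroovyClassLoader.loadClass; the interface and class names are hypothetical stand-ins for GroovyNodeIterator and AmountAdder, and the "processing" is deliberately trivial.

```java
public class LoadByNameSketch {
    // Hypothetical stand-in for the GroovyNodeIterator interface.
    public interface Processor {
        Object process(String input);
    }

    // Hypothetical stand-in for the AmountAdder class.
    public static class LengthProcessor implements Processor {
        public Object process(String input) {
            return Integer.valueOf(input.length());
        }
    }

    // Load a class by name, create an instance, and use it through the interface.
    public static Object loadAndProcess(String className, String input) {
        try {
            Class<?> loaded = Class.forName(className);
            Processor processor =
                (Processor) loaded.getDeclaredConstructor().newInstance();
            return processor.process(input);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String name = LengthProcessor.class.getName();
        System.out.println(loadAndProcess(name, "test_data.xml"));   // prints 13
    }
}
```

The calling code only knows the interface; which class stands behind it is decided at run time by a name, which is what lets a script-defined class slot in.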

The Groovy code for the AmountAdder class is shown below:

class AmountAdder implements GroovyNodeIterator 
{
    public Object process( DOMNodeGroovyObject doc )
    {
        accountValues = [:];
        for( accountNode in doc.xpath( "/transactions/account" ) )
        {
            accountValues[ accountNode.id ] = 0;
            for( transNode in accountNode.xpath( "transaction" ) )
            {
                accountValues[ accountNode.id ] +=
                    Integer.valueOf( transNode.amount );
            }
        }
        return accountValues;
    }
}

The process method has to be defined exactly as it would be in Java; otherwise, we will get an AbstractMethodError, because the class will not have implemented everything in the interface.

Inside of the method, we have simplified the algorithm as much as we can. It's down to six lines, two of which are the algorithm. That's 33 percent, which is really good. Certainly, the logic of the algorithm is now more visible than it was before.

But the most important thing we have learned is that Groovy is really cool. It can implement Java interfaces and create bytecode that can be executed by Java directly, which means that you can have a reference to a Groovy object that looks and works exactly like a Java object.

Conclusions

XML is pretty cool stuff, and the more we use it, the more we need to think about how to make it easier to use. We need to think about ways to reduce the XML infrastructure work for ourselves and our customers.

Now, I don't seriously think that people will start replacing XSLT with Groovy code. But I do think that this article should spur your imagination with ideas about how you can use a flexible scripting language like Groovy to simplify your application and to make it more extensible. When integrating Groovy into your application is as simple as creating an interpreter and asking the class loader for a class, you have to think about all of the things you can do in the script layer as part of rapid prototyping, things that can then be brought back into the Java layer for efficiency.

Adding dynamic scripting-language support to your application is at your fingertips. Give it a go!

Resources

Jack Herrington is a software engineer with over twenty years of experience on numerous platforms and languages.