Skip to main content

From Writing Programs to Creating Compilers

October 6, 2004

{cs.r.title}












Contents
Start Here: A Readymade Grammar for Java
Modifying the Java Grammar
Producing Output
Going
Deeper, Changing ClassDeclaration
Finally,
the AJ Specialty: TaskDeclaration
We're
Ready to Test!
Life
Beyond AJ
References

In this article we build a simple compiler that augments Java
with tasks (independent blocks of code that execute in
parallel), thus creating a new language called AJ that well supports
the programming of systems with concurrent activities.
This is not to imply that Java does not support multithreading, but
to create a syntax that enables the clear expression of concurrency
requirements. (AJ programs do, in fact, compile into Java programs
that use java.lang.Thread objects to support tasks.)
You will see how to use a completely visual, user-friendly parser
generator, VisualLangLab, to quite easily build a compiler for AJ.

Start Here: A Readymade Grammar for Java

VisualLangLab is a freely downloadable (GPL) java.net project. To install it, just
unzip the distribution file (VisualLangLab.zip) into a
directory. The visual parser generator (see Figure 1)
is started by clicking on VisualLangLab.jar.
The only other thing to keep in mind is that
VisualLangLab.jar must be on the classpath when compiling or
executing the generated parser.


Figure 1. The VisualLangLab GUI

Figure 1. The VisualLangLab GUI

The heart of any compiler project is the grammar--a typically
cryptic piece of text that serves as a specification for the language.
Since we wish to augment the capabilities of an existing language (Java),
it is simpler to start with an existing grammar for the base language.
The starting point for this exercise will therefore be a Java 1.4
grammar included in the VisualLangLab distribution (the file
java14C.lkv). Well-written grammars, like well-written programs,
are divided into smaller units called rules. A few sample rules
from the Java 1.4 grammar (as displayed by the VisualLangLab GUI) are
shown in Figure 2 (IfStatement), Figure 3 (SwitchStatement)
and Figure 4 (UnaryExpression).

Figure 2. Rule IfStatement in Java 1.4 grammar

Figure 2. The rule IfStatement in Java 1.4 grammar

Figure 3. The rule <code class=SwitchStatement in Java 1.4 grammar" width="208" height="199"/>

Figure 3. The rule SwitchStatement in Java 1.4 grammar

Figure 4. The rule <code class=UnaryExpression in Java 1.4 grammar" width="207" height="201"/>

Figure 4. The rule UnaryExpression in Java 1.4 grammar

Note that unlike other parser generator tools, a VisualLangLab grammar
is visual (trees with intuitive icons that can be
viewed and edited using interactive menus and
commands). The .lkv files produced and used by VisualLangLab are not text
files, and therefore not human-readable. These files contain
everything: token definitions, parser rules, and any
semantic actions defined in the grammar. A grammar file (such as
java14C.lkv) can be opened using File -> Open.
When a grammar file is opened, the top-level rule (or start symbol)
is displayed first (see Figure 5). The combo box at the top of the tree display
(or the Goto item on the grammar tree's pop-up menu) can be
used to display other rules. Refer to the documentation (docs/index.html
in the distribution kit) for more details.

Modifying the Java Grammar

Readers unfamiliar with the terms and technology used here can find
help in the books cited below, but most should be able to follow the
discussion even without a deep background in compiler technology.
Running VisualLangLab and actually performing the steps described will help.
The VisualLangLab GUI allows readers to find
their way around quite easily. Obviously, having the documentation
(docs/index.html in the distribution kit) around also helps.

All grammars have a start symbol (equivalent to a Java application's
main function) that is the entry point to the grammar. The Java 1.4
grammar's start symbol is called CompilationUnit, and appears as
shown in Figure 5. As you can see, the start symbol's root node is
distinguished by rendering it on a green background.

Figure 5. CompilationUnit (the Java 1.4 start symbol)

Figure 5. CompilationUnit (the Java 1.4 start symbol)

Before we move on, we must handle a fundamental issue--the grammar in
java14C.lkv has no semantic actions (application-specific code to be
executed when a rule or rule element matches the input).
This means that a parser created from it will verify that its input
complies with the grammar, but do nothing else (not even produce any
output). So the first change we will make is to add semantic actions
that output the required generated code.

Our approach to generating output will be based on the two following guiding
principles:

  • Identify sub-trees (at high levels of the grammar) that will remain
    unchanged in AJ. Since the AJ compiler's generated code
    is plain (sans-"task") Java, the generated code for
    such sub-trees can be produced by a semantic action that merely outputs
    a copy of the part of the input text that matches the sub-tree.
  • Identify sub-trees (at high levels) that will change in AJ.
    These sub-trees are left without semantic actions--allowing
    code-generation responsibility to devolve (recursively) to
    the lowest possible levels in the rule tree.
    Custom code-generating logic (as semantic actions associated with
    the rule elements) is added to suitable sub-trees at the
    lower levels of the grammar tree to translate AJ constructs
    into the required plain Java code.

Further, to keep things simple, all generated-code is just written out to
the standard output (java.lang.System.out). Command-line I/O redirection,
as described below, can be used to get the generated code into a file.

Producing Output

Before we begin, let's discuss the apex rule (CompilationUnit
depicted in Figure 5) in more
detail. The locomotive icon at the root indicates that the child nodes are
a sequence (they must follow each other in the given order). The blue arrows
(used for PackageDeclaration, ImportDeclaration, and
TypeDeclaration) represent references to other grammar rules, and
the orange slotted disk (EOF) represents a token (an atomic
piece of the input (e.g., if, 3.25,
while, and +=, which a parser never has to break
down any further). EOF is a pseudo-token representing the end of
the input stream. What this means is that a unit of Java (a file
containing Java source) contains an optional (notice the ?)
PackageDeclaration, followed by 0 or more (notice the *)
ImportDeclarations, followed by 0 or more TypeDeclarations.

Since the grammars for PackageDeclaration and
ImportDeclaration remain unchanged in AJ, we can produce
the required output by merely copying the input to the output.
(Remember, the AJ compiler's generated code is
in plain Java.) A semantic action to copy the text of any PackageDeclaration
to the output is shown in Figure 6. (Semantic actions are snippets of
Java mixed with special separators and macros entered in the Code
tab below the grammar tree.) Semantic actions can be associated with
each node of a grammar rule, and are executed when the input text matches
the pattern specified by the associated node.

Figure 6. Semantic actions for PackageDeclaration reference

Figure 6. Semantic actions for PackageDeclaration reference

To understand how this works, we need to know the function of %%, $B$
and $Z$. Each %% is a separator, thus splitting the code into
up to four sections (of which we only use the second and fourth here).
The second section (int begin = $B$;) is executed (by the generated
parser) when the first element of a PackageDeclaration (the keyword
package, in this case) is found, and the fourth section
(int end = $B$; System.out.print( $Z$.substring( begin, end ) );)
is executed after the last
element of the declaration (the ; at the end of the package declaration,
in this case) has been found and processed. Each $B$ translates into
a source-code offset at that point (so the variable begin will be
initialized to the position of the word "package" in the source, while
the variable end will be initialized to the beginning of whatever
token follows the ; at the end of the package declaration). The
$Z$ represents a String containing the entire source code.
It is therefore obvious that the println will output all of
the source code of the package declaration (including any white space and
comments preceding the next token in the input stream).

The following additional remarks also apply to CompilationUnit
(see Figure 6). The same copy verbatim semantic action
(described in the preceding paragraph) should also be used for
the ImportDeclaration, reference as the grammar of that rule
also does not change in AJ. However, we let the reference to
TypeDeclaration remain without any semantic action, to be
handled at lower levels of the grammar rule tree (where AJ
differs from Java).







Going Deeper, Changing ClassDeclaration

Having completed CompilationUnit, we move on to TypeDeclaration,
which is depicted in Figure 7. The InterfaceDeclaration
reference and the token SEMI will remain unchanged in AJ, and
so are also equipped with the same copy verbatim semantic action
described above. Again, we leave alone the ClassDeclaration
reference, to be handled at still deeper levels of the grammar tree.

Figure 7. TypeDeclaration (unchanged from java14C.lkv)
Figure 7. TypeDeclaration (unchanged from java14C.lkv)

The special locomotive symbol (with a cyan disk superimposed)
represents a syntactic lookahead. The first child node of such a
sequence is a rule that exists only to check the structure of the input
text at that point--and is required to disambiguate
the grammar when LL(k) logic is not strong enough.
An important thing to note about such lookahead rules is that neither the
reference itself nor any element within such a rule should carry a
semantic predicate.

Moving on to ClassDeclaration (shown in Figure 8),
we see that this is where the AJ grammar starts to differ from plain
Java. For reasons the reader can easily discover by probing around in
the java14C.lkv grammar (and this is left as an exercise), the
ClassDeclaration rule for AJ is most effective when revised
to that shown in Figure 9.

Figure 8. Original ClassDeclaration (as in java14C.lkv)
Figure 8. Original ClassDeclaration (as in java14C.lkv)

Figure 9. New ClassDeclaration (modified for AJ)
Figure 9. New ClassDeclaration (modified for AJ)

The semantic actions for this rule are spread out in several different
locations (the first of which is the rule's root node, as depicted in
Figure 9). The second piece is associated with the token
LCURLY, as shown in Figure 10. (Observe that the tabbed
pane below the rule-tree displays information about the selected node.)

Figure 10. Semantic action for LCURLY in ClassDeclaration
Figure 10. Semantic action for LCURLY in ClassDeclaration

The two semantic actions described above (associated with
the root node of ClassDeclaration and the LCURLY node
respectively) result in the copying out of everything from the beginning of
any ClassDeclaration up to the { preceding the member declarations
(since this part of the Java specification is unchanged in AJ).
The third piece of ClassDeclaration's semantic action
is associated with the ClassBodyDeclaration reference, as depicted in
Figure 11.

Figure 11. Semantic action for ClassBodyDeclaration reference
Figure 11. Semantic action for ClassBodyDeclaration reference

Note that this code is our old faithful copy verbatim semantic action.
We can use this semantic action here because the rule ClassBodyDeclaration
(inherited from java14C.lkv) describes syntactic
structures in Java that remain valid in AJ (and source code matching
that rule can therefore be copied over unchanged).

Another interesting thing to note in ClassBodyDeclaration
(Figure 11) is the reference to TaskDeclaration.
As will be explained below, this rule specifies the structure
of a task, and represents our new addition to the Java grammar.
The two references, ClassBodyDeclaration and
TaskDeclaration, have been combined using a choice element,
so, in effect, the task construct adds on to the structural
options permitted by ClassBodyDeclaration at this point
in the grammar for ClassDeclaration. (Check out the
definition of ClassBodyDeclaration to understand how this works.)







Finally, the AJ Specialty: TaskDeclaration

So, we finally reach TaskDeclaration--our very own addition to AJ.
In the discussion of ClassDeclaration above, we mentioned that the
reference to TaskDeclaration is not equipped with
any semantic action--thus relegating responsibility for code generation
to the rule itself, as described below.

The semantic action for TaskDeclaration is spread over four
locations, as described below. The first location is the rule's root node itself
(as in Figure 12), and the code here helps to define required variables.

Figure 12. Semantic action for TaskDeclaration root node
Figure 12. Semantic action for TaskDeclaration root node

The second location is the node IDENTIFIER (as in Figure 13),
and that code helps to capture the name of the task being defined. (The
$$, used in Figures 13 and 14, translates into
a type-safe value notation for the current token.)

Figure 13. Semantic action for IDENTIFIER node of TaskDeclaration
Figure 13. Semantic action for the IDENTIFIER node of TaskDeclaration

The third location is the node INT (as in Figure 14), and the
associated code helps to capture the priority value.

Figure 14. Semantic action for INT node of TaskDeclaration
Figure 14. Semantic action for the INT node of TaskDeclaration

The fourth (and last) location is the node BlockStatement, as depicted in
Figure 15.
The code in this semantic action uses the information gathered by the preceding
sibling nodes, and uses all of it to instantiate a java.lang.Thread object
(with the task's name) to run the code in the BlockStatement within its
run() method.

Figure 15. Semantic action for BlockStatement node of TaskDeclaration
Figure 15. Semantic action for the BlockStatement node of TaskDeclaration

Careful application of the principles described above (the function and use of %%,
$B$, and $Z$) will help you understand the text transformation performed by
this semantic action. The following two code snippets illustrate the transformation
of task blocks.

/*
*  Sample input task code
*/
task incrAge priority 5 {
  while (true) {
    age += 10;
    try {
      Thread.sleep( 3000 );
    } catch (Exception e) {}
  }
}
/*
*  Generated (plain Java) output code
*/
private Thread incrAge = new Thread() {
  public void run() {
  Thread.currentThread().setPriority( 5 );
  while (true) {
    age += 10;
    try {
      Thread.sleep( 3000 );
    } catch (Exception e) {}
  }
}

The underlined portions of the output text are generated using information
directly obtained from the AJ source. The line
Thread.currentThread().setPriority( 5 );
is, of course, only generated if a priority value is explicitly provided in
the TaskDeclaration.

We're Ready to Test!

All that remains now is to generate, compile, and test the AJ compiler.
Generate the compiler by selecting the File -> Write code menu choice (which
should produce a file called AJ.java in the same directory as the
grammar file). Then use the Java compiler: javac  AJ.java
(and remember to keep VisualLangLab.jar on the classpath). To make
sure that all's well with the grammar, use the file AJ.lkv from the
demos/AJ directory of the distribution. Finally, to test the compiler,
try to process the following AJ program (file aj00.aj in the
distribution).

/*
*  Sample AJ program (aj00.aj)
*/
public class aj00 {
  task aa {
    for (int i = 0; i < 5; ++i) {
      ++count;
      try {
        Thread.sleep(2000);
      } catch (Exception e) {}
    }
  }
  task b priority 7 {
    for (int i = 0; i < 10; ++i) {
      System.out.println( count );
      try {
        Thread.sleep( 2000 );
      } catch (Exception e) {}
    }
  }
  int count = 0;
  public static void main( String a[] ) {
    aj00 me = new aj00();
    me.aa.start();
    me.b.start();
  }
}

To process the file, use the command line java >aj00.java AJ -f aj00.aj
(with VisualLangLab.jar on the classpath). Note that the output has been
redirected to aj00.java, so that file receives the generated Java code, and
should be compiled (as a Java program:
javac aj00.java). Finally, running the program (using java aj00)
should produce the output shown below.

1
1
2
3
4
5
5
5
5
5

Life Beyond AJ

Java, and other third-generation languages,
are used as building blocks for solutions in a wide range of domains.
Since these languages do not support many domain-specific concepts,
the resulting code sometimes looks low-level, unintuitive, or
even contrived. The approach described here can be used to augment
programming languages (or create new ones) to produce more expressive
(and therefore more elegant, compact, and supportable) code.

For many situations, a simple tool (such as VisualLangLab) may be
adequate by itself. And for situations where it isn't, this article
should have provided a firm basis on which to build the capability
to use more sophisticated tools.

References

  1. Programming Language Processors in Java, David A. Watt and Deryck F. Brown, Pearson Education (2000).
  2. Modern Compiler Implementation in Java, A.W. Appel, Cambridge University Press (1997).
  3. Compilers: Principles, Techniques and Tools, A.V. Aho, R. Sethi, and J.D. Ullman, Addison Wesley (1986).

Sanjay Dasgupta has been using Java for telecom applications since 1996 (after many years of using many different languages in many industries).
Related Topics >> Programming   |