In this article we build a simple compiler that augments Java
with tasks (independent blocks of code that execute in
parallel), thus creating a new language called AJ that well supports
the programming of systems with concurrent activities.
This is not to imply that Java does not support multithreading, but
to create a syntax that enables the clear expression of concurrency
requirements. (AJ programs do, in fact, compile into Java programs
that use java.lang.Thread objects to support tasks.)
You will see how to use a completely visual, user-friendly parser
generator, VisualLangLab, to quite easily build a compiler for AJ.
Start Here: A Readymade Grammar for Java
VisualLangLab is a freely downloadable (GPL) java.net project. To install it, just
unzip the distribution file (VisualLangLab.zip) into a
directory. The visual parser generator (see Figure 1)
is started by clicking on VisualLangLab.jar.
The only other thing to keep in mind is that
VisualLangLab.jar must be on the classpath when compiling or
executing the generated parser.
Figure 1. The VisualLangLab GUI
The heart of any compiler project is the grammar--a typically
cryptic piece of text that serves as a specification for the language.
Since we wish to augment the capabilities of an existing language (Java),
it is simpler to start with an existing grammar for the base language.
The starting point for this exercise will therefore be a Java 1.4
grammar included in the VisualLangLab distribution (the file
java14C.lkv). Well-written grammars, like well-written programs,
are divided into smaller units called rules. A few sample rules
from the Java 1.4 grammar (as displayed by the VisualLangLab GUI) are
shown in Figure 2 (IfStatement), Figure 3 (SwitchStatement)
and Figure 4 (UnaryExpression).
Figure 2. The rule IfStatement in Java 1.4 grammar
Figure 3. The rule SwitchStatement in Java 1.4 grammar
Figure 4. The rule UnaryExpression in Java 1.4 grammar
Note that unlike other parser generator tools, a VisualLangLab grammar
is visual (trees with intuitive icons that can be
viewed and edited using interactive menus and
commands). The .lkv files produced and used by VisualLangLab are not text
files, and therefore not human-readable. These files contain
everything: token definitions, parser rules, and any
semantic actions defined in the grammar. A grammar file (such as
java14C.lkv) can be opened using File -> Open.
When a grammar file is opened, the top-level rule (or start symbol)
is displayed first (see Figure 5). The combo box at the top of the tree display
(or the Goto item on the grammar tree's pop-up menu) can be
used to display other rules. Refer to the documentation (docs/index.html
in the distribution kit) for more details.
Modifying the Java Grammar
Readers unfamiliar with the terms and technology used here can find
help in the books cited below, but most should be able to follow the
discussion even without a deep background in compiler technology.
Running VisualLangLab and actually performing the steps described will help.
The VisualLangLab GUI allows readers to find
their way around quite easily. Obviously, having the documentation
(docs/index.html in the distribution kit) around also helps.
All grammars have a start symbol (equivalent to a Java application's
main function) that is the entry point to the grammar. The Java 1.4
grammar's start symbol is called CompilationUnit, and appears as
shown in Figure 5. As you can see, the start symbol's root node is
distinguished by rendering it on a green background.
Figure 5. CompilationUnit (the Java 1.4 start symbol)
Before we move on, we must handle a fundamental issue--the grammar in
java14C.lkv has no semantic actions (application-specific code to be
executed when a rule or rule element matches the input).
This means that a parser created from it will verify that its input
complies with the grammar, but do nothing else (not even produce any
output). So the first change we will make is to add semantic actions
that output the required generated code.
Our approach to generating output will be based on the two following guiding
principles:
Identify sub-trees (at high levels of the grammar) that will remain
unchanged in AJ. Since the AJ compiler's generated code
is plain (sans-"task") Java, the generated code for
such sub-trees can be produced by a semantic action that merely outputs
a copy of the part of the input text that matches the sub-tree.
Identify sub-trees (at high levels) that will change in AJ.
These sub-trees are left without semantic actions--allowing
code-generation responsibility to devolve (recursively) to
the lowest possible levels in the rule tree.
Custom code-generating logic (as semantic actions associated with
the rule elements) is added to suitable sub-trees at the
lower levels of the grammar tree to translate AJ constructs
into the required plain Java code.
Further, to keep things simple, all generated-code is just written out to
the standard output (java.lang.System.out). Command-line I/O redirection,
as described below, can be used to get the generated code into a file.
Producing Output
Before we begin, let's discuss the apex rule (CompilationUnit
depicted in Figure 5) in more
detail. The locomotive icon at the root indicates that the child nodes are
a sequence (they must follow each other in the given order). The blue arrows
(used for PackageDeclaration, ImportDeclaration, and
TypeDeclaration) represent references to other grammar rules, and
the orange slotted disk (EOF) represents a token (an atomic
piece of the input (e.g., if, 3.25,
while, and +=, which a parser never has to break
down any further). EOF is a pseudo-token representing the end of
the input stream. What this means is that a unit of Java (a file
containing Java source) contains an optional (notice the ?)
PackageDeclaration, followed by 0 or more (notice the *)
ImportDeclarations, followed by 0 or more TypeDeclarations.
Since the grammars for PackageDeclaration and
ImportDeclaration remain unchanged in AJ, we can produce
the required output by merely copying the input to the output.
(Remember, the AJ compiler's generated code is
in plain Java.) A semantic action to copy the text of any PackageDeclaration
to the output is shown in Figure 6. (Semantic actions are snippets of
Java mixed with special separators and macros entered in the Code
tab below the grammar tree.) Semantic actions can be associated with
each node of a grammar rule, and are executed when the input text matches
the pattern specified by the associated node.
Figure 6. Semantic actions for PackageDeclaration reference
To understand how this works, we need to know the function of %%, $B$
and $Z$. Each %% is a separator, thus splitting the code into
up to four sections (of which we only use the second and fourth here).
The second section (int begin = $B$;) is executed (by the generated
parser) when the first element of a PackageDeclaration (the keyword
package, in this case) is found, and the fourth section
(int end = $B$; System.out.print( $Z$.substring( begin, end ) );)
is executed after the last
element of the declaration (the ; at the end of the package declaration,
in this case) has been found and processed. Each $B$ translates into
a source-code offset at that point (so the variable begin will be
initialized to the position of the word "package" in the source, while
the variable end will be initialized to the beginning of whatever
token follows the ; at the end of the package declaration). The
$Z$ represents a String containing the entire source code.
It is therefore obvious that the println will output all of
the source code of the package declaration (including any white space and
comments preceding the next token in the input stream).
The following additional remarks also apply to CompilationUnit
(see Figure 6). The same copy verbatim semantic action
(described in the preceding paragraph) should also be used for
the ImportDeclaration, reference as the grammar of that rule
also does not change in AJ. However, we let the reference to
TypeDeclaration remain without any semantic action, to be
handled at lower levels of the grammar rule tree (where AJ
differs from Java).