The Source for Java Technology Collaboration
User: Password:



   

From Writing Programs to Creating Compilers From Writing Programs to Creating Compilers

by Sanjay Dasgupta
10/06/2004


Contents
Start Here: A Readymade Grammar for Java
Modifying the Java Grammar
Producing Output
Going Deeper, Changing ClassDeclaration
Finally, the AJ Specialty: TaskDeclaration
We're Ready to Test!
Life Beyond AJ
References

In this article we build a simple compiler that augments Java with tasks (independent blocks of code that execute in parallel), thus creating a new language called AJ that well supports the programming of systems with concurrent activities. This is not to imply that Java does not support multithreading, but to create a syntax that enables the clear expression of concurrency requirements. (AJ programs do, in fact, compile into Java programs that use java.lang.Thread objects to support tasks.) You will see how to use a completely visual, user-friendly parser generator, VisualLangLab, to quite easily build a compiler for AJ.

Start Here: A Readymade Grammar for Java

VisualLangLab is a freely downloadable (GPL) java.net project. To install it, just unzip the distribution file (VisualLangLab.zip) into a directory. The visual parser generator (see Figure 1) is started by clicking on VisualLangLab.jar. The only other thing to keep in mind is that VisualLangLab.jar must be on the classpath when compiling or executing the generated parser.


Figure 1. The VisualLangLab GUI
Figure 1. The VisualLangLab GUI

The heart of any compiler project is the grammar--a typically cryptic piece of text that serves as a specification for the language. Since we wish to augment the capabilities of an existing language (Java), it is simpler to start with an existing grammar for the base language. The starting point for this exercise will therefore be a Java 1.4 grammar included in the VisualLangLab distribution (the file java14C.lkv). Well-written grammars, like well-written programs, are divided into smaller units called rules. A few sample rules from the Java 1.4 grammar (as displayed by the VisualLangLab GUI) are shown in Figure 2 (IfStatement), Figure 3 (SwitchStatement) and Figure 4 (UnaryExpression).

Figure 2. Rule IfStatement in Java 1.4 grammar
Figure 2. The rule IfStatement in Java 1.4 grammar

Figure 3. The rule <code>SwitchStatement</code> in Java 1.4 grammar
Figure 3. The rule SwitchStatement in Java 1.4 grammar

Figure 4. The rule <code>UnaryExpression</code> in Java 1.4 grammar
Figure 4. The rule UnaryExpression in Java 1.4 grammar

Note that unlike other parser generator tools, a VisualLangLab grammar is visual (trees with intuitive icons that can be viewed and edited using interactive menus and commands). The .lkv files produced and used by VisualLangLab are not text files, and therefore not human-readable. These files contain everything: token definitions, parser rules, and any semantic actions defined in the grammar. A grammar file (such as java14C.lkv) can be opened using File -> Open. When a grammar file is opened, the top-level rule (or start symbol) is displayed first (see Figure 5). The combo box at the top of the tree display (or the Goto item on the grammar tree's pop-up menu) can be used to display other rules. Refer to the documentation (docs/index.html in the distribution kit) for more details.

Modifying the Java Grammar

Readers unfamiliar with the terms and technology used here can find help in the books cited below, but most should be able to follow the discussion even without a deep background in compiler technology. Running VisualLangLab and actually performing the steps described will help. The VisualLangLab GUI allows readers to find their way around quite easily. Obviously, having the documentation (docs/index.html in the distribution kit) around also helps.

All grammars have a start symbol (equivalent to a Java application's main function) that is the entry point to the grammar. The Java 1.4 grammar's start symbol is called CompilationUnit, and appears as shown in Figure 5. As you can see, the start symbol's root node is distinguished by rendering it on a green background.

Figure 5. CompilationUnit (the Java 1.4 start symbol)
Figure 5. CompilationUnit (the Java 1.4 start symbol)

Before we move on, we must handle a fundamental issue--the grammar in java14C.lkv has no semantic actions (application-specific code to be executed when a rule or rule element matches the input). This means that a parser created from it will verify that its input complies with the grammar, but do nothing else (not even produce any output). So the first change we will make is to add semantic actions that output the required generated code.

Our approach to generating output will be based on the two following guiding principles:

  • Identify sub-trees (at high levels of the grammar) that will remain unchanged in AJ. Since the AJ compiler's generated code is plain (sans-"task") Java, the generated code for such sub-trees can be produced by a semantic action that merely outputs a copy of the part of the input text that matches the sub-tree.
  • Identify sub-trees (at high levels) that will change in AJ. These sub-trees are left without semantic actions--allowing code-generation responsibility to devolve (recursively) to the lowest possible levels in the rule tree. Custom code-generating logic (as semantic actions associated with the rule elements) is added to suitable sub-trees at the lower levels of the grammar tree to translate AJ constructs into the required plain Java code.

Further, to keep things simple, all generated-code is just written out to the standard output (java.lang.System.out). Command-line I/O redirection, as described below, can be used to get the generated code into a file.

Producing Output

Before we begin, let's discuss the apex rule (CompilationUnit depicted in Figure 5) in more detail. The locomotive icon at the root indicates that the child nodes are a sequence (they must follow each other in the given order). The blue arrows (used for PackageDeclaration, ImportDeclaration, and TypeDeclaration) represent references to other grammar rules, and the orange slotted disk (EOF) represents a token (an atomic piece of the input (e.g., if, 3.25, while, and +=, which a parser never has to break down any further). EOF is a pseudo-token representing the end of the input stream. What this means is that a unit of Java (a file containing Java source) contains an optional (notice the ?) PackageDeclaration, followed by 0 or more (notice the *) ImportDeclarations, followed by 0 or more TypeDeclarations.

Since the grammars for PackageDeclaration and ImportDeclaration remain unchanged in AJ, we can produce the required output by merely copying the input to the output. (Remember, the AJ compiler's generated code is in plain Java.) A semantic action to copy the text of any PackageDeclaration to the output is shown in Figure 6. (Semantic actions are snippets of Java mixed with special separators and macros entered in the Code tab below the grammar tree.) Semantic actions can be associated with each node of a grammar rule, and are executed when the input text matches the pattern specified by the associated node.

Figure 6. Semantic actions for PackageDeclaration reference
Figure 6. Semantic actions for PackageDeclaration reference

To understand how this works, we need to know the function of %%, $B$ and $Z$. Each %% is a separator, thus splitting the code into up to four sections (of which we only use the second and fourth here). The second section (int begin = $B$;) is executed (by the generated parser) when the first element of a PackageDeclaration (the keyword package, in this case) is found, and the fourth section (int end = $B$; System.out.print( $Z$.substring( begin, end ) );) is executed after the last element of the declaration (the ; at the end of the package declaration, in this case) has been found and processed. Each $B$ translates into a source-code offset at that point (so the variable begin will be initialized to the position of the word "package" in the source, while the variable end will be initialized to the beginning of whatever token follows the ; at the end of the package declaration). The $Z$ represents a String containing the entire source code. It is therefore obvious that the println will output all of the source code of the package declaration (including any white space and comments preceding the next token in the input stream).

The following additional remarks also apply to CompilationUnit (see Figure 6). The same copy verbatim semantic action (described in the preceding paragraph) should also be used for the ImportDeclaration, reference as the grammar of that rule also does not change in AJ. However, we let the reference to TypeDeclaration remain without any semantic action, to be handled at lower levels of the grammar rule tree (where AJ differs from Java).

Pages: 1, 2, 3

Next Page » 

View all java.net Articles.

 Feed java.net RSS Feeds