Chapter 4. Building simple generators – Code Generation in Action

Chapter 4. Building simple generators

4.1 The code munger generator type
4.2 The inline-code expansion generator model
4.3 The mixed-code generator model
4.4 The partial-class generator model
4.5 The tier generator model
4.6 Generating for various languages
4.7 Summary

In this chapter, we show you how to build simple generators based on the models presented in chapter 2. They provide an introduction to the practical aspect of building generators as well as a starting point for building your own generators. This is an ideal starting point for someone new to code generators. It’s also valuable for those experienced with code generators to see how this book approaches their construction.

Here are some points to keep in mind before delving into the examples:

  • Some of these generators are so simple that they could be accomplished with existing Unix utilities. While those utilities are powerful, as examples they would provide little insight into building code generators, and as such would give you no foundation for understanding the larger code generators presented in the chapters that follow.
  • The Ruby code we present in this chapter is deliberately simple. No functions or classes are specified, and we use only the most basic methods. You should not consider these examples as an indication of what can or cannot be coded in Ruby. Ruby supports the full range of programming models, from simple macros, to function decomposition, to object-oriented programming, to aspect-oriented programming.

4.1. The code munger generator type

Code munging is the simplest of the code generation models and thus serves as the starting point for our introduction to developing code generators. The code munger takes executable code (e.g., C, Java, C++, SQL), parses the file, and produces some useful output. In this section, we explain what a code munger does and show you six different variants on this fundamental generator type.

4.1.1. Uses and example code

Simple as they are, code mungers solve quite a few difficult problems. For instance, you can use them to:

  • Create documentation by reading special comments from a file and creating HTML as output. (JavaDoc is an example of this.)
  • Catalog strings that will require internationalization.
  • Report on resource identifier usage.
  • Analyze code and report compliance with company standards.
  • Create indices of classes, methods, or functions.
  • Find and catalog global variable declarations.

A code munger takes an input file, often source code, and searches it for patterns. Once it finds those patterns, it creates a set of one or more output files, as figure 4.1 shows.

Figure 4.1. The input and output flow of a code munger

Now let’s take a look at some specific types of mungers.

Code munger 1—parsing C macros

Our first code munger reads C files looking for #define macros. A #define macro in C looks like this:

#define MY_SYMBOL "a value"

For example:

#define PI 3.14159

The munger should find any #define macros and create an output file that contains a line for each symbol, using this format:

[symbol name],[symbol value]

We’ll show four variants of the code. The first reads a single file; the second scans a directory and reads all of the eligible files; the third reads the file contents from standard input; and the fourth reads from a file but uses the language parser toolkit to read the #define values. Figure 4.2 shows the first three variants in block architecture form.

Figure 4.2. The three basic code munger input models, which handle a file, a directory, and standard input

The advantages of each approach are:

  • Reading from a single file —This type of code munger is ideal for integration within a Makefile or build process, since it works in the standard compiler style of one input file and one or more output files. This type of I/O flow works with the external tool model as described in appendix D.
  • Reading from a directory —This approach is valuable because the code munger can handle a growing repository of files without the need for code alterations.
  • Reading from standard input —If the munger reads from standard input, then it can be embedded into IDEs. This type of I/O flow works with the filter model described in appendix D.

Here is our input file for the first variant, which reads from a single file:

#include <stdio.h>
#define HELLO "hello"
#define THERE "there"
int main( int argc, char *argv[] )
{
        printf( "%s %s\n", HELLO, THERE );
        return 0;
}

The two #define macros define HELLO and THERE. The output file should look like this:

HELLO,"hello"
THERE,"there"

The two symbols and their values are dumped into the file in comma-separated values (CSV) format. The Ruby generator to accomplish this is shown in listing 4.1.

Listing 4.1. Code munger 1, first variant: reading from a single file

  1. This code checks to make sure that you have a value from the command line. If you don’t, then it prints a friendly usage statement and exits the program.
  2. The first call to File.open opens a new file for output. The first parameter is the filename and the second, optional parameter is the mode, in this case w (for writing). The default mode is r (for reading). If the file can be opened for writing, then the code block is executed and the file handle is passed to the block.
  3. The second call to File.open opens the input file that was specified on the command line. This time, however, we use the each_line iterator to walk through the file line by line.
  4. This regular expression serves two purposes. First, the expression checks to see if the line is a #define macro. If it isn’t, then the next operator is executed and the loop goes on to the next line. Second, the expression captures the symbol and value text items from the #define macro if it finds one. The regular expression is broken out in figure 4.3. The expression looks for a full line because the ^ and $ markers are specified. The line should start with #define followed by some white space, followed directly by two segments delimited by white space.
    Figure 4.3. A breakout of the #define macro regular expression
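
The steps annotated above can be sketched in a few lines of Ruby. This is a minimal sketch and not the book's listing: the helper name extract_defines and the inline sample text are assumptions made for illustration, and the regular expression follows the shape described in step 4.

```ruby
# Sketch of the #define munger: one "symbol,value" line per macro.
# extract_defines is a hypothetical helper name, not from the original listing.
def extract_defines(source)
  lines = []
  source.each_line do |line|
    # Skip every line that is not of the form "#define SYMBOL VALUE".
    next unless line =~ /^#define\s+(.*?)\s+(.*?)$/
    lines << "#{$1},#{$2}"
  end
  lines
end

sample = <<~C
  #include <stdio.h>
  #define HELLO "hello"
  #define THERE "there"
C

# In the single-file variant the text would come from File.read(ARGV[0]),
# and the lines would be written inside File.open("out.txt", "w") { |fh| ... }.
puts extract_defines(sample)
```

Keeping the parsing in a method that takes a string means the same code could serve the file, directory, and standard input variants; only the code that obtains the text would change.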

It is a shame that regular expressions are often regarded as line noise. They are extremely powerful once you get a knack for the syntax. An easy way to learn regular expressions is to use your favorite editor to test regular expression search patterns against the file you want to read. Although regular expression syntax is not consistent across all applications and languages, there is usually a great deal of similarity, and it should be easy to convert the regular expression you develop to the Ruby regular expression syntax.

The second code munger variant reads a directory, finds any C files, and scans them for macros. The code is shown in listing 4.2.

Listing 4.2. Code munger 1, second variant: parsing a directory

  1. This code checks to make sure that you have a value from the command line. If you don’t, then it prints a friendly usage statement and exits the program.
  2. The Ruby Dir class is a portable directory access class. The open method opens the directory for reading. The each iterator then walks through every file in the directory and sends the files into the code block one by one.
  3. This code uses a simple regular expression to disqualify any files that do not end with .c. This regular expression is broken out in figure 4.4. The period character in a regular expression represents any character, so when you actually want to specify a period you need to use the [.] syntax. The c string matches “c” and the $ means the end of the line. The expression reads “search for .c at the end of the string.” When applied to a filename string, this means match any filename that has a .c extension.
    Figure 4.4. A breakout of the .c regular expression
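
The directory walk and the filename filter described above might look like the following. This is a sketch under stated assumptions: c_files is a hypothetical helper name, and the temporary directory exists only so that the example can run on its own.

```ruby
require "tmpdir"

# Sketch of the directory scan: return the names of all entries ending in ".c".
# c_files is a hypothetical helper name used only for this illustration.
def c_files(dir)
  names = []
  Dir.open(dir) do |d|
    d.each do |name|
      # [.] matches a literal period; $ anchors the match at the end of the name.
      next unless name =~ /[.]c$/
      names << name
    end
  end
  names.sort
end

# Build a throwaway directory so the sketch is self-contained.
Dir.mktmpdir do |dir|
  %w[main.c util.c notes.txt abc].each do |name|
    File.write(File.join(dir, name), "")
  end
  puts c_files(dir)
end
```

Note that a name such as abc is rejected even though it ends in c, because the expression requires the literal period in front of it.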

The third variant (listing 4.3) uses the standard input for the contents of the C file and then creates the out.txt file in response.

Listing 4.3. Code munger 1, third variant: reading from standard input

  1. The global $stdin stands for standard input. This object can be used as if it were any file that was opened as read only.

The fourth variant (listing 4.4) reads from a file but uses the language parser toolkit to get the macro names.

Listing 4.4. Code munger 1, fourth variant: reading from a file using the language parser toolkit

Code munger 2—parsing C comments

The second example code munger looks for special comments in a C file. You’ll find this example useful if you want to create a special markup language that can be used in close proximity with the code. Here is our example input file:

// @important Something important
// @important Something very important

int main( int argc, char *argv[] )
{
         printf( "Hello World\n" );
         return 0;
}

The special comment format is:

// @important ...

Everything after the // @important prefix is the content of the comment. You can choose whatever terms you like for your comment tags by altering the regular expression in the sample code.

This code munger’s job is to find the comments and then store the contents of the comments in an output file. The output file should look like this:

Something important
Something very important

Listing 4.5 contains the Ruby code you use to implement this.

Listing 4.5. Code munger 2: Looks for specific comments in a C file

  1. This code checks to make sure that you have a value from the command line. If you don’t, then it prints a friendly usage statement and exits the program.
  2. The heart of this code is the regular expression, which is shown broken out in figure 4.5. The caret symbol (^) at the beginning of the expression symbolizes the beginning of the line. The next four characters—the ones that look like a steeplechase—are the two forward slashes of the comment. Because the forward slash is a reserved character, it needs to be escaped, which means putting a backslash before the character. This is how you end up with the steeplechase pattern: \/\/.
    Figure 4.5. A breakout of the @important comment regular expression

    After that, you want to look for one or more whitespace characters and then the @important string. Following that is one or more whitespace characters. Then, use the parentheses to gather together all of the text until the end of the line, which is specified by the $ character.
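
The comment-matching expression described above can be tried out with a short sketch. The helper name important_comments is an assumption made for illustration; the regular expression follows the breakdown in step 2.

```ruby
# Sketch of the comment munger: collect the text of every "// @important" line.
# important_comments is a hypothetical helper name, not from the original listing.
def important_comments(source)
  notes = []
  source.each_line do |line|
    # ^ start of line, \/\/ the comment slashes, \s+ white space,
    # @important the tag, (.*?)$ everything up to the end of the line.
    next unless line =~ /^\/\/\s+@important\s+(.*?)$/
    notes << $1
  end
  notes
end

sample = <<~C
  // @important Something important
  // @important Something very important

  int main( int argc, char *argv[] )
C

puts important_comments(sample)
```

To use a different tag, only the @important portion of the expression needs to change.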

Code munger 3: pretty printing C as HTML

One very common use of a code munger is the wholesale conversion of the input file into an entirely new format. Our next example does exactly this type of conversion: the input is a C file and the output is HTML. A typical use for this type of converter would be to generate code fragments for documentation or to create standalone pages that are the end point of some code documentation.

Here is our example C file that we would like to turn into HTML:

// A test file for CM2
#include <stdio.h>

// @important Something important
// @important Something very important

int main( int argc, char *argv[] )
{
        printf( "Hello World\n" );
        return 0;
}

We don’t want anything too fancy in the HTML. The title should be the name of the file, and the code should be displayed in Courier. In the next version, you can consider doing some syntax coloring. Here is the type of HTML you should expect from the generator:

<font face="Courier">


Figure 4.6 shows the result when viewed in a web browser.

Figure 4.6. The HTML generated for test.c shown in a browser

The Ruby for our HTML-converting code munger is shown in listing 4.6.

Listing 4.6. Code munger 3: Converts a C file to HTML

  1. The first set of regular expression substitutions scans the complete text and turns special characters into entity references. HTML uses &lt;, &gt; and &amp; in place of <, >, and &.
  2. The second section of regular expressions turns spaces and tabs into nonbreaking spaces. We do this because HTML collapses white space on display. To preserve our formatting, we need to turn our spaces and tabs into nonbreaking spaces, which are specified using the &nbsp; entity.
  3. The last regular expression turns the returns into <br> tags with returns after them. These <br> tags are line break symbols in HTML.
  4. This is the HTML preamble that goes before the code text that we have converted into HTML. The important thing to notice is the insertion of the filename using Ruby’s #{expression} syntax.
  5. This is the HTML post-amble that closes up the HTML output file.
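
The substitution order in the steps above matters, and a small sketch makes that visible. The helper name c_to_html, the width of four nonbreaking spaces per tab, and the page wrapper around the font element are all assumptions made for illustration rather than details from the original listing.

```ruby
# Sketch of the C-to-HTML conversion. c_to_html is a hypothetical helper name.
def c_to_html(filename, text)
  body = text
    .gsub("&", "&amp;")         # must run first, or later entities get re-escaped
    .gsub("<", "&lt;")
    .gsub(">", "&gt;")
    .gsub(" ", "&nbsp;")        # HTML collapses runs of white space
    .gsub("\t", "&nbsp;" * 4)   # assumption: a tab is shown as four spaces
    .gsub("\n", "<br>\n")       # keep the return so the HTML stays readable

  # Preamble and post-amble; the filename is inserted with #{...}.
  <<~HTML
    <html>
    <head><title>#{filename}</title></head>
    <body>
    <font face="Courier">
    #{body}</font>
    </body>
    </html>
  HTML
end

puts c_to_html("test.c", "#include <stdio.h>\nint main()\n")
```

Escaping the ampersand before anything else is the one ordering constraint: if it ran last, it would rewrite the entities produced by the earlier substitutions.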

Code munger 4: filtering XML

Another common use for a code munger is to filter the input file into a new output file. These types of code mungers are usually built as a one-off to do some type of processing on a file where the output is then used as the new master file. The original file and the generator are then discarded.

This code munger example takes some XML as input and adds some attributes to it, and then creates a new output file with the adjusted XML. Here is our example input file:

        <class name="foo">
               <field>bar1</field>
               <field>bar2</field>
        </class>

We want to add a type attribute to each field. To keep the example small, the type will always be integer. The output file should look like this:

        <class name='foo'>
               <field type='integer'>bar1</field>
               <field type='integer'>bar2</field>
        </class>

One approach to transform between two XML schemas is to use Extensible Stylesheet Language Transformations (XSLT). Another is to make a dedicated code munger that acts as a filter. The code for the XML filter code munger appears in listing 4.7.

Listing 4.7. Code munger 4: Transforms an XML file

  1. Using REXML::Document.new, we construct a new XML document object from the contents of the input file. If this is successful, then all of the XML in the input file will be in memory.
  2. The each_element iterator finds all of the elements that match the given pattern and then passes them into the code block. You should note that the code uses a hierarchical specification for the node by passing in the string "class/field". This tells each_element that we want the field elements within the class elements.
  3. Using the add method we add a new attribute object to the field. This attribute object is created with a key and value pair where the key is type and the value is integer.
  4. Rexml makes building an XML file easy. You create the file using the standard Ruby File object, then use the to-string method (to_s) on the rexml root object to create the text of the altered XML.
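
The four steps above fit into a short sketch. The helper name add_field_types and the classes wrapper element used as the document root are assumptions made for illustration; the REXML calls follow the annotations.

```ruby
require "rexml/document"

# Sketch of the XML filter: add type='integer' to every field element.
# add_field_types is a hypothetical helper name, not from the original listing.
def add_field_types(xml_text)
  doc = REXML::Document.new(xml_text)

  # "class/field" selects the field elements inside the class elements
  # that sit directly under the root.
  doc.root.each_element("class/field") do |field|
    field.attributes.add(REXML::Attribute.new("type", "integer"))
  end

  doc.to_s
end

# Assumption: the class elements are wrapped in a root element named classes.
input = <<~XML
  <classes>
    <class name="foo">
      <field>bar1</field>
      <field>bar2</field>
    </class>
  </classes>
XML

puts add_field_types(input)
```

In a file-based munger, the string returned by add_field_types would be written out with the standard File object, as step 4 describes.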

Code munger 5: generating C static data from a CSV file

This munger uses the data stored in a CSV file to build a C implementation file that contains all of the data in an array of structures. You could use this file in building static lookup data for use during execution from the output of a database or spreadsheet.

Here’s our example data file:

Larry,Wall
Brian,Kernighan
Dennis,Ritchie

We would like to take the names of these legends of computer science and store them in C as an array of structures, as shown here:

struct {
  char *first;
  char *last;
} g_names[] = {
  { "Larry","Wall" },
  { "Brian","Kernighan" },
  { "Dennis","Ritchie" },
  { "","" }
};
This code munger is different from the previous code mungers because it uses an ERb template to generate the output code. The Ruby portion of the generator is shown in listing 4.8 and the template is shown in listing 4.9.

Listing 4.8. Code munger 5: Creates C constants from a CSV file

  1. The Struct class facilitates the creation of classes that are just structures. In this case, we create a new class called Name with the member variables first and last. The Struct builder also creates get and set methods for each field. In this case, it means that you can use obj.last or obj.last = "smith".
  2. Here we split the line into its two components, the first and last names, which are delimited by a comma. The map method is used to iterate over the fields and to remove any trailing carriage returns by calling the chomp method on each item.
  3. This code builds and executes the ERb template for the C data file. The template is run when the result method is called. The local variables are passed in by using the binding method.

Listing 4.9. Template for code munger 5
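
A compact sketch of the generator and its template together is shown here. It is not the book's pair of listings: the template is held in a constant rather than a separate file, and the names TEMPLATE and build_names_c are assumptions made for illustration.

```ruby
require "erb"

# A structure-only class with first and last accessors.
Name = Struct.new(:first, :last)

# Assumption: the template lives in a constant here; in the text it is a file.
# With trim_mode "-", a tag closed with -%> swallows the newline after it.
TEMPLATE = <<~'C_TEMPLATE'
  struct {
    char *first;
    char *last;
  } g_names[] = {
  <% names.each do |name| -%>
    { "<%= name.first %>","<%= name.last %>" },
  <% end -%>
    { "","" }
  };
C_TEMPLATE

# build_names_c is a hypothetical helper name, not from the original listing.
def build_names_c(csv_text)
  rows = csv_text.each_line.reject { |line| line.strip.empty? }
  names = rows.map do |line|
    # Split on the comma and chomp each item to drop the trailing return.
    first, last = line.split(",").map { |field| field.chomp }
    Name.new(first, last)
  end
  # binding hands the local variable names to the template.
  ERB.new(TEMPLATE, trim_mode: "-").result(binding)
end

puts build_names_c("Larry,Wall\nDennis,Ritchie\n")
```

Keeping the C text in a template rather than in a series of print calls means the shape of the output can be read, and changed, without touching the parsing code.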

Code munger 6: building base classes

The last code munger in this chapter shows an easy way of building base classes with rudimentary functionality; you can then derive from those classes to create a completed product.

This may seem a little backward, but it will make sense by the end of the example. You start with the derived class. The base class is specified by the derived class, but the base class itself does not yet exist. The derived class has some comments that specify the design of the base class. The derived class is fed into the code munger, which creates the base class using these instructions. The derived class and base class are then compiled to create the output. This flow is shown in figure 4.7.

Figure 4.7. The input and output flow for code munger 6

This example builds base classes that are primitive data structure objects. A data structure object contains only private instance variables and simple get and set methods to alter the values. These types of classes are grunt work to build so they are ideal for a simple generator.

Let’s start with the derived class:

// @field first
// @field middle
// @field last

public class Person extends PersonBase
{
  public String toString()
  {
    return _first + " " + _middle + " " + _last;
  }
}

Here we have created a new class called Person, which should inherit from the as-yet nonexistent base class PersonBase. PersonBase will be created by the generator. We would like the Person class to have three fields: first, middle, and last. To make that happen, we write the three comments at the top of the file; they start with // @field. These comments tell the generator that the base class should have three fields: first, middle, and last. You could add an extension to handle types.

The base class looks like this:

public class PersonBase {

  protected String _first;
  protected String _middle;
  protected String _last;

  public PersonBase()
  {
    _first = new String();
    _middle = new String();
    _last = new String();
  }

  public String getFirst() {
    return _first;
  }
  public void setFirst( String value ) {
    _first = value;
  }

  public String getMiddle() {
    return _middle;
  }
  public void setMiddle( String value ) {
    _middle = value;
  }

  public String getLast() {
    return _last;
  }
  public void setLast( String value ) {
    _last = value;
  }
}

This base class has the three instance variable declarations, the constructor, and the get and set methods for each instance variable. We removed the JavaDoc comments for brevity. You need to make sure that the output has JavaDoc when a generator such as this is used in production.

The Ruby code for the generator is shown in listing 4.10. It is similar to the generators we examined at the beginning of this chapter. The new feature is the use of an ERb template to build the output.

Listing 4.10. Code munger 6: Creates Java base classes for data storage

  1. This regular expression replacement code finds the .java at the end of the filename and replaces it with a null string, effectively removing it. The regular expression is shown in more detail in figure 4.8.
    Figure 4.8. A regular expression that finds .java

  2. This regular expression finds the // @field field-name comments in the file. The field name is put into $1. Figure 4.9 is a diagram of the regular expression broken out into more detail. The steeplechase segment at the beginning (\/\/) corresponds to the two forward slashes at the start of the comment. After that, we look for some white space (\s+), the text @field, and some more white space (\s+). Then we group all of the text together before the return (\n).
    Figure 4.9. A regular expression for finding a field definition comment

  3. Here you create the ERb object with the contents of the template file. Run the template to get the output text. The call to binding gives the ERb template access to all of the variables and functions within the current scope.

Listing 4.11 shows the ERb template that is used by code munger 6.

Listing 4.11.

  1. Here the template uses the each iterator on the fields variable to build each field definition.
  2. Next, we use the fields array to create new invocations for each of the fields in the structure.
  3. The fields array builds the get and set methods. The capitalize method capitalizes the first character of the string.
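
Putting the generator and the template side by side, a sketch of the whole munger might look like this. The names JAVA_TEMPLATE and build_base_class are assumptions made for illustration, and the template is held in a constant rather than in a separate template file.

```ruby
require "erb"

# Assumption: the template is inlined here; the text keeps it in its own file.
JAVA_TEMPLATE = <<~'JAVA'
  public class <%= class_name %>Base {
  <% fields.each do |field| -%>
    protected String _<%= field %>;
  <% end -%>

    public <%= class_name %>Base() {
  <% fields.each do |field| -%>
      _<%= field %> = new String();
  <% end -%>
    }
  <% fields.each do |field| -%>

    public String get<%= field.capitalize %>() {
      return _<%= field %>;
    }

    public void set<%= field.capitalize %>( String value ) {
      _<%= field %> = value;
    }
  <% end -%>
  }
JAVA

# build_base_class is a hypothetical helper name, not from the original listing.
def build_base_class(filename, source)
  # Strip the .java extension to get the class name.
  class_name = File.basename(filename).sub(/[.]java$/, "")
  # Collect the field names from the "// @field name" comments.
  fields = source.scan(/^\/\/\s+@field\s+(.*?)$/).flatten
  ERB.new(JAVA_TEMPLATE, trim_mode: "-").result(binding)
end

derived = <<~SRC
  // @field first
  // @field last

  public class Person extends PersonBase
SRC

puts build_base_class("Person.java", derived)
```

The output would be written to a file named after the base class, for example PersonBase.java, so that the derived class and the generated base class compile together.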

4.1.2. Developing the generator

Developing a code munger is straightforward. The process flow in figure 4.10 shows the steps involved.

Figure 4.10. The design and implementation steps for developing a code munger

Let’s look at these steps in more detail:

  • Build output test code —First, determine what the output is supposed to look like. That will give you a target to shoot for and will make it easy to figure out what you need to extract from the input files.
  • Design the generator —How should the code munger get its input? How should it parse the input? How should the output be built—using templates or just by printing the output? You should work out these issues before diving into the code.
  • Develop the input parser —The first coding step is to write the input processor. This code will read the input file and extract the information from it that is required to build the output.
  • Develop templates from the test code —Once the input is handled, the next step is to take the output test code that was originally developed and use it to either create templates or build the code that will create the output.
  • Develop the output code builder —The last step is to write the glue that takes the data extracted by the input parser and feeds it to the output formatter. Then, you must store the output in the appropriate output files.

This is not meant to be a blueprint for your development. Each code munger is slightly different, and you’ll add and remove development phases accordingly.

4.2. The inline-code expansion generator model

Inline-code expansion is an easy way to simplify source code. Sometimes your source code is so mired in infrastructure work that the actual purpose of the code gets lost in its implementation. Inline-code expansion simplifies your code by adding support for a specialized syntax, in which you specify the requirements for the code to be generated. This syntax is parsed by the generator, which then implements code based on the requirements.

A standard use of inline-code expansion is embedding SQL statements within code. SQL access requires a lot of infrastructure: you have to get the database connection handle, prepare the SQL, marshal the arguments, run the command, and parse and store the output. All of this infrastructure code obscures the fact that you are running a simple SELECT. The goal of inline expansion is to allow the engineer to concentrate on the SQL syntax and leave the infrastructure code to the generator.

Here is an example of some code marked up with an inline SQL statement, which is fed into the generator as input:

void main( int argc, char *argv[] )
{
   <sql-select: SELECT first, last FROM names>
}

The output of the generator is:

void main( int argc, char *argv[] )
{
   struct {
      char *first;
      char *last;
   } *sql_output_1;
   {
     db_connection *db = get_connection();
     sql_statement *sth = db->prepare( "SELECT first, last FROM names" );
     sql_output_1 = malloc( sizeof( *sql_output_1 ) * sth->count() );
     for( long index = 0; index < sth->count(); index++ )
     {
        // ... marshal data
     }
   }
}

This output is C pseudo-code and not for any particular SQL access system, but it demonstrates that the actual intent of the original SELECT statement is lost in a quagmire of code.

In this example, we have created a new language on top of C; let’s call it SQL-C. The job of the generator is to convert SQL-C into production C for the compiler.

From the design and implementation perspective, the inline-code expander model is a formalized code munger. The input is source code from any computer language. That said, the source code is augmented with some type of markup that will be replaced during the generation cycle with actual production code. The output file has all of the original code, but the special markup has been replaced with production code. This output file is then used for compilation to build the actual product.

4.2.1. Uses and examples

Among the many possible uses for inline-code expansion are:

  • Embedding SQL in implementation files
  • Embedding performance-critical assembler sections
  • Embedding mathematical equations, which are then implemented by the generator

The decision of when to use an inline-code expander versus object-oriented or functional library techniques can be critical. C++ has support for templating and inline expansion within the language. You should look to these tools before contemplating the building of a new language style based on inline-code expansion.

One important reason to avoid inline-code expansion is that the technique can make debugging difficult because you are debugging against the output of the generator and not the original source code. Therefore, the choice to use inline-code expansion should be made with great regard for both the positives and the negatives associated with using this technique. The technique has been used successfully in products such as Pro*C and SQLJ, which embed SQL in C and Java respectively, but there are issues with debugging the resulting code.

Figure 4.11 depicts the input and output flow for the inline-code expansion model.

Figure 4.11. The input and output flow for an inline-code expansion generator

The code expander reads the input source code file, replaces any special markup, and creates an output file with the resulting code. The output code is fed to the compiler and is used directly as part of the production application.

Let’s take a look at an example. This inline-code expander will take strings marked in angle brackets and implement them with printf. The input filename is test.cx; the output filename is test.c. Here’s our test input file:

int main( int argc, char *argv[] )
{
        <Hello World>
        return 0;
}

Our input file specifies that the "Hello World" string should be printed to standard output. The input file is fed to the generator, which outputs the following:

int main( int argc, char *argv[] )
{
        printf("Hello World");
        return 0;
}

As you can see, the "Hello World" string has been wrapped in a call to printf to send it to the standard output. While this example is not particularly useful on its own, it demonstrates the simplest form of the inline-code expansion technique.

The Ruby code that implements this generator is shown in listing 4.12.

Listing 4.12. Inline-code expander: Printf maker

  1. The regular expression substitution routine replaces every occurrence of <text> with printf("text");. The gsub! method finds all of the occurrences of the regular expression within the string. It then invokes the block on each occurrence and replaces the occurrence with the result of the block. The regular expression is shown in exploded form in figure 4.12.
    Figure 4.12. The regular expression for finding the printable text

    The < and > characters in the regular expression need to match exactly. In between the two you can place any sequence of characters. In practice, you may want to tighten up the expression to avoid grabbing legitimate C code and regarding it as an expression.
  2. This regular expression substitution replaces the .cx at the end of the filename with .c. Figure 4.13 describes the regular expression. The [.] matches an actual period, the cx matches the string "cx", and the end of the string is marked by $. This expression will match any .cx at the end of a string or, in this case, a filename.
    Figure 4.13. The regular expression for finding .cx
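
Both substitutions described above are short enough to sketch directly. The helper names expand_printf and output_name are assumptions made for illustration; the regular expressions follow the two breakdowns.

```ruby
# Sketch of the printf maker. Both helper names are hypothetical.

# Replace every <text> with printf("text");
def expand_printf(text)
  # The block runs once per match; $1 holds the text between < and >.
  text.gsub(/<(.*?)>/) { "printf(\"#{$1}\");" }
end

# Turn an input name such as test.cx into the output name test.c.
def output_name(input_name)
  input_name.sub(/[.]cx$/, ".c")
end

source = <<~CX
  int main( int argc, char *argv[] )
  {
          <Hello World>
          return 0;
  }
CX

puts output_name("test.cx")
puts expand_printf(source)
```

As the text warns, this expression is loose: a line such as #include <stdio.h> would also be rewritten, so a production expander would need a tighter pattern or a more distinctive markup.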

4.2.2. Developing the generator

Figure 4.14 shows a sample process for developing an inline-code expansion code generator.

Figure 4.14. The design and implementation steps for developing an inline-code expansion generator

Let’s examine these steps in greater detail:

  • Build the test code —First, build the code you would like the generator to create. That way, you can compile it and see if it works for you.
  • Design the generator —Next, design the input file format for the generator. You should also sketch out the flow of your generator.
  • Develop the input parser —The first step in building the generator is to read the input file and extract any information from it that is not related to the code replacement sections. Our code does not do this because the example was too simple. However, there is an advantage to being able to specify global options embedded in the C file that will affect how the code is generated. These options would not be replaced during generation; they would only set flags that would change the style of code that is generated.
  • Develop the code replacer —The next step is to build the code replacement regular expressions. These are the expressions that will find the special markup sections within the input file.
  • Develop the templates from the test code —Once you have found the special markup, you can spend time building the templates that will build the output code. In our example code, we didn’t use a template because our generator was too simple. For more complex generators, you can use a template-based approach. To create the templates, use the test code that you developed at the beginning of this process as a basis.
  • Develop the output code builder —The last step is to merge the code replacer and the output code builder. Once this is completed, you will have the finalized output code in memory, and all you need to do is build the output file.

4.3. The mixed-code generator model

The mixed-code generator is a more practical implementation of the inline-code expansion model. The generator reads the input file, makes some modifications to the file, and then saves it back into the input file after backing up the original.

The potential uses are similar. Using special markup, the generator builds implementation code to match the requirements specified in the markup.

The key difference between the two models is the I/O flow. In the mixed-code generation model, the input file is the output file. Mixed-code generation thus avoids the debugging problems inherent with inline-code generation.

To demonstrate the difference between the two models, we’ll show the same example from the inline-code expansion introduction implemented as mixed-code generation, starting with the input:

void main( int argc, char *argv[] )
{
   // sql-select: SELECT first, last FROM names
   // sql-select-end
}

Note that the <sql-select: ...> syntax has been replaced with specially formatted comments.

The output of the generator is:

void main( int argc, char *argv[] )
{
   // sql-select: SELECT first, last FROM names
   struct {
      char *first;
      char *last;
   } *sql_output_1;
   {
     db_connection *db = get_connection();
     sql_statement *sth = db->prepare( "SELECT first, last FROM names" );
     sql_output_1 = malloc( sizeof( *sql_output_1 ) * sth->count() );
     for( long index = 0; index < sth->count(); index++ )
     {
        // ... marshal data
     }
   }
   // sql-select-end
}

Notice how the comments are maintained but that now the interior is populated with code. The generated code implements the requirements specified in the comments. The next time the generator is run, the interior will be removed and updated with newly generated code.
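
The regenerate-in-place cycle can be sketched as a single substitution over the region between the two marker comments. This is a sketch under stated assumptions: the helper name regenerate is hypothetical, and the generated interior is a one-line placeholder comment rather than real database code.

```ruby
# Sketch of mixed-code regeneration. regenerate is a hypothetical helper name.
def regenerate(source)
  # Match from the opening marker line through the closing marker line.
  # With /m, the middle .*? also spans any previously generated lines.
  pattern = /^([ \t]*)\/\/ sql-select: (.*?)\n.*?^[ \t]*\/\/ sql-select-end[ \t]*$/m

  source.gsub(pattern) do
    indent, sql = $1, $2
    [
      "#{indent}// sql-select: #{sql}",
      # Placeholder: a real generator would emit the implementation here.
      "#{indent}/* generated code for: #{sql} */",
      "#{indent}// sql-select-end"
    ].join("\n")
  end
end

input = <<~C
  void main( int argc, char *argv[] )
  {
     // sql-select: SELECT first, last FROM names
     // sql-select-end
  }
C

first_pass = regenerate(input)
puts first_pass

# Running the generator on its own output leaves the file unchanged.
puts regenerate(first_pass) == first_pass
```

Because the markers survive each pass, the generator can be rerun on its own output. A file-based version would save a backup copy of the original before writing the result back to the same path.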

Mixed-code generation has advantages over inline-code expansion:

  • The use of comments avoids any syntax corruption with the surrounding code.
  • By using comments in the original file, you can take advantage of special features of any IDE, such as syntax coloring or code hints.
  • You can use a debugger because the input file is the same as the output file.
  • The output code is located right next to the specification, so there is a clear visual correspondence between what you want and how it is implemented.

4.3.1. Uses and examples

The potential uses for mixed-code generation are similar to those for inline-code expansion. However, because of the proximity of the markup code to the generated code you may think of using it for other types of utility coding, such as:

  • Building rudimentary get/set methods
  • Building marshalling code for user interfaces or dialog boxes
  • Building redundant infrastructure code, such as C++ copy constructors or operator= methods

As we mentioned earlier, the major difference between inline-code generation and mixed-code generation is the flow between the input and the output. Figure 4.15 shows the flow for mixed-code generation. As you can see, the generation cycle uses the source code as both the input and the output. This is the same code that is sent to the compiler and used as production code.

Figure 4.15. The inputs and output flow for a mixed-code generator

It is the responsibility of the generator to retain a backup of the original code before replacing it with the newly generated code. When you use this model, make sure that you manage the backups and have a reliable source code control system.
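The backup step can be sketched in a few lines of Ruby. This is a minimal sketch, not the book's implementation: the helper name replace_with_backup and the ".bak" suffix are assumptions made for illustration.

```ruby
require 'fileutils'

# Hypothetical helper: copy the original file to "<path>.bak" and only then
# overwrite it with the newly generated text. The ".bak" suffix is an
# assumption; use whatever naming scheme fits your project.
def replace_with_backup(path, new_text)
  backup_path = "#{path}.bak"
  FileUtils.cp(path, backup_path)                  # preserve the original first
  File.open(path, 'w') { |f| f.write(new_text) }   # then replace the contents
  backup_path
end
```

Copying before writing means a failed run still leaves the original on disk. A single ".bak" file is overwritten on every run, so it complements source code control rather than replacing it.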

Our first example of the mixed-code generation type will build print statements. Here is the input file for our example:

int main( int argc, char *argv[] )
{
// print Hello World
// print-end
        return 0;
}

We’ve changed the <...> syntax into comments. There are two reasons for this change. First, the code is compilable both before and after generation. Second, the comments are maintained between generation cycles so that the generator knows which parts of the code to maintain. The output of the generator, which is in the same file as the input, is shown here:

int main( int argc, char *argv[] )
{
// print Hello World
printf("Hello World");
// print-end
        return 0;
}

The original comments are retained and the implementation code has been put in-between the start and end comments.

Do you need start and end comments? Yes. You need a predictable ending marker for the regular expression. Otherwise, you would not know which code belonged to the generator and therefore could be replaced. You could end up replacing the contents of the file from the starting marker to the end of the file.

Listing 4.13 contains the code that implements our simple mixed-code generator.

Listing 4.13. Mixed-code generator 1: building printfs

  1. This regular expression finds the // print ... and // print-end markers and all of the content between the two. The // print text goes into $1; the print specification goes into $2. The generated code in the middle, if it is there, goes into $3, and the // print-end goes into $4. The regular expression is shown in exploded form in figure 4.16.
    Figure 4.16. The regular expression that finds the special markup comments

  2. This creates a printf call from the string that was specified in the comment.
  3. This puts the expression back together by adding the code text to the $1, $2, and $4 groups that we preserved from the regular expression.
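The three steps described above can be sketched as follows. This is not a reproduction of listing 4.13; it is a minimal sketch written from the description, and the names PRINT_BLOCK and expand_prints are assumptions. It works on a string rather than a file, and it assumes the text to print contains no double quotes or percent signs.

```ruby
# $1 = the "// print " marker, $2 = the text to print,
# $3 = any previously generated code, $4 = the "// print-end" marker.
# The /m flag lets $3 span several lines; the non-greedy quantifiers stop
# the match at the first end marker.
PRINT_BLOCK = %r{(//\s*print\s+)(.*?)\n(.*?)(//\s*print-end)}m

# Returns the source text with every print block (re)populated.
def expand_prints(text)
  text.gsub(PRINT_BLOCK) do
    code = "printf(\"#{$2}\");"      # build the printf call from the comment
    "#{$1}#{$2}\n#{code}\n#{$4}"     # reassemble, discarding the old $3
  end
end
```

Because the old interior in $3 is dropped and rebuilt on every pass, running the function on its own output leaves the text unchanged, which is the property that makes it safe to run the generator repeatedly on the same file.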

4.3.2. Developing the generator

Figure 4.17 shows a simple development process for building a mixed-code generator.

Figure 4.17. The design and implementation steps for developing a mixed-code generator

As you can see, this is very similar to the process for developing an inline-code expander:

  • Build the test code —First, build the code you want to see come out of the generator. That means writing some special markup comments and also identifying the code to be generated.
  • Design the generator —Sketch out the code flow for the generator.
  • Develop the input parser —If you want to include any options that you can specify in the input file, this is the time to implement the parsing for that. You need to develop the code that reads the input file and scans for any options that will be used to modify the behavior of the generator. Our earlier example doesn’t have any options, but you could imagine that there might be an option for specifying a custom procedure instead of printf.
  • Develop the code replacer —Next, build the regular expression that will read the replacement sections. This expression should find the starting and ending blocks, as well as the arguments and the code in the interior. The example code shows a typical regular expression for this purpose.
  • Develop the templates from the test code —Now that you have identified replacement regions, you need to develop the code that will populate them with the generated code. The example code is so simple that all you need to do is some string formatting to build the printf statement. If the requirements of your generator are more complex, you may want to use some of the ERb templating techniques shown in chapter 3, “Code generation tools.”
  • Develop the output code builder —The final step is to merge the code replacer with the output code builder to create the final output code. Then you need to back up the original file and replace it with the newly generated code.
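As an illustration of the input parser step, here is one way options could be read from the input file. The example in this section defines no options, so everything here is hypothetical: the "// generator-option: name=value" comment syntax, the print-function option, and the read_options helper are all assumptions.

```ruby
# Hypothetical option comment, for example:
#   // generator-option: print-function=log_message
OPTION_LINE = %r{^\s*//\s*generator-option:\s*(\w[\w-]*)\s*=\s*(\S+)\s*$}

# Scans the input text for option comments and returns a hash of settings.
# The default of printf matches the behavior of the example in this section.
def read_options(text)
  options = { 'print-function' => 'printf' }
  text.scan(OPTION_LINE) { |name, value| options[name] = value }
  options
end
```

The code replacer could then look up options['print-function'] instead of hard-coding printf when it builds each call.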

4.4. The partial-class generator model

The partial-class generation model is the first of the code generators to build code from an abstract model. The previous generators used executable code (e.g., C, Java, C++, SQL) as input. This is the first generator to use an abstract definition of the code to be created as input. Instead of filtering or replacing code fragments, the partial-class generation model takes a description of the code to be created and builds a full set of implementation code.

The difference between this model and tier generation is that the output of a partial-class generator should be used in conjunction with some derived classes that will augment and override the output. Both the base class and the derived class are required to create the fully functioning production form. Tier generation requires no such derived classes because it takes responsibility for building and maintaining all of the code for the tier.

Partial-class generation is a good starting point for tier generation. The advantage of building only part of a class is that you have the ability to override the logic in the generated classes if your business logic has custom requirements that are not covered by the generator. Then, as you add more functionality to the generator, you can migrate the custom code from the user-derived classes back into the generated classes.

4.4.1. Uses and examples

Here are some common uses for partial-class generation:

  • Building data access classes that you can override to add business logic
  • Developing basic data marshalling for user interfaces
  • Creating RPC layers that can be overridden to alter behavior

Figure 4.18 shows the I/O flow using a definition file as the source of the base class structure information.

Figure 4.18. A partial-class generator using a definition file for input. Note that the output of the generator relates to the handwritten code.

In a partial-class generation model, your application code should never create instances of the base classes directly—you should always instantiate the derived classes. This will allow the partial-class generator to transition to a full-tier generator by generating the derived classes directly and removing the original base classes from the project.

Let’s look at an example. This partial-class generator is closely related to code munger 6. Both generators are building classes for structured storage based on a defined set of the required fields. The output of the generator is a set of classes based on the contents of a simple text definition file, shown here:


This small test file specifies one class, Person, which has fields named first, middle, and last. We could have specified more classes by adding extra lines, but just having one class keeps the example simple.

The output class, PersonBase, is shown here:

public class PersonBase {

  protected String _first;
  protected String _middle;
  protected String _last;

  public PersonBase()
  {
    _first = new String();
    _middle = new String();
    _last = new String();
  }

  public String getFirst() { return _first; }
  public void setFirst( String value ) { _first = value; }

  public String getMiddle() { return _middle; }
  public void setMiddle( String value ) { _middle = value; }

  public String getLast() { return _last; }
  public void setLast( String value ) { _last = value; }
}

Listing 4.14 shows the code for the generator.

Listing 4.14. Partial-class generator: building structure classes from a CSV file

  1. The split method splits a string into an array on the specified delimiter. In this case, we use the : character as a delimiter. The two elements of the string are then put into the class_name and field_text variables.
  2. After you have the field_text, you want to break it up into an array of fields. Use the split method again to break up the string, using the comma character as a delimiter.
  3. Here you create an ERb object with the contents of the text template. Then, invoke the template using the result method. Pass in the variables from the current scope using the binding method.

The t1.rb generator uses one template,

  1. Here you iterate through each field and create a String instance variable for the field.
  2. In the constructor, this code replaces the class name and then creates calls to new for each field.
  3. Finally, this code iterates through each field to create get and set routines with the right field name.
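The parsing and templating steps described for the generator and its template can be sketched together. This is not listing 4.14 or its template; it is a minimal sketch written from the annotations. The definition line format "Person:first,middle,last" is inferred from the description of the two split calls, and the names BASE_TEMPLATE and generate_base_class are assumptions.

```ruby
require 'erb'

# Template modeled on the PersonBase output shown earlier in this section.
BASE_TEMPLATE = <<~'JAVA'
  public class <%= class_name %>Base {
  <% fields.each do |field| %>
    protected String _<%= field %>;<% end %>

    public <%= class_name %>Base()
    {<% fields.each do |field| %>
      _<%= field %> = new String();<% end %>
    }
  <% fields.each do |field| %>
    public String get<%= field.capitalize %>() { return _<%= field %>; }
    public void set<%= field.capitalize %>( String value ) { _<%= field %> = value; }
  <% end %>
  }
JAVA

# Turns one definition line (assumed format "Class:field,field,...")
# into the Java source text for the base class.
def generate_base_class(definition_line)
  class_name, field_text = definition_line.strip.split(':')  # split on ':'
  fields = field_text.split(',')                              # then on ','
  ERB.new(BASE_TEMPLATE).result(binding)  # template sees class_name and fields
end
```

Calling generate_base_class("Person:first,middle,last") returns the text of a PersonBase class; a complete generator would loop over every line of the definition file and write each result to its own .java file.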

4.4.2. Developing the generator

The example process flow shown in figure 4.19 could be used to develop a partial-class generator.

Figure 4.19. A set of design and implementation steps for building a partial-class generator

Let’s examine these steps in detail:

  • Build the base class test code —First, design and build the base class you want the generator to create.
  • Build the derived class test code —Next, build the derived class that will use the base class as a test case. You should also build the definition file that specifies the base class. The generator will take this file as input.
  • Design the generator —After building the input, output, and definition files, you need to spend some time creating a simple design for the generator.
  • Develop the input parser —Your first implementation step is to develop the parser that will read the definition file and store any elements that are required by the generator.
  • Develop the templates from the test code —The next step is to take your original base class code and to turn it into an ERb template for use by the generator.
  • Develop the output code builder —The final step is to merge the input parser with the template invocation code and to write the code that will create the output files.

4.5. The tier generator model

A tier generator builds all of the code for one tier or section of an application. The most common example is the construction of a database access layer tier of a web or client/server application. In this section, we show several forms of a very basic tier generator. You can use the code here as the basis for your tier generator.

4.5.1. Uses and examples

Besides building database access layers, there are a number of possible uses for a tier generator. These include creating:

  • The RPC layer of an application that exports a web services interface
  • The stub code in a variety of different languages for your RPC layer
  • The dialog boxes for a desktop application
  • The stored procedure layer for managing access to your database schema
  • Data export, import, or conversion layers

The input to a tier generator is a definition file from which it gathers all of the information required to build the complete code for the tier. The basic I/O flow is shown in figure 4.20.

Figure 4.20. The input and output flow for a tier generator

The generator takes a definition file, which contains enough information to build all of the classes and functions of the tier. It then uses this information in conjunction with a reservoir of templates to build the output source code. This output code is production ready and waiting for integration into the application.

Our first example of a tier generator builds structure classes from XML. It is very closely related to the partial-class generator example. The only differences are the names of the classes that it generates and the format of the definition file. At the implementation level, the two types of generators are very similar; it’s in the roles of the two types of generators that there is a larger divergence.

The role of the tier generator is to take responsibility for all of the code in the tier. The partial-class generator model opts instead to take partial responsibility for the tier and then lets the engineer create the rest of the derived classes that will complete the functionality for the tier.

This structure generator uses XML to specify the classes and their fields, as opposed to the simple text file in the partial-class generator version. The XML version of the same file used in the original generator is shown here:

        <class name="Person">

Note that although this XML file is much bigger, it has the ability to expand to handle extra information more easily than a simple text file. For example, you could add types to each field by specifying a type attribute on the field tag. To do the same on the text file would require creating an increasingly complex syntax. XML also has the advantage of having editing and validation tools that come in handy as your XML files grow larger.

Listing 4.15 contains the code that implements the XML version of the structure class tier generator.

Listing 4.15. The tier generator: building structure classes from XML

  1. The each_element iterator finds all of the elements matching the specified name and passes it into the code block. Here you use the each_element iterator to get all of the class elements in the XML file.
  2. Next, use the attributes method to get the attributes of the class element, and index the result by name to get the name attribute.
  3. This code uses the same each_element iterator to get the fields from the class element. Inside the code block, you push the field name into the array of fields. Use the text method to get the text of the element and the strip method to remove any leading or trailing white space.
  4. This code creates the ERb template object from the text template. It then runs the template using the result method and passes the variables in the current scope into the template with the binding method.
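The four steps above can be sketched with REXML and ERb. This is not listing 4.15; it is a minimal sketch written from the annotations. The class and field element names follow the text, while the shape of the enclosing document, the shortened template, and the names CLASS_TEMPLATE and generate_classes are assumptions.

```ruby
require 'erb'
require 'rexml/document'

# A shortened stand-in for the full template; it emits only the fields.
CLASS_TEMPLATE = <<~'JAVA'
  public class <%= class_name %> {
  <% fields.each do |field| %>
    protected String _<%= field %>;<% end %>
  }
JAVA

# Returns a hash that maps each class name to its generated Java source.
# Assumes every field element contains text.
def generate_classes(xml_text)
  doc = REXML::Document.new(xml_text)
  output = {}
  doc.root.each_element('class') do |class_elem|        # every class element
    class_name = class_elem.attributes['name']          # the name attribute
    fields = []
    class_elem.each_element('field') { |f| fields << f.text.strip }
    output[class_name] = ERB.new(CLASS_TEMPLATE).result(binding)
  end
  output
end
```

Using doc.root keeps the sketch independent of what the top-level element is called. A full generator would write each hash entry to a file named after the class.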

Listing 4.16 shows the template used by the tier generator to build the data structure classes.

Listing 4.16.
public class <%= class_name %> {
<% fields.each { |field| %>
  protected String _<%= field %>; <% } %>

  public <%= class_name %>()
  { <% fields.each { |field| %>
    _<%= field %> = new String(); <% } %>
  }
<% fields.each { |field| %>
  public String get<%= field.capitalize %>() { return _<%= field %>; }
  public void set<%= field.capitalize %>( String value ) { _<%= field %> = value; }
<% } %>
}

4.5.2. Developing the generator

Developing a tier generator is much like building any of the other generators in this chapter. We show the steps in figure 4.21, but you should follow your own path; these steps are merely a starting point.

Figure 4.21. The design and implementation steps for developing a tier generator

Let’s examine these steps in detail:

  • Build the test code —One of the great things about building a code generator is that you start at the end. The first step is to prototype what you would like to generate by building the output code by hand. In most cases you already have the code you want to generate in hand because it’s already part of the application.
  • Design the generator —Once you have the output, you need to determine how you are going to build a generator that will create the test code as output. The most important job is to specify the data requirements. What is the smallest possible collection of knowledge that you need to build the output class? Once you have the data requirements, you will want to design the input definition file format.
  • Develop the input parser —The first implementation step is to build the definition file parser. In the case of XML, this means writing the REXML access code to read in the XML and to store it in some internal structures that are easy for the templates to handle.
  • Develop the templates from the test code —Once the input is coming into the generator, you must create the templates that will build the output files. The easiest way to do this is to take the test code you started with as the starting point for the template.
  • Develop the output code builder —The final step is to write the glue code that runs the templates on the input specification from the definition file and builds the output files. At the end of the development, you should be able to run the generator on your definition file and have the output match the test code that you created in the beginning of the process.
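The final check, that the generator's output matches the handwritten test code, can be automated. The following is a minimal sketch; the helper name differing_lines is an assumption, and ignoring trailing white space is a design choice rather than a requirement.

```ruby
# Returns the 1-based line numbers at which the generated text and the
# handwritten test code differ, ignoring trailing white space on each line.
def differing_lines(generated, expected)
  gen = generated.lines.map(&:rstrip)
  exp = expected.lines.map(&:rstrip)
  count = [gen.length, exp.length].max
  (0...count).select { |i| gen[i] != exp[i] }.map { |i| i + 1 }
end
```

An empty result means the generator reproduces the test code. Any other result points you at the template lines that still need work.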

4.6. Generating for various languages

Each language creates unique issues for code generation. Each has a unique coding style; in addition to this implicit style, most companies have coding guidelines that enforce additional style requirements. Your generated code should follow this style, just as if you had written the code by hand. This is one of the most important reasons to use templates when building code. Templates allow you the freedom to add white space and position generated code as if you were writing it yourself.

A common complaint about generated code is that the code is not visually appealing. Ugly code conveys a lack of care to the reader and disrespect for the output language and the production code. You should strive to make your templates and their resulting code as clean as if you had written every line yourself. In this section, we explore each of the popular languages and offer advice on using generation techniques and avoiding the common pitfalls.

4.6.1. C

For C, you’ll need to keep these considerations in mind:

  • Only externally visible functions and constants should appear in generated header files.
  • Internal functions and constants generated within the implementation files should be marked as static.
  • Choose and use an inline documentation standard (e.g., Doxygen).
  • If you have a development environment that allows for code collapsing while editing, be sure to develop the code in a way that matches the code-collapsing standard.
  • Set your compiler to the highest warning level and generate code that compiles without warnings.
  • Consider bundling all of the generated code into a single library and then importing that as a component into the application.
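The first two guidelines can be enforced by the generator itself. Here is a minimal sketch, assuming the generator keeps a list of function prototypes flagged as public or internal; the FUNCTIONS data and both helper names are hypothetical.

```ruby
# Hypothetical description of the C functions the generator will emit.
FUNCTIONS = [
  { proto: 'int names_count( void )',           public: true  },
  { proto: 'int names_hash( const char *key )', public: false }
].freeze

# Header file text: prototypes for externally visible functions only.
def header_prototypes(functions)
  functions.select { |f| f[:public] }.map { |f| "#{f[:proto]};\n" }.join
end

# Implementation file text: forward declarations for internal functions,
# marked static so they stay private to the translation unit.
def static_prototypes(functions)
  functions.reject { |f| f[:public] }.map { |f| "static #{f[:proto]};\n" }.join
end
```

Driving both files from one list means a function cannot leak into the header by accident; its visibility is decided in a single place.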

4.6.2. C++

In addition to the guidelines for C generation, you should take into account these factors when generating C++:

  • Consider using C++ namespaces to contain the generated code in a logical grouping.
  • Consider using a pattern where you expose an abstract base class with only public members. This base class is exported in the header. Then, use this abstract class as the basis for an implementation class, which is implemented only in the .cpp file. A factory method can be exported that creates a new instance of the derived class. This will ensure that recompiles will not be necessary when private methods or variables are changed by the generator.

4.6.3. C#

C# appears to have been designed for generation. You should use the inherent language features to your advantage:

  • Use the #region directive to allow for code collapsing in the editor.
  • Consider packaging all of the generated code as an assembly to reduce compile times and to create logical boundaries between system components.
  • Make use of the XML markup capabilities to allow for easy documentation generation.

4.6.4. Java

Here are some suggestions when generating Java:

  • For generator input, use the JavaDoc standard when appropriate and the Doclet API for access to your JavaDoc comments.
  • For generator output, be sure to include JavaDoc markup in your templates to ensure that you can use the JavaDoc documentation generation system.
  • Consider generating interfaces for each of your output classes. Then, generate concrete classes for these interfaces. This will reduce compile and link times as well as loosen the coupling between classes.

4.6.5. Perl

Here are some suggestions when generating Perl:

  • You should always include use strict in the code you generate and run perl with the -w option.
  • Generate POD documentation along with the code.
  • Be sure to use package containment and explicit exports to ensure that your functions don’t bleed into the main namespace.

4.6.6. SQL

Keep these suggestions in mind when generating SQL:

  • You should take the time to mark up the portions of the output SQL that diverge from the SQL standard to use vendor-specific options.
  • You should separate the schema, any stored procedures, and any test data into separate files.
  • When generating SQL within another language (e.g., Java executing SQL using strings and JDBC), be sure to use the argument replacement syntax within the driver. In Java this means using the PreparedStatement class and the question mark replacement syntax.
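For the last point, the generator can emit the question mark placeholders itself rather than splicing values into the SQL text. The following is a minimal sketch; the helper name insert_statement is an assumption, and the table and column names are expected to come from your own definition file rather than from user input.

```ruby
# Builds a parameterized INSERT statement with one '?' per column.
# The resulting string is suitable for embedding in generated code that
# prepares the statement and binds the values at run time.
def insert_statement(table, columns)
  placeholders = (['?'] * columns.length).join(', ')
  "INSERT INTO #{table} ( #{columns.join(', ')} ) VALUES ( #{placeholders} )"
end
```

For a names table with first and last columns, this yields a statement with two placeholders, which the generated Java would pass to a PreparedStatement and fill in with its set methods.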

4.7. Summary

This chapter provided you with a set of starting points that you can use to build your own generators. From this point on, we will be taking these generators and building on them to show how they can be used to solve a number of common software engineering problems. In the next chapter, we’ll begin building generators for specific types of uses, starting with generators for user interface elements such as web forms.