Chapter 2. Code generation basics – Code Generation in Action

2.1 The various forms of active code generation
2.2 Code generation workflow
2.3 Code generation concerns
2.4 Code generation skills
2.5 Choosing a language for your code generator
2.6 Summary

This chapter introduces you to the range of code generation models that are used throughout the book. Code generators are separated into two high-level categories: active and passive. In the passive model, the code generator builds a set of code, which the engineer is then free to edit and alter at will. The passive generator maintains no responsibility for the code either in the short or long term. The “wizards” in integrated development environments (IDEs) are often passive generators.

Active generators maintain responsibility for the code long term by allowing the generator to be run multiple times over the same output. As changes to the code become necessary, team members can input parameters to the generator and run the generator again to update the code. All of the generators shown in this book follow the active generation model.

In this chapter we look at the various types of active code generators so you can select the one that best works for you. We examine the abstract process flow for a generator so you can understand what goes on under the covers. You’ll likely encounter many arguments against code generation, so we talk about what those concerns are and how to evaluate any risk they might present. Finally, we list the various skills you need on your team in order to succeed with code generation.

2.1. The various forms of active code generation

Within active code generation are several types of models that encompass a range of solutions—from the very simple and small to the large and complex. There are many ways to categorize generators. You can differentiate them by their complexity, by usage, or by their output. We chose to distinguish the generators by their input and output, arriving at six discrete types. This works well because the models tend to have vastly different architectures and are used for different problems and in different ways. By defining various models we can describe how these models are built and used, and then concentrate on their application later in the code generation solutions portion of the book (part 2).

In the sections that follow we examine these six models.

2.1.1. Code munging

Munging is slang for twisting and shaping something from one form into another form.

Given some input code, the munger picks out important features and uses them to create one or more output files of varying types. As you can see in figure 2.1, the process flow is simple. The code munger inputs source code files, most likely using regular expressions or simple source parsing, and then uses built-in or external templates to build output files.

Figure 2.1. The input and output flow pattern for a code munging generator

A code munger has many possible uses; you could use it to create documentation or to read constants or function prototypes from a file. In chapter 4, “Building simple generators,” we discuss seven code mungers. In addition, chapters 6 (“Generating documentation”) and 11 (“Generating web services layers”) present case studies based on the code munger model.
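
To make the flow concrete, here is a minimal sketch of a code munger. It is not one of the generators from chapter 4; the C fragment, the PROTOTYPE pattern, and the plain-text report format are all assumptions made for illustration. It scans source text with a regular expression, picks out function prototypes, and formats them with a built-in ERB template.

```ruby
require 'erb'

# Hypothetical input: a fragment of C source to be scanned.
C_SOURCE = <<~SRC
  int add(int a, int b);
  static void log_message(const char *text);
  double area(double radius) { return 3.14159 * radius * radius; }
SRC

# Simple pattern for a prototype: return type, name, argument list, then ';'.
# It is deliberately naive and will not handle every C declaration.
PROTOTYPE = /^\s*((?:\w+[\s*]+)+?)(\w+)\s*\(([^)]*)\)\s*;/

# Built-in template that formats the extracted features as a text report.
DOC_TEMPLATE = ERB.new(<<~'TEMPLATE', trim_mode: '-')
  Functions found:
  <%- functions.each do |f| -%>
    <%= f[:name] %>(<%= f[:args] %>) returns <%= f[:type] %>
  <%- end -%>
TEMPLATE

# Pick out the important features of the input code.
def extract_prototypes(source)
  source.scan(PROTOTYPE).map do |type, name, args|
    { type: type.strip, name: name, args: args.strip }
  end
end

# Munge the source into a documentation listing.
def munge(source)
  DOC_TEMPLATE.result_with_hash(functions: extract_prototypes(source))
end

puts munge(C_SOURCE)
```

Only the two lines ending in a semicolon are reported; the function definition on the third line does not fit the prototype pattern and is skipped.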

2.1.2. The inline-code expander

An inline-code expander takes source code as input and creates production code as output. The input file contains special markup that the expander replaces with production code when it creates the output code. Figure 2.2 illustrates this flow.

Figure 2.2. The input and output flow pattern for an inline-code expander

Inline-code expanders are commonly used to embed SQL into a source code file. Developers annotate the SQL code with distinguishing marks designed to be cues for the expander. The expander reads the code and, where it finds these cues, inserts the code that implements the SQL query or command. The idea is to keep the development code free of the infrastructure required to manage the query.

See chapter 4 for an example of an inline-code expander. Also, the case study in chapter 8, “Embedding SQL with generators,” provides a complete example implementation.
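
As a rough illustration of the idea, the sketch below expands a made-up markup, `<SQL: ...>`, into a call to an assumed production helper named `db_query`. Both the markup and the helper are hypothetical and are not the format used in chapters 4 or 8.

```ruby
# Hypothetical markup: <SQL: ...> marks a query inside otherwise ordinary
# source. A query containing '>' would need a more careful pattern.
SQL_MARKUP = /<SQL:\s*(.*?)\s*>/

# Replace each marked query with the code that would run it.
# db_query is an assumed helper that exists in the production code base.
def expand_inline_sql(source)
  source.gsub(SQL_MARKUP) do
    query = Regexp.last_match(1)
    escaped = query.gsub('"') { '\\"' } # escape quotes for a C string literal
    %(db_query("#{escaped}"))
  end
end

# Development code: free of the infrastructure needed to manage the query.
INPUT = <<~SRC
  void print_names(void) {
    result_t *rows = <SQL: SELECT name FROM users>;
    print_rows(rows);
  }
SRC

# The expander writes its result to a separate output; the input is untouched.
puts expand_inline_sql(INPUT)
```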

2.1.3. Mixed-code generation

A mixed-code generator reads a source code file and then modifies and replaces the file in place. This is different from inline-code expansion because the mixed-code generator puts the output of the generator back into the input file. This type of generator looks for specially formatted comments, and when it finds them, fills the comment area with some new source code required for production. The process is shown in figure 2.3.

Figure 2.3. The input and output flow pattern for a mixed-code generator

Mixed-code generation has a lot of uses. One common use is to build the marshalling code that will move information between dialog controls and their representative variables in a data structure. The comments in the code specify the mapping between data elements and controls, and the mixed-code generator adds an implementation that matches the specification to the comment.

Chapter 4 introduces the mixed-code generator. In addition, chapter 7, “Generating unit tests,” describes a case study that illustrates a practical use of the mixed-code generation model to augment C++ code with unit tests.
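
The following sketch shows the in-place pattern under some assumptions of our own: the `// marshal:` and `// end-marshal` comment format, the `data.` prefix, and the `getText()` call are invented for this example. The generator keeps the comments, regenerates whatever sits between them, and writes the result back over the input file.

```ruby
# Hypothetical comment format: a "// marshal: field=control, ..." line opens a
# region and "// end-marshal" closes it. Everything in between is generated.
REGION = %r{^(?<indent>[ \t]*)// marshal:(?<spec>[^\n]*)\n.*?^[ \t]*// end-marshal$}m

# Rebuild the code between each pair of marker comments from the mapping
# given in the opening comment. Running it twice gives the same result.
def fill_regions(source)
  source.gsub(REGION) do
    match = Regexp.last_match
    indent = match[:indent]
    spec = match[:spec]
    generated = spec.split(',').map do |pair|
      field, control = pair.strip.split('=')
      "#{indent}data.#{field} = #{control}.getText();"
    end
    ["#{indent}// marshal:#{spec}", *generated, "#{indent}// end-marshal"].join("\n")
  end
end

# Read the file, regenerate the marked regions, and replace the file in place.
def regenerate_file(path)
  File.write(path, fill_regions(File.read(path)))
end

SAMPLE = <<~SRC
  void save() {
    // marshal: name=nameField, age=ageField
    // end-marshal
  }
SRC

puts fill_regions(SAMPLE)
```

Because the specification stays in the comment, the region can be regenerated whenever the mapping changes.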

2.1.4. Partial-class generation

A partial-class generator reads an abstract definition file that contains enough information to build a set of classes. Next, it uses templates to build the output base class libraries. These classes are then compiled with classes built by the engineers to complete the production set of classes. Figure 2.4 shows the flow of input to and output from the partial-class generator. Figure 2.5 contains an example of the process.

Figure 2.4. The input and output flow for a partial-class generator, including the compiler and the code that uses the generated base classes

Figure 2.5. A partial-class generator can be used in a web application architecture.

Figure 2.5 illustrates how the output code of a partial class generator fits into a three-tier web server architecture. The data access layer beans are based on two classes. Once the partial-class generator builds the base class, the engineer adds the final touches in a derived class to create the production bean.

A partial-class generator is a good starting point for building a generator that creates an entire tier of code. You can start with building just the base class; then as the generator handles more of the special cases, you can transition to having the generator build all of the code for the tier.

Chapter 4 describes a partial-class generator, and chapter 9, “Handling data,” also uses a partial-class generator model.
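
A small sketch of the split between generated and handwritten code follows. The definition hash stands in for an abstract definition file, and the `Customer` names, the `Base` suffix, and the `to_h` method are assumptions chosen for the example rather than the design used in later chapters.

```ruby
require 'erb'

# Hypothetical abstract definition; in practice this would be read from a file.
DEFINITION = { class_name: 'Customer', fields: %w[id name email] }.freeze

# Template for the generated base class (Ruby output, to keep the sketch small).
BASE_TEMPLATE = ERB.new(<<~'TEMPLATE', trim_mode: '-')
  # GENERATED FILE: do not edit. Change the definition and rerun the generator.
  class <%= class_name %>Base
    attr_accessor <%= fields.map { |f| ":#{f}" }.join(', ') %>

    def to_h
      { <%= fields.map { |f| "#{f}: #{f}" }.join(', ') %> }
    end
  end
TEMPLATE

# Build the source text of the base class from the definition.
def generate_base_class(definition)
  BASE_TEMPLATE.result_with_hash(definition)
end

puts generate_base_class(DEFINITION)

# The engineer then completes the production class by hand, for example:
#
#   class Customer < CustomerBase
#     def display_name
#       "#{name} <#{email}>"
#     end
#   end
```

Only the base class is ever regenerated, so the handwritten derived class survives each run of the generator.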

2.1.5. Tier or layer generation

In this model, the generator takes responsibility for building one complete tier of an n-tier system. The case study in chapter 1 included examples of tier generators.

An example of tier generation is model-driven generation, wherein a UML authoring application is used in conjunction with a generator and an input definition file (often in XML) to output one or more tiers of a system.

As figure 2.6 shows, the input and output flow of a tier generator is the same as with a partial-class generator. The tier generator reads a definition file and then uses templates to build output classes to implement the specifications in the definition file. Figure 2.7 shows a tier generator building the data access layer for a three-tier web application.

Figure 2.6. The input and output flow for a tier generator

Figure 2.7. A tier generator builds a full tier of an n-tier web application.

The big difference between tier and partial-class generation is that in the tier model the generator builds all of the code for a tier. In the partial-class model, the generator builds the base classes but derived classes are still required to finish off the code for the tier.

The primary advantage of partial-class generation is speed of implementation. The most common difficulty in building a tier generator is designing for special cases, but with a partial-class generator, you can build a relatively simple generator and then implement the special cases with custom code. You should think of moving from partial-class generation to tier generation as a migration path. When the requirements and design are loose, you may still be able to develop a partial-class generator; after the problem space is well known—by the second or third release—you can upgrade the partial generator to a tier generator by migrating the special cases and custom code.

In chapter 4 we describe a simple tier generator, and chapters 6 and 9 present complete examples.

2.1.6. Full-domain language

A full-domain language is a Turing-complete language customized to allow engineers to represent the concepts in the domain more easily. A Turing-complete language can express any computation that a general-purpose computer language can; in practice, it supports the variable management, logic, branching, functional, and object decomposition abilities included with today’s programming languages.

Code generation, as the term is used in this book, is about generating large amounts of high-level language production code based on descriptive requirements. The end of the spectrum of descriptive requirements is a Turing complete language. So it follows that the end of the spectrum of code generation is a Turing complete domain-specific language.

It is outside the scope of this book to describe the implementation of a domain-specific language. However, let’s look at the pros and cons of taking this route.

The advantage is that you have a very high-level functional description of the semantics of your solution that can be compiled into almost any high-level language. As for disadvantages, your team is buying into supporting a new language that you must maintain and document. In addition, you must train your colleagues in the use of this language. Also, with a fully functional language it is difficult to generate derived products (e.g., documentation or test cases). With a tier or partial-class model, the generator understands the structure and purpose of the class being built from the abstract definition. With a Turing-complete language, the generator does not understand the semantics of the code at a high level, so automatically generating documentation or test cases becomes far harder.

The case study in chapter 12, “Generating business logic,” presents a limited example of a full-domain language.

An example of a Turing-complete domain-specific language is the math language used by Mathematica, a language that supports matrix math in a simple manner. Using matrix math is difficult in traditional languages such as C, C++, or Java. Having a domain-specific language in a product like Mathematica allows the language users to spend more time concentrating on their problem by enabling them to present their code in familiar domain-specific terms.

2.2. Code generation workflow

As shown in figure 2.8, the classic workflow for developing and debugging code is “edit, compile, and test,” a cycle that moves between those three states. Code generation adds a few new workflow elements, as shown in figure 2.9. The edit, compile, and test cycle still applies to all of the custom code that is either used by or makes use of the generated code.

Figure 2.8. The edit, compile, and test workflow

Figure 2.9. The coding workflow when a code generator is involved

The left-hand side of figure 2.9 shows the generation workflow. First, you edit the templates and definition files (or the generator itself) and then run the generator to create the output files. The output files are then compiled along with the custom code and the application is tested.

If your target language is not a compiled language, then simply disregard the compile phase in the diagrams.

2.2.1. Editing generated code

Unless you are using the mixed-code generation model (in which the generator writes its output back into the input file), you should never edit the output files of the generator directly. Generators completely replace the output files, so any revisions made to the output of a previous generation will be lost. For this reason, we recommend that you bracket the generated implementation files in comments that specifically warn the user not to edit them.

There is one exception to this rule. When you are debugging the code in the templates it is easier to edit the output files, diagnose the problem, and then integrate the fix back into the templates.
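
The bracketing comments recommended above can be added by a small helper just before the generator writes each file. This is only a sketch: the wording of the warning, the `//` comment prefix, and the `bracket_generated` name are illustrative choices, not a convention.

```ruby
# Wrap generated text in warning comments so that readers of the output file
# know it will be overwritten. The comment prefix depends on the target
# language, so it is a parameter here.
def bracket_generated(body, generator:, comment: '//')
  header = "#{comment} GENERATED by #{generator}. DO NOT EDIT: changes will be overwritten."
  footer = "#{comment} END GENERATED CODE"
  [header, body.chomp, footer].join("\n") + "\n"
end

puts bracket_generated("int customer_count(void);\n", generator: 'dal_generator.rb')
```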

2.3. Code generation concerns

No technique is without drawbacks, and code generation is no different. In the following sections, we describe some of the issues you may encounter when proposing, building, and deploying code generators.

Some of these are fear of the unknown, others are an unwillingness to change, and still others are well-founded technical concerns. All of these issues will need to be addressed at some level and at some time during the deployment of a generator. We list them here to help you with some counterpoints to the criticism.

2.3.1. Nobody on my team will accept a generator

Creating a tool that builds in minutes what would take months to write by hand will have a dramatic effect on the development team, particularly if no one on the team has experience with automatic code generation. With a small engineering team or one that has had generator experience, there probably won’t be any reticence toward adopting generation. With larger teams you may experience problems. Some engineers may dig in their heels and swear that they won’t use generators; they may try to influence other team members as well. Here are suggestions for handling these problems:

  • Start small —Take a small section of the code base that requires maintenance and replace it with a generated version. Ideally, this will be a section of code that could become larger over time so that the investment in building the generator has a high payoff. Start with a small section of code so that engineers and management can see how code generation works and understand the benefits for the organization.
  • Integrate with the architecture —When building a new generator, people often suggest changing the architecture of the application in conjunction with building the generator. The idea is to kill two birds with one stone. The first goal is to fix the architecture and the second is to speed up the development process. Both goals are welcome, of course, but coding more quickly and changing to a better architecture are two separate issues. Our recommendation for your first code generation project is to tackle building code in the existing framework. Once you’ve finished the generator and tested and stabilized the output code, you can alter the templates to move to a new architectural model.
  • Solve one problem —The ideal generator can solve a number of problems in the architecture and the implementation of a product, as well as accelerate the schedule. If this is the first generator going into a project, you should concentrate on the schedule issue alone. Taking on several tasks in the first release of a generator can open up too many issues at once and muddle the decision making. You should focus your effort on solving one or two key problems.

2.3.2. Engineers will ignore the “do not edit” comments

One of the most common problems with a generated code base is that engineers will ignore the plentiful warnings in the comments of the file indicating that the code should not be edited.

I can’t stress enough that your first reaction to someone breaking the “do not edit” rule should be sympathy. Engineers break the rule because they want to get something done and either they don’t understand how to run the generator or the generator simply could not do what they needed. In both cases, you need to educate your team and perhaps alter the function of the generator to address their reasons for ignoring the comments. The important thing is not the infraction, but the fact that a team member was using the code and possibly using the generator—which is a great starting point for the deployment of the generator.

If engineers continue to violate the “do not edit” rule, you may need to go back to the drawing board with the generator, the deployment, or both. Engineers who are conscious of the value of the generator and who understand how the generator works should not alter the output by hand and expect that their changes will be maintained.

2.3.3. If the generator can’t build it, the feature will take weeks to build

Suppose you’ve developed a generator for a user interface that builds assembly code from a set of high-level UI definitions. It builds everything: the window drawing code, the message handling, the button drawing—everything, right down to the machine code.

What happens when you need to add a dialog box to your UI that cannot be built with this generator? In the worst-case scenario, you get out your assembly language manual and go for it.

This is an unrealistic and extreme example, but it proves the point that you always want your generator building code that sits on a powerful framework. An example is the Qt cross-platform UI toolkit. Figure 2.10 shows a generator for Qt code.

Figure 2.10. Generating code for the Qt framework

Should a generator replace the functionality of Qt? No. It should generate code for Qt from high-level UI definitions: the same high-level UI definitions that can be used to build the web interfaces that Qt can’t build. There is also nothing wrong with having the generator write code that makes use of your own libraries, which sit on top of powerful frameworks.

As with any software engineering project, you will want to spend the time picking an appropriate and powerful architecture and development environment. Using code generation techniques in combination with the right development tools can significantly improve the quality of your application and reduce its development time.

Figure 2.11 shows a generator that builds to both Qt and a company-specific interface library. Including code to manage the interface can make the generator run more smoothly, as well as make it much easier for an implementer to build the dialog boxes that cannot be created by the generator.

Figure 2.11. Generating code for a combination of Qt and a company-specific interface library

2.4. Code generation skills

In this section, we discuss the specific technical skills you’ll need to build generators. If you are unfamiliar with some of these skills, consider using them in practical code generation projects as an ideal opportunity to learn them.

2.4.1. Using text templates

Generating code programmatically means building complex structured text files en masse. To maintain your sanity and the simplicity of the generator, you should use a text-template tool such as HTML::Mason, JSP, ASP, or ERb. Using a text template means that you can keep the formatting of the code separate from the logic that determines what should be built. This separation between the code definition logic and the code formatting logic is an ideal abstraction, which we maintain in the generators presented in this book as case studies.
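
The separation described above might look like the following with ERB, which ships with Ruby. The struct layout, the type mapping, and the names `struct_fields` and `render_struct` are assumptions for this sketch; the point is only that the template holds the formatting while plain Ruby decides what gets built.

```ruby
require 'erb'

# Formatting lives in the template...
STRUCT_TEMPLATE = ERB.new(<<~'TEMPLATE', trim_mode: '-')
  typedef struct {
  <%- fields.each do |field| -%>
    <%= field[:type] %> <%= field[:name] %>;
  <%- end -%>
  } <%= name %>;
TEMPLATE

# ...while the logic that decides what should be built stays in plain Ruby.
# The mapping from column types to C types is deliberately tiny.
def struct_fields(columns)
  columns.map do |column, sql_type|
    c_type = sql_type == 'integer' ? 'int' : 'char *'
    { name: column, type: c_type }
  end
end

def render_struct(name, columns)
  STRUCT_TEMPLATE.result_with_hash(name: name, fields: struct_fields(columns))
end

puts render_struct('customer_t', { 'id' => 'integer', 'name' => 'varchar' })
```

Changing the layout of the generated code means editing only the template; changing what is generated means editing only the Ruby.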

2.4.2. Writing regular expressions

Regular expressions are a tool for searching and replacing within a block of text. Reading in configuration files or scanning source files is greatly simplified by using regular expressions. Regular expressions allow you to specify a format for text, and then check to see whether the text matches in addition to extracting any information you require from the text.
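
For instance, a single pattern can both check that a configuration line has the expected shape and pull out its parts. The `key = value` format and the `parse_setting` name below are assumptions made for this sketch.

```ruby
# Format: a name, an equals sign, and a value, with optional whitespace.
SETTING = /\A\s*(?<key>[A-Za-z_]\w*)\s*=\s*(?<value>.*?)\s*\z/

# Returns [key, value] when the line matches the format, or nil when it does not.
def parse_setting(line)
  match = SETTING.match(line)
  return nil unless match       # the line does not fit the format

  [match[:key], match[:value]]  # the pieces extracted from the line
end

p parse_setting("output_dir = build/gen\n")
p parse_setting('# just a comment')
```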

If you have always thought of regular expressions as indecipherable line noise, now is the time to learn them. The regular expressions used in this book are fairly simple, and you should be able to learn from the examples in the text through context.

For an immersion course in regular expressions, I recommend the book Mastering Regular Expressions, by Jeffrey Friedl (O’Reilly, 2002). This excellent book gives you a solid grounding in the art of regular expression construction. Many people learn to use regular expressions as invocations of black magic; they use them without knowing precisely why they work. Regular expressions are an invaluable tool and worth your time to learn thoroughly. Once you have mastered them, the idea of writing string parsing code by hand will be forever banished from your mind.

Appendix F shows examples of regular expressions implemented in Ruby, Perl, Python, Java, C#, and C.

2.4.3. Parsing XML

XML is an ideal format for configuration and abstract definition files. In addition, using schema or Document Type Definition (DTD) validation on the XML can save you from building elaborate error handling into the file-reading code.

There are two styles of XML parsers:

  • Streaming —Streaming parsers (e.g., the Simple API for XML, or SAX) send the XML to handlers in the host code as the file is being read. The advantage is that the memory footprint is low and the performance is good, allowing for very large XML files to be read.
  • DOM —DOM parsers read the entire XML stream, apply validation, and then return a set of in-memory objects to the host code. The memory footprint is much larger than a streaming parser, and the performance is significantly lower. The advantage of using a DOM parser is that the host code is much less complex than the corresponding host code for a streaming parser.

The case studies in this book use a DOM parser because the overall complexity of the code is much lower with DOM. Performance is usually not an issue because the code generator is run during development on fast machines.
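
To show how little host code the DOM style needs, here is a sketch that reads a small definition document with REXML. The element and attribute names are invented for the example, and it assumes the rexml library is installed (it is bundled with current Ruby distributions).

```ruby
require 'rexml/document'

# Hypothetical definition file contents.
DEFINITION_XML = <<~XML
  <classes>
    <class name="Customer">
      <field name="id" type="integer"/>
      <field name="name" type="string"/>
    </class>
  </classes>
XML

# Parse the whole document into memory, then walk the resulting objects.
def read_definitions(xml)
  doc = REXML::Document.new(xml)
  doc.elements.to_a('classes/class').map do |klass|
    fields = klass.elements.to_a('field').map do |field|
      { name: field.attributes['name'], type: field.attributes['type'] }
    end
    { name: klass.attributes['name'], fields: fields }
  end
end

p read_definitions(DEFINITION_XML)
```

A streaming parser would need handler methods and explicit state to track which class each field belongs to; here the nesting of the in-memory objects carries that information.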

2.4.4. File and directory handling

Code generators do a lot of file reading and writing as well as directory handling. If you have had experience only with the Win32 API or the standard C library for file and directory handling, you may be rolling your eyes at the concept of lots of directory access. Now is an ideal time to learn the joy of simple and portable file and directory handling available in Perl, Python, or Ruby.

The most common scripting languages have built-in, easy-to-use APIs for directory construction, directory traversal, and pathname munging, which work on any operating system without code alteration or special cases. The same is also true of file I/O, which is implemented in the core libraries of these languages.
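
In Ruby, for example, the standard library covers most of what a generator needs. The helper names and file names below are hypothetical; the calls themselves (`File.join`, `FileUtils.mkdir_p`, `Dir.glob`, `Dir.mktmpdir`) are standard.

```ruby
require 'fileutils'
require 'tmpdir'

# Write generated files under out_dir, creating directories as needed.
# files maps a relative path to the text that belongs in that file.
def write_outputs(out_dir, files)
  files.map do |relative_path, content|
    path = File.join(out_dir, relative_path)
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, content)
    path
  end
end

# Find every file with the given extension anywhere below in_dir.
def find_inputs(in_dir, extension)
  Dir.glob(File.join(in_dir, '**', "*#{extension}")).sort
end

# Try it in a scratch directory that is removed when the block ends.
Dir.mktmpdir do |dir|
  write_outputs(dir, { 'dal/customer_base.rb' => "# generated\n", 'README.txt' => "notes\n" })
  puts find_inputs(dir, '.rb')
end
```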

2.4.5. Command-line handling

Code generators are usually command-line based. They are invoked either directly, as part of a check-in or build process, or from an IDE. The example code generators shown in this book are run off the command line.
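
A command-line front end for a generator can be quite small. The sketch below uses Ruby's standard OptionParser; the option names, the defaults, and the `generate.rb` script name are assumptions for illustration.

```ruby
require 'optparse'

# Parse the command line into an options hash and a list of input files.
def parse_arguments(argv)
  options = { output: 'gen', verbose: false }
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: generate.rb [options] DEFINITION_FILE...'
    opts.on('-o', '--output DIR', 'Directory for generated files') do |dir|
      options[:output] = dir
    end
    opts.on('-v', '--verbose', 'Print each file as it is written') do
      options[:verbose] = true
    end
  end
  inputs = parser.parse(argv) # returns the arguments that are not options
  [options, inputs]
end

options, inputs = parse_arguments(%w[-v -o build definitions.xml])
puts "Writing to #{options[:output]} from #{inputs.join(', ')}"
```

Keeping the parsing in one function makes the same generator easy to call from a build script or an IDE hook.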

On Mac OS X and other Unix derivatives (e.g., Linux or FreeBSD), the command line is accessible through terminal emulators. On Windows you may want to investigate Cygwin, which provides a Unix-like environment that runs on top of Windows.

2.5. Choosing a language for your code generator

One of the advantages of separating the code writing from the production code itself is that you don’t have to use the same language for both tasks. In fact, it is often a good idea to use a different language for each task if only to distinguish the generator code from the code being generated.

Code generators are text processing-intensive applications. A generator reads a number of text files (such as configuration or template files), uses these files to create the output code, and stores the output code in production files.

So what makes a good language for building a generator?

  • The ability to easily read, parse, and search text files is critical.
  • Text templating is fundamental to code generation, which means support for easy-to-use and powerful text-template tools is important.
  • Native handling for XML would be handy, though support through an external library is acceptable. Today a number of high-quality XML authoring and validation tools are available. For that reason alone, XML is a good choice for a configuration file format. The ability to read and manipulate XML data structures is therefore essential.
  • Easy and portable file and directory maintenance and searching is necessary; it will be the job of the code generator to find and read input files and manage the output files for the generated code.

On the other hand, some language factors are not important for code generation:

  • Language performance is not a high priority because the code generator itself is not in production. It is the generated code that must match the performance metrics of the project.
  • Memory space efficiency is also of little concern because the generator is run off-line during development.

On the whole, you need a language that is efficient to develop and easy to maintain, and that supports the range of programming models—from simple scripts to advanced object-oriented designs.

2.5.1. Comparing code generator languages

Table 2.1 shows the pros and cons of using various computer languages for code generation. The differences in each of these languages make them good for some applications and not so good for others.

Table 2.1. Using computer languages for code generation

Language: C, C++
Pros:
  • If the generator is shipped to customers, they will not be able to read or alter the source.
Cons:
  • The languages are not well suited to text parsing tasks.
  • Strong typing is not ideal for text processing applications.
  • The exceptional execution speed of C is not that important.
  • Directory I/O is not portable.
  • XML and regular expression tools are difficult to use.

Language: Java
Pros:
  • If the generator is shipped to customers, they will not be able to read or alter the source.
  • The code is portable to a number of different operating systems.
  • XML tools are available.
  • Text-template tools are available.
Cons:
  • Java is not ideal for text parsing.
  • Strong typing is not ideal for text processing applications.
  • The implementation overhead is large for small generators.

Language: Perl, Ruby, Python
Pros:
  • The code is portable to a number of different operating systems.
  • The languages scale from small to large generators.
  • Text parsing is easy.
  • You can use built-in regular expressions.
  • The XML APIs are easy to use.
  • Text-template tools are available.
Cons:
  • Other engineers will need to learn how to create and maintain code in these languages.

2.5.2. Using Ruby to write code generators

I have chosen Ruby as the programming language for the example code generators in this book for the following reasons:

  • Ruby has built-in support for regular expressions. Regular expressions are critical for parsing, searching, and modifying text.
  • Ruby has portable file and directory I/O constructs.
  • Ruby supports several robust text-template tools, including ERb and ERuby.
  • Ruby is easy to read and understand.
  • Ruby programmers have access to two superb XML libraries in rexml and ruby-libxml.
  • Ruby supports the full range of imperative programming models from one-line command-line invocations, to simple scripts, to functional decomposition, to full object-oriented designs.
  • Ruby can pack a lot of functionality into a small amount of code. This keeps the code samples in the book shorter.

Ruby was designed and implemented by Yukihiro Matsumoto and released to the public in 1995. As he put it in his introduction to the book Programming Ruby, “I wanted a language more powerful than Perl, and more object-oriented than Python.” What he developed was a language that is graceful, simple, consistent, and so fun and addictive to write in that you may find yourself wanting to use it in all of your projects.

Appendix A introduces Ruby and gives you a foundation for understanding all the examples in this book and extrapolating that information to whatever code-generation language you choose. You may even find that you feel comfortable enough with Ruby to make it your implementation language.

2.6. Summary

Code generation comes in a few standard forms, and you need to be familiar with each of them and how they work in order to select the best one for your project. Although the decision can be difficult, you will usually find that one method is clearly superior to the others given your architecture and goals.

We have also addressed some of the cultural issues around generation so that you can better empathize with the engineering customers of your generator and address their concerns properly. A tool that goes unused or neglected by the people it was meant to serve provides a service to no one, so special care must be given to the successful deployment of the tool.

This chapter provided some insights into the kinds of skills you need to write generators. In the next chapter, we examine the tools you can use to build generators. In addition, we introduce some simple generators that you can use as templates for your own work, as well as generators that fit vertical needs such as user interfaces, test cases, web services, and database access.