Chapter 9. Handling data – Code Generation in Action

9.1 Code generation versus a library
9.2 A case study: a simple CSV-reading example
9.3 Technique: data adapters
9.4 Technique: binary file reader/writer
9.5 Finding tools to do it for you
9.6 Summary

Code that reads and writes data formats can be created by code generation techniques. The up-front benefit with a generator is that you can feel confident that you won’t get the keyboard entry errors that turn a long into an int and take two days to find. But the real value of using code generation for data handling is in the use of an abstract file definition.

From this abstract definition, you can generate code for the file format in a variety of languages and generate conversion routines, as well as documentation or test data. For example, given a single abstract definition for a file format, we could create libraries that read and write the format for C, Java, and Perl. Another example is generating data adapters: code that converts data from one format to another.

In this chapter, we show you several generation strategies based on this philosophy of using an abstract definition for your file formats.

9.1. Code generation versus a library

File format reading is usually done with a library, so why use code generation? Let’s start by looking at the library approach. There are two strategies for building a library to support a file format. The first is simply to write the library with the text or binary structures implicit in the code. This is by far the most common technique.

With the second technique, you have the library read in a definition of the file format and then adjust its behavior to read the incoming stream. This library has to be pretty complex to handle a range of file formats. It also locks up a lot of design in the library itself that can’t be ported to other languages.

Most of the logic required to generate a file reader is available in a variety of languages. File I/O is fairly similar across the common languages. The base data types and the file I/O handling classes or routines change names, but the idea is pretty much the same.

Using a generator, you can create an abstract definition of your file format and generate implementations of that format in several languages. No single-language implementation technique offers the same reward.

In the case study portion of this chapter, we’ll build a generator that builds comma-separated value (CSV) readers. In addition, we cover generating data translators and binary file reader/writers.

9.2. A case study: a simple CSV-reading example

Our case study generator builds a file reader that reads a list of names and ages from a CSV file. It’s a simple example, but it provides a context for additional functionality as your requirements for data import, export, and translation expand. Let’s start with an example of our file format:

"Jack","D","Herrington",34
"Lauren","C","Herrington",32
"Megan","M","Herrington,",1

As you can see, this file contains four fields (first name, middle name, last name, and age) separated by commas. You can use quotes in strings to avoid confusion when you have commas inside the data.

If you were to write an XML document that would describe your comma-separated format, it would look like this:

Name is the one entity defined in the file. The Name entity includes four fields: first_name, middle_initial, last_name, and age. Each field is defined with its type and its user name (the human-readable label for the field). To keep it simple, let’s define the type using Java types.
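The definition listing itself is not reproduced here, so the following is only a sketch of what such a definition and its parsing might look like. The element and attribute names (entities, entity, field, name, type, user-name) and the choice of Integer for the age field are assumptions made for illustration, not the actual schema. The sketch reads the definition with REXML and stores each field in a small Struct.

```ruby
require 'rexml/document'

# Hypothetical definition file. The element and attribute names are
# assumptions for illustration only.
DEFINITION = <<~XML
  <entities>
    <entity name="Name">
      <field name="first_name" type="String" user-name="First Name" />
      <field name="middle_initial" type="String" user-name="Middle Initial" />
      <field name="last_name" type="String" user-name="Last Name" />
      <field name="age" type="Integer" user-name="Age" />
    </entity>
  </entities>
XML

# One record per field in the definition.
Field = Struct.new(:java_name, :java_type, :user_name)

# Returns a hash that maps each entity name to its array of Field objects.
def read_definition(xml_text)
  doc = REXML::Document.new(xml_text)
  entities = {}
  doc.elements.each('entities/entity') do |entity|
    fields = []
    entity.elements.each('field') do |f|
      fields << Field.new(f.attributes['name'],
                          f.attributes['type'],
                          f.attributes['user-name'])
    end
    entities[entity.attributes['name']] = fields
  end
  entities
end

entities = read_definition(DEFINITION)
entities.each do |name, fields|
  puts "#{name}: " + fields.map { |f| "#{f.java_type} #{f.java_name}" }.join(', ')
end
```

Keeping the parsed definition in plain structs means the templates never touch the XML directly, so the definition syntax can change without any template edits.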

In the next section, we’ll look at what you can expect this generator to do for you.

9.2.1. Roles of the generator

Before we begin, let’s list what the generator will do:

  • Generate the code that will handle reading the CSV data and storing its contents in memory.
  • Check errors while reading the data.
  • Produce the technical documentation for any classes that are created.

The generator does not need to:

  • Document the structure of the file.
  • Write data in the output file format.
  • Handle cross-field validation.
  • Validate the entire data set.
  • Validate the data set against any other internal system resource.

Now you can address the architecture of the generator that will fulfill these requirements.

9.2.2. Laying out the generator architecture

We chose to use a partial-class generator model for the CSV reader generator because we want to build classes that are designed to be extended from an abstract model of the file format. The generator takes the definition of the file format and builds a Java file that reads that format. It uses one template for the Java file and one template for each of the different field types. The output of the templates is written into the Java file to complete the generation cycle. Figure 9.1 shows the block architecture of the generator.

Figure 9.1. A generator that builds Java classes to read CSV files

Next let’s look at the steps this generator will go through as it executes.

9.2.3. Processing flow

CSV reader generator

Here are the process steps used by the generator:

  • Reads in the file definition XML and stores it locally.
  • Follows these steps for each entity:
    • Invokes the Reader template with the entity name, the class name, and the fields. The Reader template uses the type templates to build the basic processing steps for each field.
    • Stores the output of the Reader template in the correct Java file.

Now you’re ready to build the code for your generator.

9.2.4. Building the code for the CSV reader generator

The generator uses REXML to read the description of the CSV format into memory. It then uses the ERb text-template system to build the Java classes. Listing 9.1 shows the code for the CSV reader generator.

Listing 9.1. csvgen.rb

  1. The Struct Ruby class builds a new class with the specified fields. The fields are both readable and writable using the . syntax. This use of Struct creates a new class called Field that has the member variables java_name, java_type, and user_name.
  2. This section parses the XML into a class name, an entity name, and an array of fields where each field is specified using a Field object.
  3. The call to run_template builds the Java file using Reader.java.template. (This template is described in listing 9.2.)
  4. This code creates the Java file with the result of the template.
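As a rough illustration of steps 3 and 4, here is a minimal sketch of how a run_template helper could be built on ERB. The helper body, the tiny stand-in template, and the temporary output directory are all assumptions; the real Reader.java.template is far larger.

```ruby
require 'erb'
require 'tmpdir'

Field = Struct.new(:java_name, :java_type, :user_name)

# Hypothetical stand-in for Reader.java.template. It emits only the
# record class, not the reading logic.
READER_TEMPLATE = <<~'TEMPLATE'
  public class <%= class_name %> {
    public class <%= entity_name %> {
  <% fields.each do |field| -%>
      public <%= field.java_type %> <%= field.java_name %>; // <%= field.user_name %>
  <% end -%>
    }
  }
TEMPLATE

# Assumed shape of the helper: render a template with named values.
def run_template(template_text, values)
  ERB.new(template_text, trim_mode: '-').result_with_hash(values)
end

fields = [
  Field.new('first_name', 'String', 'First Name'),
  Field.new('age', 'Integer', 'Age')
]

java_source = run_template(READER_TEMPLATE,
                           { class_name: 'NameReader',
                             entity_name: 'Name',
                             fields: fields })

# Write the result to a Java file named after the class.
written = Dir.mktmpdir do |dir|
  path = File.join(dir, 'NameReader.java')
  File.write(path, java_source)
  File.read(path)
end

puts java_source
```

Passing the values as a hash keeps the template's inputs explicit, which makes it easier to see what a template depends on when you add a second output language.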

Next you build the template (listing 9.2) for the Reader Java class, which is used by the generator to build a CSV reader class. The template makes extensive use of iterators to create data members and handlers for all the fields.

Listing 9.2. Reader.java.template

  1. This portion of the template builds the entity class that will represent one record from the data.
  2. The _in member variable is the input stream from which the class reads the CSV input.
  3. The _data member variable is an ArrayList that holds references to all of the entity classes created when you read the file.
  4. This code builds the constructor for this class.
  5. read is the main entry point for the class. It reads the file and parses the data into the data array.
  6. process_line handles processing one line of the CSV file. The method uses a state machine to build each field into an array of strings.
  7. process_fields takes the strings from process_line and puts them one by one into the fields of the new entity object. The generator creates process methods for all of the fields, and they populate each field stored in memory with data of the correct type. You can override these in your derived class to create custom field processing.
  8. This section of the template builds the process methods for each field. The template uses run_template to invoke the correct template for the type of the field.
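The state machine described in step 6 can be sketched in a few lines. This version is written in Ruby rather than generated Java, and it is an assumption about the general approach, not a transcription of the listing. It uses two states, inside and outside a quoted string, and does not handle escaped quotes within a field.

```ruby
# Split one CSV line into an array of strings.
# States: :outside (not in quotes) and :inside (within a quoted string).
def process_line(line)
  fields = []
  current = String.new
  state = :outside
  line.chomp.each_char do |ch|
    if state == :inside
      if ch == '"'
        state = :outside      # closing quote
      else
        current << ch         # commas are ordinary data inside quotes
      end
    elsif ch == '"'
      state = :inside         # opening quote
    elsif ch == ','
      fields << current       # a separator ends the current field
      current = String.new
    else
      current << ch
    end
  end
  fields << current           # the last field has no trailing comma
end

sample = [
  '"Jack","D","Herrington",34',
  '"Megan","M","Herrington,",1'
]
sample.each { |line| p process_line(line) }
```

Note that the comma inside the quoted last name on the second line stays part of the field, which is exactly the case the quotes exist to handle.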

Each data type has its own template, which builds the return statement of the process method for fields of that type. This is the template used to build the contents of the process method for an integer field:

return Integer.valueOf( <%= field.java_name %> );

The template for a string type is shown here:

return <%= field.java_name %>;
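One way the generator might select the template for a field is a simple lookup keyed by type name. In the sketch below the lookup table, the process_body name, and the use of inline strings instead of template files are assumptions for illustration.

```ruby
require 'erb'

Field = Struct.new(:java_name, :java_type, :user_name)

# Hypothetical per-type templates, keyed by Java type name.
# Integer.valueOf parses its string argument; Integer.getInteger would
# instead look up a system property with that name.
TYPE_TEMPLATES = {
  'String'  => 'return <%= field.java_name %>;',
  'Integer' => 'return Integer.valueOf( <%= field.java_name %> );'
}.freeze

# Build the body of the process method for one field.
def process_body(field)
  template = TYPE_TEMPLATES.fetch(field.java_type) do
    raise ArgumentError, "no template for type #{field.java_type}"
  end
  ERB.new(template).result_with_hash(field: field)
end

puts process_body(Field.new('last_name', 'String', 'Last Name'))
puts process_body(Field.new('age', 'Integer', 'Age'))
```

Failing loudly on an unknown type is a deliberate choice here: a definition file with a misspelled type should stop the generator rather than produce a class that does not compile.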

Next you create the Java class. It’s a class called NameReader—the Name portion comes from the XML definition. Within this class is another class called Name that represents a single line of CSV information. The iterator in the template builds the fields within the Name class. The Java class definition then continues on into the heart of the reader, where you define the constructor, the I/O mechanism, and the individual handlers for each field. These handlers are built using iterators that cover each field in the definition XML. Listing 9.3 shows the generated Java class for your CSV format.

Listing 9.3. Reader.java.template output

  1. The Name class stores the data for each name in the CSV file.
  2. The NameReader constructor takes an InputStream, from which it will read the CSV stream.
  3. The get method returns the Name object at the specified index.
  4. The read method processes the data from the InputStream and stores it in the _data array.
  5. Each field has a process method, which you can override in a derived class to create any custom processing you may require.

The MyReader class derives from NameReader and could implement special processing for any of the fields. This example does not, however, make any changes to the field processing behaviors:

import java.io.*;
// NameReader is in the same (default) package, so it needs no import.

class MyReader extends NameReader
{
  public MyReader( FileInputStream in )
  {
    super( in );
  }
}

Here’s the test code to use MyReader to read the Name entities from the file:

9.2.5. Performing system tests

You can test the generator by using the simple system test framework (described in appendix B). The framework verifies that the output of the generator matches the known goods, which are the outputs you stored from an earlier run after finding them acceptable.

Here is the input definition file for the system test framework:

<ut kgdir="kg">
  <test cmd="ruby -I../lp csvgen.rb examples/classes.xml" out="examples/Reader.java" />
</ut>
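The known-good comparison itself is simple. The sketch below is not the appendix B framework; the check_known_good name and its return values are assumptions that only illustrate the idea of storing an accepted output once and comparing later runs against it.

```ruby
require 'tmpdir'
require 'fileutils'

# Compare a generated file against its stored known good.
# Returns :stored on the first run, then :pass or :fail.
def check_known_good(output_path, kg_dir)
  known_good = File.join(kg_dir, File.basename(output_path))
  unless File.exist?(known_good)
    FileUtils.mkdir_p(kg_dir)
    FileUtils.cp(output_path, known_good)
    return :stored
  end
  File.read(output_path) == File.read(known_good) ? :pass : :fail
end

results = Dir.mktmpdir do |dir|
  out = File.join(dir, 'Reader.java')
  kg_dir = File.join(dir, 'kg')

  File.write(out, "class NameReader {}\n")
  first = check_known_good(out, kg_dir)    # no known good yet
  second = check_known_good(out, kg_dir)   # output unchanged

  File.write(out, "class NameReader { int x; }\n")
  third = check_known_good(out, kg_dir)    # output changed

  [first, second, third]
end

p results
```

A failure does not always mean a bug. When a template change is intentional, you review the new output and replace the stored known good.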

In the next section, we look at a different kind of generator that builds code to translate existing data into a predefined format for data import.

9.3. Technique: data adapters

Another common use for generation is to build adapters that read custom formats and convert the data into a form required by the target system. This is a data adapter technique because the adapter takes the data from one source and, using a set of filters and processing, adapts that data to a form suitable for a target system.

The ideal architecture has a data adapter framework into which each customized data adapter fits. Each adapter is suited to a particular input format. The framework defines the format that the adapter must match in order to bring data into the system. Let’s take a look at the specific tasks we expect this generator to do.

9.3.1. Roles of the generator

Start with a set of responsibilities for the generator:

  • Build the data adapter that converts the input into the form required for insertion into the database.
  • Create code that embeds documentation that will be used later to generate technical documentation.

To clarify the role of the generator, you must also define what the generator will not do. In this case, our generator will not be responsible for building design documentation for the adapter or creating documentation about the input format.

With this information about the role and responsibilities of the generator clearly understood, let’s examine the architecture of the generator.

9.3.2. Laying out the generator architecture

For the data adapter generator, we chose a tier-generation pattern because we want to build entire data adapters from abstract models. The generator reads the description of the data and builds the data adapter using a set of templates. You can write custom code to add fields or alter the interpretation of fields from the incoming data. The flow of input and output is shown in figure 9.2.

Figure 9.2. A generator that builds data adapters

When you build this generator, you already know the data format requirements of the target system. The only unknown is the incoming file format. This greatly simplifies the design of the generator as compared to a generator where both the input and output formats are unknown.

9.3.3. Processing flow

Data adapter generator

This generator is very application specific. This means that customizing it for your situation may not be simple. The general processing steps are as follows:

  • Reads in and stores the data description.
  • Reads in the custom code and merges it into the data description.
  • Creates a buffer to hold the code for all of the fields.
  • Follows these steps for each field:

    • If the field is processed without customizations, uses the standard field template to build the field code.
    • If the field has customized processing, uses the custom field template and gives it the custom code and the field definition.
  • Uses the adapter template to build the adapter with the field code buffer.
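The steps above can be sketched in miniature. Everything specific in this example is an assumption: the field description format, the custom code table, the three templates, and the choice to generate a Ruby adapter. The generated source is evaluated at the end only to show that it runs.

```ruby
require 'erb'

# Hypothetical data description: target field name and source column.
fields = [
  { name: 'first_name', column: 0 },
  { name: 'last_name',  column: 2 },
  { name: 'age',        column: 3 }
]

# Hypothetical custom code, keyed by field name, merged into the description.
custom_code = { 'age' => 'Integer(raw)' }
fields.each { |f| f[:custom] = custom_code[f[:name]] }

STANDARD_FIELD = "  record['<%= f[:name] %>'] = row[<%= f[:column] %>]\n"
CUSTOM_FIELD = "  raw = row[<%= f[:column] %>]\n" \
               "  record['<%= f[:name] %>'] = <%= f[:custom] %>\n"
ADAPTER = <<~'TEMPLATE'
  def adapt(row)
    record = {}
  <%= field_code -%>
    record
  end
TEMPLATE

def render(template, values)
  ERB.new(template, trim_mode: '-').result_with_hash(values)
end

# Buffer holding the code for all fields: one template run per field.
field_code = fields.map do |f|
  render(f[:custom] ? CUSTOM_FIELD : STANDARD_FIELD, { f: f })
end.join

# The adapter template wraps the field code buffer.
adapter_source = render(ADAPTER, { field_code: field_code })
puts adapter_source

# Evaluated here only to demonstrate that the generated adapter works.
eval(adapter_source)
p adapt(%w[Jack D Herrington 34])
```

Keeping custom code in its own table, separate from the data description, lets you regenerate the adapter when the description changes without losing the hand-written pieces.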

Finally, let’s make one last data-handling generator; this one builds reader/writer code for binary files.

9.4. Technique: binary file reader/writer

Proprietary binary file formats are excellent for reading and writing efficiency, but they lock customers into a file format that they cannot read without dedicated software. Using a generator to build your binary file reader/writer code has these advantages:

  • The file definition is abstracted so that anyone can write a program that can read and parse the definition file.
  • The generator can build support for the file format in a variety of different languages.
  • Documentation for the file format can be generated automatically.
  • The generator can be reused to build support for any number of binary file formats.

Now let’s look at some of the specifications for this type of generator.

9.4.1. Roles of the generator

First, let’s specify what the generator will and will not do. Our generator will be responsible for:

  • Creating the read and write code for a given file format in one or more languages.
  • Using the embedded documentation standard for the target language as you build the code for the file format handler.

The generator will not be responsible for:

  • Building unit test cases for the code.
  • Building design documentation for the test cases.
  • Building any application logic around the file-format handler.
  • Validating the input file beyond basic format, size, and string validation.
  • Performing cross-field validation of the input file as it is read.

With these responsibilities clearly defined, you can now specify the architecture of the generator.

9.4.2. Laying out the generator architecture

The generator takes the binary file format as input and uses a set of templates—one or more per language—to develop the output reader/writer code. Figure 9.3 shows the tier-generator architecture for the binary file reader/writer generator.

Figure 9.3. A generator that can build binary file I/O code for multiple output languages—in this case, C and Python

9.4.3. Implementation recommendations

Here are some tips that should help you when developing this binary file reader/writer generator:

  • This generator should not allow for custom code. Leave all of the language-specific elements of the library implementation to the templates. This ensures portability between languages.
  • The generator needs to ensure that all of the default values for the fields (e.g., the type and the default value) are set within the generator itself before the templates are invoked. If the generator handles the defaults, you will not have a problem with the templates handling the interpretation of the default values in different ways.
  • When you generate for multiple languages, you will have to create a set of constants in the generator, one for each machine type. The generator leaves the mapping from machine types to language types up to the templates.
  • Some languages have an implicit byte order for their machine types. Java’s DataInputStream and DataOutputStream, for example, always read and write multibyte values in big-endian order. You need to account for this in the design of your file format. Ruby and Perl allow for both endian styles when reading binary files.
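To make the byte-order point concrete, here is a short Ruby example that uses the standard pack and unpack directives: N for a 32-bit unsigned big-endian value and V for the little-endian equivalent. The value itself is arbitrary.

```ruby
value = 0x01020304

big    = [value].pack('N')   # 32-bit unsigned, big-endian (network order)
little = [value].pack('V')   # 32-bit unsigned, little-endian

# Show the bytes in the order they would appear in the file.
puts big.bytes.map { |b| format('%02x', b) }.join(' ')
puts little.bytes.map { |b| format('%02x', b) }.join(' ')

# Reading must use the same byte order that was used for writing.
right_order = big.unpack1('N')
wrong_order = big.unpack1('V')
puts right_order == value
puts wrong_order == value
```

Because the same four bytes decode to different numbers under the two directives, the byte order belongs in the abstract file definition, not in any one language's template.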

9.4.4. Processing flow

Binary file reader/writer generator

Here is the basic series of steps that you can use as a starting point for your generator:

  • Reads in the binary file format.
  • Normalizes all of the defaults so that the templates get fully defined structures for every field.
  • Runs the template for each language and stores the result.
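These three steps can be sketched as follows. The format definition, the default values, the machine type constants, and both templates are assumptions for illustration. The templates emit only a C struct and a Python struct format string, not complete reader/writer code.

```ruby
require 'erb'

# Step 1 (stand-in): a hypothetical abstract format already read into memory.
DEFINITION = {
  name: 'header',
  fields: [
    { name: 'magic',   type: :uint32 },
    { name: 'version', type: :uint16, default: 1 },
    { name: 'flags' }                       # type and default both omitted
  ]
}.freeze

DEFAULTS = { type: :uint32, default: 0 }.freeze

# Step 2: fill in every missing attribute so templates never have to guess.
def normalize(definition)
  definition.merge(fields: definition[:fields].map { |f| DEFAULTS.merge(f) })
end

C_TEMPLATE = <<~'TEMPLATE'
  struct <%= name %> {
  <% fields.each do |f| -%>
    <%= types[f[:type]] %> <%= f[:name] %>; /* default <%= f[:default] %> */
  <% end -%>
  };
TEMPLATE

PYTHON_TEMPLATE = <<~'TEMPLATE'
  <%= name.upcase %>_FORMAT = '><%= fields.map { |f| types[f[:type]] }.join %>'
  <%= name.upcase %>_FIELDS = [<%= fields.map { |f| "'" + f[:name] + "'" }.join(', ') %>]
TEMPLATE

# Each language owns its mapping from machine type to language type.
LANGUAGES = {
  'c'      => { types: { uint16: 'uint16_t', uint32: 'uint32_t' }, template: C_TEMPLATE },
  'python' => { types: { uint16: 'H', uint32: 'I' }, template: PYTHON_TEMPLATE }
}.freeze

# Step 3: run the template for each language and keep the result.
normalized = normalize(DEFINITION)
outputs = LANGUAGES.transform_values do |spec|
  values = normalized.merge(types: spec[:types])
  ERB.new(spec[:template], trim_mode: '-').result_with_hash(values)
end

outputs.each { |lang, source| puts "--- #{lang} ---", source }
```

Because the defaults are applied once, before any template runs, the C and Python outputs cannot disagree about the type or default of the flags field.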

9.5. Finding tools to do it for you

The field of binary file format reader/writer generators was limited at the time of this writing. The one we could find was Flavor, which builds C++ and Java file readers for binary audio formats from an XML description (http://flavor.sourceforge.net). A handy web site for finding the format of binary files is the Wotsit archive of file formats (www.wotsit.org).

9.6. Summary

Handling file formats and data conversion presents an ideal opportunity to use code generation, not only to develop the file support quickly but also to abstract the file format so that a number of different outputs can be created. This includes support for different languages, documentation, and test code.

This chapter has covered many of the key data-handling tasks with generation approaches. These tasks included reading text (CSV) files, reading and writing binary files, and performing data-format conversion for import or export.

In the next chapter, we’ll look at generators that create database access layers.