Navigate data with the Mapper framework

Build your own data mapping system with an interlingual approach

Most developers, at some point, have written software to move (and/or manipulate) data between two different data sources. Usually, the software that tackles this job is custom code specific to the data entities involved and the data itself. Adding fuel to the fire, good data mapping software is typically very expensive for organizations with tight IT budgets, especially in today's market. The Mapper framework offers a simple and inexpensive (free) way for you to read from one data entity and write to another with minimal coding and maintenance.

In this article, I first explain the system's overall design and then demonstrate how the framework operates by mapping between a file and a database table. Using this example as a template, you'll be able to add other entities as your own specific requirements dictate and map data between them as easily as editing a few XML lines.

The framework

In Chapter 8.2 of the online book Survey of the State of the Art in Human Language Technology, Martin Kay explains that one algorithm for Machine Translation (MT), which translates text from one natural language (like English) to another, works by parsing the source text into a standard semantic form using the source language's grammar rules. It then applies the target language's grammar rules to the standard form to yield the desired translation. Of course, this is an oversimplification of how MT actually works, but this straightforward process, called the interlingual approach in MT, is the basis for the Mapper framework.

In contrast to the interlingual approach, another algorithm called the transfer approach translates texts with a separate translation module for each source-target language pair. Given n languages in a system, n(n-1) translation mechanisms are needed to translate each language to every other language; with five languages, for example, that means 20 transfer modules. The interlingual approach significantly reduces this complexity, requiring only 2n translation mechanisms (one into and one out of the common form per language) -- just 10 for those same five languages. The Mapper framework applies the interlingual approach to mapping data entities in a system, thereby making the system maintainable and easily extendable. The following figure illustrates this approach.

The Mapper framework's interlingual approach

The framework's semantic representation of data, for simplicity's sake, is a HashMap called MapperRecord (of course, you can use XML as an alternative representation). In addition, the Entity interface represents each data entity:

public class MapperRecord extends java.util.HashMap {
}
public interface Entity {
   public static int READ = 0;
   public static int WRITE = 1;
   public void open() throws MapperException; //Open the entity for reading or writing
   public void close() throws MapperException; //Close all reading and writing resources
   public MapperRecord readRecord() throws MapperException; //Read/translate record from the
      //data entity into a MapperRecord
   public void writeRecord(MapperRecord record) throws MapperException; //Write MapperRecord to the data entity
}
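To make the contract concrete, here's a minimal in-memory implementation sketch. The ListEntity class and the slimmed-down stand-ins for Entity and MapperRecord are my own (the real framework's methods throw MapperException and the alias would be looked up in XML); the point is only to show the read-until-null contract the framework relies on:

```java
import java.util.*;

//Minimal stand-ins for the framework types so this sketch compiles on its own
class MapperRecord extends HashMap {
}

interface Entity {
    int READ = 0;
    int WRITE = 1;
    void open();
    void close();
    MapperRecord readRecord();
    void writeRecord(MapperRecord record);
}

//Hypothetical in-memory entity: reads from one list, writes to another
class ListEntity implements Entity {
    private final List source = new ArrayList();
    private final List sink = new ArrayList();
    private int cursor;
    private final int operation;

    public ListEntity(String entityAlias, int operation) {
        this.operation = operation; //Alias lookup omitted in this sketch
    }
    public void addSource(MapperRecord r) { source.add(r); }
    public List getSink() { return sink; }

    public void open()  { cursor = 0; }
    public void close() { }

    //Returns null when the source is exhausted, as the framework expects
    public MapperRecord readRecord() {
        return cursor < source.size() ? (MapperRecord) source.get(cursor++) : null;
    }
    public void writeRecord(MapperRecord record) { sink.add(record); }
}
```

Any class that honors this contract -- open, read until null (or write), close -- can plug into the framework's execution paths unchanged.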

You should apply the following rules to smoothly map between arbitrary data entities:

  1. Every Entity implementation creates a bidirectional map between the data entity it represents and a MapperRecord -- mandated by the Entity interface. It should know how to marshal (or translate) data from its data source into a MapperRecord, as well as write a MapperRecord's contents to its respective data store.
  2. The rules (or grammar) for mapping between the data entity and the MapperRecord object are placed in an XML file that the entity parses at runtime.
  3. Every Entity implementation has a two-parameter constructor: the name of the map to use and the operation to perform (Entity.READ or Entity.WRITE). All other object variables should be accessible via getter and setter methods. (I will clarify later why this rule is necessary):

    protected String fileMapName; //Name of the file map to use
    protected int operation;      //Entity.READ or Entity.WRITE
    protected String fileName;
    //Constructor that creates a file entity
    public FileEntity(String entityAlias, int operation) {
       this.fileMapName = entityAlias;
       this.operation = operation;
    }
    public void setFileName(String fName) { //Sets filename
       this.fileName = fName;
    }
    public String getFileName() { //Gets filename
       return this.fileName;
    }
    

Once all the framework's entities can successfully create and store MapperRecords based on XML metadata, you can effortlessly create execution paths to map data from one to another:

Entity readEntity = new FileEntity("from_map",Entity.READ);
readEntity.setFileName("/tmp/from.txt");
Entity writeEntity = new TableEntity("table_map",Entity.WRITE);
//Open entities for reading and writing
readEntity.open();
writeEntity.open();
//For each read record, write record to write entity
MapperRecord record;
while ((record = (MapperRecord)readEntity.readRecord()) != null) {
   if (record.isEmpty()) {
      continue;
   }
   writeEntity.writeRecord(record);
}
//Close entities
writeEntity.close();
readEntity.close();

The classic case

I originally designed this framework to reliably parse and create transaction-laden text files for exchange with business affiliates. Creating a custom Perl script for each affiliate's incoming (and outgoing) file formats is an arduous task for any development team, without even considering the testing and maintenance nightmares. As an alternative to Perl scripting, this reusable and extendable application pattern reduces the time spent on the development lifecycle's latter stages.

So let's start with the classic example of reading records from a text file and writing them to a database to show how well the design works. Creating the two entities, FileEntity and TableEntity, which implement the Entity interface, is fairly simple.

Parse and create any data file

The FileEntity class parses an XML file, like the following, to load different file formats into memory (using Apache's Xerces SAX parser):

<?xml version="1.0" encoding="UTF-8"?>
<!-- FileEntityList.xml -->
<filemaps>
  <map name="from_map" delimiter="," > <!-- comma-delimited file format -->
    <field name="id" />
    <field name="amount" />
    <field name="date" />
  </map>
  <map name="to_map" delimiter="|" > <!-- pipe-delimited file format -->
    <field name="date" />
    <field name="amount" />
    <field name="id" />
  </map>
  <map name="fixed_map"> <!-- fixed-length file format -->
    <field name="id" start="1" end="2" />
    <field name="amount" start="3" end="32" />
    <field name="date" start="33" end="62" />
  </map>
</filemaps>

The from_map map describes a comma-delimited file format, while the fixed_map map describes a fixed-length file format. Armed with the from_map file format, the READ operation, and a filename, FileEntity's readRecord() method can marshal the file's comma-delimited records into a MapperRecord keyed by the field names in the XML file:
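For concreteness, a line of input under the from_map format, and the MapperRecord it would marshal into, might look like this (the values are hypothetical):

```
1001,$250.00,01152003  ->  MapperRecord { id=1001, amount=$250.00, date=01152003 }
```
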

//Constructor that creates a file entity
public FileEntity(String entityAlias, int operation) {
  this.fileMapName = entityAlias;
  this.operation = operation;
}
//Sets filename
public void setFileName(String fName) {
  this.fileName = fName;
}
//Reads record from buffered reader and returns a MapperRecord
public MapperRecord readRecord() throws MapperException {
  //. . . 
  return (MapperRecord)transformLine(record);
  //. . . 
}
//Transforms record to MapperRecord based on xml specs
private MapperRecord transformLine(String record) {
  MapperRecord rec = new MapperRecord(); //Create empty mapper record
  HashMap map = (HashMap)mapList.get(fileMapName); //Get map
  ArrayList fieldList = (ArrayList)map.get(FIELD_LIST); //Get field list
   //If delimiter specified, then tokenize and place data in mapper record
  if (map.get(DELIMITER) != null) {
    StringTokenizer st = new StringTokenizer(record,(String)map.get(DELIMITER));
    for (Iterator fieldIterator = fieldList.iterator(); 
           fieldIterator.hasNext() && st.hasMoreElements(); ) {
      HashMap field = (HashMap)fieldIterator.next();
      rec.put(field.get(FIELD_NAME),(String)st.nextToken());
    }
  }
  //Have to parse record based on fixed lengths specified for each field
  else {
    for (Iterator i = fieldList.iterator(); i.hasNext(); ) {
      HashMap field = (HashMap)i.next();
      int start = Integer.parseInt((String)field.get(START));
      int end = Integer.parseInt((String)field.get(END));
      String str;
      try {
        str = record.substring(start-1,end);
      } catch (StringIndexOutOfBoundsException e) { //Reached end of record
        try { //Get remaining data
          str = record.substring(start-1);
        } catch (StringIndexOutOfBoundsException ex) {
          str = record;
        }
      }
      }
      rec.put(field.get(FIELD_NAME),str);
    }
  }
  return rec;
}

If you currently create a Perl script for each file format your system handles, consider using this entity as an alternative approach to flat-file parsing. By placing the file format descriptions in an XML file, you can parse just about any file format -- comma delimited, pipe delimited, fixed length, and so on -- without writing any code (assuming the code in readRecord() can handle it). You'll save yourself from writing tedious custom code, and you'll have the data records in a standard format to use with other business objects. Further, since the FileEntity object is a bidirectional map, it can also write data records in delimited or fixed-length format. The code for writeRecord() isn't shown above, but it is just as straightforward; see this article's source code.
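The delimited branch of writeRecord() boils down to joining the record's values in the map's field order. Here's a hedged sketch of that idea as a standalone helper (DelimitedFormatter and formatLine are my own names, not the framework's actual code; consult the source for the real implementation):

```java
import java.util.*;

//Hypothetical helper illustrating the delimited branch of FileEntity.writeRecord():
//joins the record's values in the order the XML map lists its fields.
class DelimitedFormatter {
    //fieldNames preserves the <field> order from the map; delimiter comes
    //from the map's delimiter attribute.
    static String formatLine(Map record, List fieldNames, String delimiter) {
        StringBuffer sb = new StringBuffer();
        for (Iterator i = fieldNames.iterator(); i.hasNext(); ) {
            String name = (String) i.next();
            String value = (String) record.get(name);
            sb.append(value == null ? "" : value); //Missing fields become empty columns
            if (i.hasNext()) {
                sb.append(delimiter);
            }
        }
        return sb.toString();
    }
}
```

Because the field list drives the output order, writing the same MapperRecord under to_map instead of from_map reorders the columns for free.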

Read from and write to any table

The TableEntity also uses XML data mapping rules to place MapperRecords into a table called T_STAGING_TABLE, based on the field names keyed in the MapperRecord object and the column names supplied:

<?xml version="1.0" encoding="UTF-8"?>
<!-- TableEntityList.xml -->
<tablemaps>
  <map name="table_map" table-name="T_STAGING_TABLE"
      driver="oracle.jdbc.driver.OracleDriver" 
      connect-string="jdbc:oracle:thin:user/pass@localhost:1521:DEV">
    <field name="id" column="MISC_CHAR1" column-type="String" />
    <field name="amount" column="MISC_NUM1" column-type="double" />
    <field name="date" column="MISC_DATE1" column-type="Date" date-format="MMddyyyy" />
  </map>
</tablemaps>

The writeRecord() method for TableEntity uses a JDBC PreparedStatement to write the MapperRecord to the specified table:

//Write record to table
public void writeRecord(MapperRecord record) throws MapperException {
  try {
    ArrayList fieldList = (ArrayList)getFieldList();
    PreparedStatement ps = (PreparedStatement)getPreparedStatement(fieldList);
    int i=1;
    for (Iterator fieldIterator=fieldList.iterator(); fieldIterator.hasNext(); i++) {
      HashMap field = (HashMap)fieldIterator.next();
      String recordString = (String)record.get((String)field.get(FIELD_NAME)); 
      if (recordString == null) {
        recordString = "";
      }
      if (((String)field.get(COLUMN_TYPE)).equals(STRING)) {
        ps.setString(i,recordString);
      } else if (((String)field.get(COLUMN_TYPE)).equals(LONG)) {
        ps.setLong(i,Long.parseLong(recordString));
      } else if (((String)field.get(COLUMN_TYPE)).equals(DOUBLE)) {
        ps.setDouble(i,Double.parseDouble(recordString));
      } else if (((String)field.get(COLUMN_TYPE)).equals(DATE)) {
        ps.setDate(i,new java.sql.Date((parse(recordString,
           (String)field.get(DATE_FORMAT))).getTime()));
      } else if (((String)field.get(COLUMN_TYPE)).equals(TIMESTAMP)) {
        ps.setTimestamp(i,new java.sql.Timestamp((parse(recordString,
           (String)field.get(DATE_FORMAT))).getTime()));
      }
    }
    ps.execute();
    conn.commit();
    ps.close();
  } catch (Exception e) {
    e.printStackTrace();
    throw new MapperException("error writing to entity: "+tableMapName);
  }
}
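The getPreparedStatement() helper isn't shown above. One plausible implementation builds a parameterized INSERT from the map's table-name attribute and its column attributes, with one placeholder per mapped column; the InsertSqlBuilder class below is my own sketch of that idea, not the framework's actual code:

```java
import java.util.*;

//Hypothetical sketch of the SQL that getPreparedStatement() would prepare:
//one placeholder per mapped column, in field-list order.
class InsertSqlBuilder {
    static String buildInsertSql(String tableName, List columnNames) {
        StringBuffer cols = new StringBuffer();
        StringBuffer params = new StringBuffer();
        for (Iterator i = columnNames.iterator(); i.hasNext(); ) {
            cols.append((String) i.next());
            params.append("?");
            if (i.hasNext()) {
                cols.append(",");
                params.append(",");
            }
        }
        return "INSERT INTO " + tableName + " (" + cols + ") VALUES (" + params + ")";
    }
}
```

The resulting string would then be passed to Connection.prepareStatement() once per open() and reused across writeRecord() calls, which is exactly what makes PreparedStatement the right tool here.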

Create mappings and manipulate data

The Mapper object is the module that ties the entities in the framework together. Based on XML, the object opens the proper entities, reads MapperRecords from the source entity, executes data-modifying tasks on each MapperRecord, and writes the MapperRecords to the target entity.

Data-modifying tasks are incorporated into this module so that you can manipulate MapperRecords before you write them to the target entity. For example, a convenient task for fields in a MapperRecord is to replace all occurrences of a string with another string:

public abstract class Task {
  public static String ALL="ALL";
  public String target = ALL; //Execute on all fields of MapperRecord by default
  public abstract MapperRecord execute(MapperRecord mapper);
  public void setTarget(String str) {
    target = str;
  }
}
public class ReplaceTask extends Task {
  // . . .
  public MapperRecord execute(MapperRecord record) {
    if (target.equals(ALL)) { //Execute task on all fields in MapperRecord
      for (Iterator i = record.keySet().iterator(); i.hasNext(); ) {
        String key = (String)i.next();
        record.put(key,replace((String)record.get(key),find,replace));
      }
    } else { //Execute task on only the fields specified in XML file
      StringTokenizer st = new StringTokenizer(target,",");
      while (st.hasMoreTokens()) {
        String key = (String)st.nextToken();
        record.put(key,replace((String)record.get(key),find,replace));
      }
    }
    return record;
  }
  public void setFind(String str) {
    find = str;
  }
  public void setReplace(String str) {
    replace = str;
  }
}
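Filling in ReplaceTask's elided pieces gives a self-contained sketch you can run. The find/replace fields are assumed from the setters shown; the replace() helper is hand-rolled because pre-1.4 java.lang.String has no replaceAll(). SimpleReplaceTask is my own stand-in, not the framework's class:

```java
import java.util.*;

//Self-contained sketch of ReplaceTask: the find/replace fields and a
//replace() helper filled in around the execute() logic shown in the article.
class SimpleReplaceTask {
    String target = "ALL";
    String find;
    String replace;

    Map execute(Map record) {
        if (target.equals("ALL")) { //Execute task on all fields
            for (Iterator i = record.keySet().iterator(); i.hasNext(); ) {
                String key = (String) i.next();
                record.put(key, replace((String) record.get(key), find, replace));
            }
        } else { //Execute task on only the comma-separated fields in target
            StringTokenizer st = new StringTokenizer(target, ",");
            while (st.hasMoreTokens()) {
                String key = st.nextToken();
                record.put(key, replace((String) record.get(key), find, replace));
            }
        }
        return record;
    }

    //Replaces every occurrence of find in str with replacement
    static String replace(String str, String find, String replacement) {
        if (str == null) return null;
        StringBuffer sb = new StringBuffer();
        int from = 0, at;
        while ((at = str.indexOf(find, from)) >= 0) {
            sb.append(str.substring(from, at)).append(replacement);
            from = at + find.length();
        }
        sb.append(str.substring(from));
        return sb.toString();
    }
}
```

Configured with find="$" and replace="", this is exactly the dollar-sign-stripping task the MapList.xml example below wires up.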

The module creates the desired mapping by parsing the following XML file and maps data using Java reflection:

<?xml version="1.0" encoding="UTF-8"?>
<!-- MapList.xml -->
<map-list>
  <map name="file_to_db_map">
    <read name="from_map" class="mapper.entity.FileEntity">
      <method name="setFileName">
        <arg type="java.lang.String" array-index="1" /> <!-- used to pass in filename -->
      </method>
    </read>
    <tasks>
      <task class="mapper.task.ReplaceTask" target="amount,date">
        <task-method name="setFind" value="$" />
        <task-method name="setReplace" value="" />
      </task>
    </tasks>
    <write name="table_map" class="mapper.entity.TableEntity" />
  </map>
</map-list>

The mapping described here tells the Mapper object to execute the following steps:

  1. Open the FileEntity from_map and call setFileName, passing in the element at array-index 1 of the Mapper object's argument array (the filename). This method call is crucial to initializing the FileEntity and is why a setter method must be present in the FileEntity class.
  2. Open the TableEntity called table_map.
  3. While Mapper can still read records from the FileEntity:
    1. Read a MapperRecord from the FileEntity.
    2. Execute a ReplaceTask, which replaces all $ occurrences with a blank in the MapperRecord object's amount and date fields. The task-method element in the XML file calls the method specified under the name attribute and passes in the value attribute as a java.lang.String parameter. (Currently, only java.lang.String is supported for the parameter types of task setter methods).
    3. Write the MapperRecord to the TableEntity.
  4. Close both entities.
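Step 1's setter invocation is plain java.lang.reflect work. Here's a hedged sketch of the idea; DemoEntity and applySetter are my own simplifications, not the framework's actual MapList code:

```java
import java.lang.reflect.Method;

//Hypothetical stand-in for an entity with a reflectively invoked setter
class DemoEntity {
    private String fileName;
    public void setFileName(String fName) { this.fileName = fName; }
    public String getFileName() { return fileName; }
}

class ReflectiveConfigurer {
    //Invokes the setter named in the XML's method element, passing
    //args[argIndex] (the arg element's array-index attribute) as a
    //java.lang.String parameter.
    static void applySetter(Object entity, String methodName,
                            String[] args, int argIndex) throws Exception {
        Method m = entity.getClass().getMethod(methodName,
                new Class[] { String.class });
        m.invoke(entity, new Object[] { args[argIndex] });
    }
}
```

This is why rule 3 insists on getter/setter methods: reflection can only wire up state that a public setter exposes.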

The Mapper module follows the above steps exactly:

//Main program
public static void main(String[] args) throws Exception {
  // . . .
  mapper.execute(args); 
}
//This function is the main control flow of the program.
public void execute(String mapperArgs[]) {
  String mapName = mapperArgs[0];
  Entity readEntity = null;
  Entity writeEntity = null;
  try {
    //Get entities, pass in args array for use with reflection code
    readEntity = (Entity)mapList.getEntity(mapName,Entity.READ,mapperArgs);
    writeEntity = (Entity)mapList.getEntity(mapName,Entity.WRITE,mapperArgs);

    //. . .

    //Open entities for reading and writing
    readEntity.open();
    writeEntity.open();
    //For each read record, write record to write entity
    MapperRecord record;
    while ((record = (MapperRecord)readEntity.readRecord()) != null) {
      //. . .
      //Execute any tasks before writing
      ArrayList tasks = (ArrayList)mapList.getTasks(mapName);
      if (tasks != null) { //If tasks to execute, then execute them
        try {
          Iterator taskIterator = tasks.iterator();
          while (taskIterator.hasNext()) {
            Task t = (Task)taskIterator.next();
            record = t.execute(record);
          }
        } catch (Exception e) {
          e.printStackTrace(); //Log task failures rather than silently swallowing them
        }
      }
      writeEntity.writeRecord(record);
    }
    //Close entities
    writeEntity.close();
    readEntity.close();
  } catch (Exception e) {
    e.printStackTrace();
  }
}

You can execute the file_to_db_map from either the command line or programmatically by specifying the desired mapping and filename to read from:

java mapper.Mapper file_to_db_map /tmp/from.txt

or

Mapper mapper = new Mapper();
String[] args = {"file_to_db_map","/tmp/from.txt"};
mapper.execute(args);

As you can see, to create mappings between entities, you simply edit XML files without writing or recompiling code. If a mapping's requirements ever change, such as the destination entity changing from a database table to a file, simply create the file format in the FileEntityList.xml file and edit the MapList.xml file to use this new format. That's all you need to create new execution paths to other data entities!

Performance and limitations

The framework's overall performance is mainly a function of the readRecord() and writeRecord() methods. The number and efficiency of the data-modifying tasks also influence how fast a given mapping runs. To be more specific, mapping between a comma-delimited file and a fixed-length file is relatively fast -- around 10,000 records per second on a 1.5 GHz Wintel machine with 256 MB of RAM. However, mappings to databases are much slower, as the database's physical network distance and network latency act as the primary bottlenecks. Also, the current architecture assumes that MapperRecords contain java.lang.String objects only; therefore, adding support for other data types will most likely slow down the mapping process.

The Mapper object assumes that records being read from a data source are eventually exhausted, as with a file or a database table. If the source data entity is a system service, like a stateless Enterprise JavaBean (EJB) or a message-based business object, the notion of the data stream ending doesn't exist. In that case, you can create a MapperRecord manually, open the target entity from within the service directly, write the record, and finally, close the entity.
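That manual path for stream-style sources can be sketched as follows. The stub classes here stand in for the framework's real MapperRecord and TableEntity so the sketch compiles on its own, and the field values are illustrative:

```java
import java.util.*;

//Stand-ins so the sketch compiles on its own; real code would use the
//framework's MapperRecord and TableEntity.
class MapperRecord extends HashMap {
}

class TableEntity {
    static final int WRITE = 1;
    private final List written = new ArrayList();
    public TableEntity(String mapName, int operation) { }
    public void open() { }
    public void close() { }
    public void writeRecord(MapperRecord r) { written.add(r); }
    public List getWritten() { return written; }
}

class StreamSourceExample {
    //Called once per incoming message/event -- there is no end-of-stream,
    //so the service builds and writes each MapperRecord itself instead of
    //relying on the Mapper object's read-until-null loop.
    static void onMessage(TableEntity writeEntity,
                          String id, String amount, String date) {
        MapperRecord record = new MapperRecord();
        record.put("id", id);
        record.put("amount", amount);
        record.put("date", date);
        writeEntity.writeRecord(record);
    }
}
```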

The framework's other minor limitations include a lack of header and trailer support for flat files and a lack of support for some database data types. However, extending the design to handle these shortcomings is not difficult. Further, the framework fits the J2EE Connector Architecture (JCA) and Web services just as well as it does files and JDBC (Java Database Connectivity). For example, assuming you created a JCAEntity and a WebServiceEntity, you could easily read data from an enterprise information system and write it to a Web service.

Sharing a common language

The Mapper framework yields seamless data integration by translating the languages of different data sources into a common interlingual form. The architecture lets you transfer data between databases, files, and nearly any other data source. As you add other entities and requirements evolve over time, you significantly lower development, maintenance, and testing hurdles with this robust framework.

(I conceived and implemented the idea for this framework at Stockback, LLC.)

Snehal Patel has worked with Java and object-oriented designs for almost five years. He has served as a teaching assistant for introductory computer science courses in Java at his alma mater, Williams College. He's currently an independent consultant living in New York City.
