Groovy ETLs with Scriptella

The open source BI, reporting, and ETL space is bursting at the seams with capable tools. In fact, I recently had a conversation with Tim Berglund regarding open source business intelligence where he specifically mentions two highly regarded tools: Pentaho and Talend. These tools do everything under the sun when it comes to BI, including ETL. Nevertheless, when it comes to simple ETL alone, which stands for Extract-Transform-Load (basically meaning take data from one source, do something to it, and then load it into another source) I tend to lean on Scriptella.

Scriptella is an open source ETL tool that integrates easily with Ant, works with any JDBC driver, and allows for integrating languages like Groovy. Thus, Scriptella quickly facilitates automating an ETL process quite nicely. For example, imagine having some financial data represented in one table. To make life a bit easier for consumers of that data, you’d like to create a new table containing some derived data from the original values. For instance, the new table could store total order prices including tax, or converting order prices into another format (such as Euros) as opposed to storing each value or derivation in a separate column, which places the brunt of work on the consumer at query time.

Executing this transformation with Scriptella, sprinkled with a pinch of Groovy couldn’t be easier. All you need to do is logically map your work-flow — in this case, I’m going to perform three steps. First, run a select statement, which returns some data. Next, I’m going to take that data and calculate a value. Finally, I’m going to take that value and insert it into another table.

Scriptella makes heavy usage of XML. While XML isn’t so hip anymore, it’s still easy enough to use and I’m sure soon enough, someone with wrap everything up in a nice DSL. Regardless, getting started with Scriptella requires creating an XML file and defining datasources. Datasources can be databases, files, and even languages, such as Groovy (via JSR 223). They are defined using the connection element like so:

<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
	<description>...</description>
	<properties>
		<include href="etl.properties" />
	</properties>
	<connection id="in" driver="$driver" url="$url" user="$user"
		password="$password" classpath="$classpath" />

    <connection id="groovy" driver="script">language=groovy</connection>
...
</etl>

In the code above, I’m loading a properties file, which defines JDBC properties, which I can then reference with a $ sign. I use those properties to define a connection and give it a logical name of “in”; plus, I define another connection named “groovy”, which as you’ve probably guessed means I can use Groovy to code some transformations.

Next, I have to define the work-flow (which I delineated earlier), which means defining query blocks and script blocks. The difference between the two elements is subtle; suffice to say, you use script blocks to update things, such as writing to a file or database. Conversely, you use query blocks to read data.

Accordingly, I have to define a query block to select my data, another query block inside that block (which means, for each row, execute this block) for executing my Groovy transformation, and then a script block inside that 2nd block to write the corresponding result to a table. Embedding query or script blocks inside each other implies iteration. Watch:

<query connection-id="in">
 SELECT ID, UPB, TERM, NR, PTR
 FROM FOO

 <query connection-id="groovy">
  <![CDATA[
   import com.acme.impl.IRRService

   irr = FinancialService.IRR(UPB, TERM, NR, PTR, 0.005)
   query.next()
  ]]>	

   <script connection-id="in">
   UPDATE BAR
    SET IRR = ?irr
   WHERE ID = ?ID
  </script>
 </query>
</query>

As you can see in the code above, The first query block defines a select statement — all selected column values are available to sub-blocks defined in Groovy, for instance, by directly referencing their name (i.e. ID, UPB). Thus, in the next sub-block, some simple Groovy is used to calculate a value. Scriptella requires you invoke query.next() in these simple scripts to force sub-blocks to execute, which in this case, is a script block that in turn updates another table with the newly calculated value.

Finally, all of this is simply executed via, say Ant, by using the etl task like so:

<target name="irr-etl">
 <etl file="./src/etl/irr.etl.xml" debug="${scriptella.debug}" />
</target>

There are myriad tools in the ETL, reporting, and general BI space — open source and commercial; however, if all you need is a simple ETL tool that integrates nicely with something like Ant and makes use of alternate JVM languages like Groovy, then Scriptella is your tool, baby.

Looking to spin up Continuous Integration quickly? Check out www.ciinabox.com.