Open source Java projects: Spring Batch

Reading and writing CSV files with Spring Batch and MySQL

Implementing a batch process to handle gigabytes of data is a veritable tsunami of a task, but you can take it down a chunk with the help of Spring Batch. This popular Spring module has been engineered to handle the details of batch processing for all sorts of data sources. Get started with Spring Batch by building a simple job that imports products from a CSV file into a MySQL database, then explore the module's batch processing capabilities with single and multiple processors and one or more helpful tasklets. Finally, get a quick overview of Spring Batch's resiliency tools for skipping records, retrying records, and restarting batch jobs.

If you've ever had to implement a batch process to pass hundreds of thousands of data elements between Java enterprise systems, then you know what a load of work that is. Your batch processing system needs to handle huge amounts of data, survive the failure of individual records without crashing the entire process, and manage interruptions and restarts without re-doing work that's already been done.

For the uninitiated, here are some scenarios that require batch processing, and where using Spring Batch could potentially save you countless hours:

  • You receive a file that is missing some information, so you parse through the file, call a service to retrieve the missing information, and write the file out for another batch process to handle.
  • When an error occurs in your environment, you write the failed message to your database. You have a process that looks for failed messages every 15 minutes and replays ones that you've identified as replayable.
  • You have a workflow in which, in addition to the events you receive, you expect other systems to call certain services. If those other systems do not call your services, then you automatically clean up your data after a couple of days so that the business processes do not fail.
  • You receive a file every day that contains employee updates and you need to create artifacts for new employees.
  • You have services that can be used to customize orders. Every night you run a batch process that constructs manifest files and sends them to your fulfillment vendors.

Jobs and chunks: The Spring Batch paradigm

Spring Batch has a lot of moving parts, but let's start by looking at the core processing that you'll do in a batch job. You can think about the work in a job as following three distinct steps:

  1. Reading
  2. Processing
  3. Writing

For example, you might open a data file in CSV format, perform some processing on the data in the file, and then write the data to a database. In Spring Batch, you would configure a reader to read one line of the file at a time and pass each line to your processor; the processor would transform each record, and Spring Batch would group the processed records into "chunks" and hand them to a writer, which would insert them into the database. You can see the cycle in Figure 1.

Figure 1. The basic logic of Spring Batch processing


Spring Batch greatly simplifies batch processing by providing ready-made reader implementations for common input sources like CSV files, XML files, databases, JSON records contained in a file, and even JMS, along with corresponding writer implementations. It's also fairly simple to build custom readers and writers if you need to.
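
If you do need a custom reader, it's just an implementation of the ItemReader interface. Here's a minimal sketch of one that reads from an in-memory list; the class name and package are hypothetical, and Spring Batch actually ships a similar ready-made ListItemReader in its org.springframework.batch.item.support package:

package com.geekcap.javaworld.springbatchexample.simple.reader;

import org.springframework.batch.item.ItemReader;

import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical custom reader that serves items from an in-memory list.
 * Returning null from read() tells Spring Batch that the input is exhausted.
 */
public class InMemoryItemReader<T> implements ItemReader<T>
{
    private final Iterator<T> iterator;

    public InMemoryItemReader( List<T> items ) {
        this.iterator = items.iterator();
    }

    @Override
    public T read() {
        return iterator.hasNext() ? iterator.next() : null;
    }
}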

To get started, let's look at the process to configure a file reader to read a CSV file, map its contents to an object, and insert the resulting objects into a database.

Reading and processing a CSV file

Spring Batch's built-in reader, org.springframework.batch.item.file.FlatFileItemReader, parses a file into individual lines. It requires a resource that references the flat file, the number of lines to skip at the beginning of the file (typically just the file headers), and a line mapper that converts an individual line into an object. The line mapper, in turn, requires a line tokenizer that divides a line into its constituent fields, and a field set mapper that builds an object from the set of field values. The configuration for FlatFileItemReader is shown below:

Listing 1. A Spring Batch config file


    <bean id="productReader" class="org.springframework.batch.item.file.FlatFileItemReader" scope="step">

        <!-- <property name="resource" value="file:./sample.csv" /> -->
        <property name="resource" value="file:#{jobParameters['inputFile']}" />

        <property name="linesToSkip" value="1" />

        <property name="lineMapper">
            <bean class="org.springframework.batch.item.file.mapping.DefaultLineMapper">

                <property name="lineTokenizer">
                    <bean class="org.springframework.batch.item.file.transform.DelimitedLineTokenizer">
                        <property name="names" value="id,name,description,quantity" />
                    </bean>
                </property>

                <property name="fieldSetMapper">
                    <bean class="com.geekcap.javaworld.springbatchexample.simple.reader.ProductFieldSetMapper" />
                </property>
            </bean>
        </property>
    </bean>

Let's look at these components. First, Figure 2 shows a diagram of the relationship between them.

Figure 2. Components of FlatFileItemReader


Resources: The resource property defines the file to read. The commented-out resource property shows a hard-coded path to sample.csv in the directory in which the batch job is run. The more interesting entry is the inputFile job parameter: job parameters let you pass values to the job at runtime. In the case of an import file, it's a very important parameter to resolve at runtime rather than at build time. (It would be pretty boring to import the same file over and over again!) Note that the reader bean is declared with scope="step", which is what allows the jobParameters expression to be resolved when the step actually runs.
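
To see how the inputFile job parameter is supplied, here's a hedged sketch of launch code; the JobLauncher and Job are assumed to be obtained from the Spring application context, and the file path is just an example value. (Spring Batch's CommandLineJobRunner accepts the same parameter on the command line using key=value syntax, e.g. inputFile=sample.csv.)

package com.geekcap.javaworld.springbatchexample.simple;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

/**
 * Hypothetical launch code showing how the inputFile job parameter
 * might be passed at runtime; the file path is an example value.
 */
public class ProductImportLauncher
{
    public void launch( JobLauncher jobLauncher, Job job ) throws Exception {
        JobParameters jobParameters = new JobParametersBuilder()
                .addString( "inputFile", "sample.csv" )
                .toJobParameters();
        jobLauncher.run( job, jobParameters );
    }
}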

Lines to skip: The linesToSkip property tells the file reader how many leading lines in the file to skip. Quite often CSV files will contain header information, such as column names, in the first line of a file, so in this example we tell the file reader to skip the first line.

Line mapper: The lineMapper is responsible for converting individual lines of a file into objects. It depends on two components:

  • lineTokenizer defines how to break the line up into tokens; here we use a comma-delimited tokenizer and list the names of the columns in the CSV file so that each field can be referenced by name.
  • fieldSetMapper builds an object from the field values. In our case we build a Product object from the id, name, description, and quantity fields.

Note that Spring Batch provides the infrastructure for us, but we're still responsible for the logic in the field set mapper. Listing 2 shows the source code for the Product object, which is the object we're building.

Listing 2. Product.java


package com.geekcap.javaworld.springbatchexample.simple.model;

/**
 * Simple POJO to represent a product
 */
public class Product
{
    private int id;
    private String name;
    private String description;
    private int quantity;

    public Product() {
    }

    public Product(int id, String name, String description, int quantity) {
        this.id = id;
        this.name = name;
        this.description = description;
        this.quantity = quantity;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getDescription() {
        return description;
    }

    public void setDescription(String description) {
        this.description = description;
    }

    public int getQuantity() {
        return quantity;
    }

    public void setQuantity(int quantity) {
        this.quantity = quantity;
    }
}

The Product class is a simple POJO that wraps our four fields. Listing 3 shows the source code for the ProductFieldSetMapper class.

Listing 3. ProductFieldSetMapper.java


package com.geekcap.javaworld.springbatchexample.simple.reader;

import com.geekcap.javaworld.springbatchexample.simple.model.Product;
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.validation.BindException;

/**
 * Builds a Product from a row in the CSV file (as a set of fields)
 */
public class ProductFieldSetMapper implements FieldSetMapper<Product>
{
    @Override
    public Product mapFieldSet(FieldSet fieldSet) throws BindException {
        Product product = new Product();
        product.setId( fieldSet.readInt( "id" ) );
        product.setName( fieldSet.readString( "name" ) );
        product.setDescription( fieldSet.readString( "description" ) );
        product.setQuantity( fieldSet.readInt( "quantity" ) );
        return product;
    }
}

The ProductFieldSetMapper class implements the FieldSetMapper interface, which defines a single method: mapFieldSet(). Once the line tokenizer has split a line into its constituent fields, the resulting FieldSet, which contains the named fields, is passed to the mapFieldSet() method. This method is responsible for building an object to represent that row in the CSV file. In our case, we build a Product instance by invoking the various read methods on the FieldSet.
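
Because ProductFieldSetMapper is plain Java, we can exercise it outside of a running job. Here's a hedged sketch that hand-builds a FieldSet using Spring Batch's DefaultFieldSet class; the demo class and the sample values are made up:

package com.geekcap.javaworld.springbatchexample.simple.reader;

import com.geekcap.javaworld.springbatchexample.simple.model.Product;
import org.springframework.batch.item.file.transform.DefaultFieldSet;
import org.springframework.batch.item.file.transform.FieldSet;

/**
 * Hypothetical demo that feeds a hand-built FieldSet to the mapper,
 * mimicking what the line mapper does for each line of the CSV file.
 */
public class ProductFieldSetMapperDemo
{
    public static void main( String[] args ) throws Exception {
        FieldSet fieldSet = new DefaultFieldSet(
                new String[] { "1", "Widget", "A sample product", "10" },
                new String[] { "id", "name", "description", "quantity" } );

        Product product = new ProductFieldSetMapper().mapFieldSet( fieldSet );
        System.out.println( product.getName() + ": " + product.getQuantity() );
    }
}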

Writing to the database

After we've read the file and have a set of Products, the next step is to write them to the database. Technically we could wire in a processing step that does something to the data, but for now let's just write the data to the database. Listing 4 shows the source code for the ProductItemWriter class.

Listing 4. ProductItemWriter.java


package com.geekcap.javaworld.springbatchexample.simple.writer;

import com.geekcap.javaworld.springbatchexample.simple.model.Product;
import org.springframework.batch.item.ItemWriter;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

/**
 * Writes products to a database
 */
public class ProductItemWriter implements ItemWriter<Product>
{
    private static final String GET_PRODUCT = "select * from PRODUCT where id = ?";
    private static final String INSERT_PRODUCT = "insert into PRODUCT (id,name,description,quantity) values (?,?,?,?)";
    private static final String UPDATE_PRODUCT = "update PRODUCT set name = ?, description = ?,quantity = ? where id = ?";

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Override
    public void write(List<? extends Product> products) throws Exception
    {
        for( Product product : products )
        {
            List<Product> productList = jdbcTemplate.query(GET_PRODUCT, new Object[] {product.getId()}, new RowMapper<Product>() {
                @Override
                public Product mapRow( ResultSet resultSet, int rowNum ) throws SQLException {
                    Product p = new Product();
                    p.setId( resultSet.getInt( 1 ) );
                    p.setName( resultSet.getString( 2 ) );
                    p.setDescription( resultSet.getString( 3 ) );
                    p.setQuantity( resultSet.getInt( 4 ) );
                    return p;
                }
            });

            if( productList.size() > 0 )
            {
                jdbcTemplate.update( UPDATE_PRODUCT, product.getName(), product.getDescription(), product.getQuantity(), product.getId() );
            }
            else
            {
                jdbcTemplate.update( INSERT_PRODUCT, product.getId(), product.getName(), product.getDescription(), product.getQuantity() );
            }
        }
    }
}

The ProductItemWriter class implements the ItemWriter interface and its single method: write(). The write() method accepts a list of Products. Spring Batch implements its writers using a "chunking" strategy, which means that while reads are performed one item at a time, writes are chunked together into groups. In the job configuration, which is defined below, you have full control over the number of items that you want chunked together into a single write, through the commit-interval; a sketch of what that configuration might look like follows the list below. In this example, the write() method does the following:

  1. It executes an SQL SELECT statement to retrieve the Product with the specified id.
  2. If the SELECT returns an item then write() performs an update to update the database record with the new values.
  3. If the SELECT does not return an item then write() performs an INSERT to add the product to the database.
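
To make the commit-interval concrete, here is a hedged sketch of a chunk-oriented step, assuming the Spring Batch XML namespace is declared; the job, step, and productWriter bean names are hypothetical, and the actual configuration appears with the application context below:

    <batch:job id="simpleFileImportJob">
        <batch:step id="importFileStep">
            <batch:tasklet>
                <!-- commit-interval controls how many items are grouped into a single write -->
                <batch:chunk reader="productReader" writer="productWriter" commit-interval="5" />
            </batch:tasklet>
        </batch:step>
    </batch:job>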

The ProductItemWriter class uses Spring's JdbcTemplate class, which is defined in the applicationContext.xml file below and automatically wired into the ProductItemWriter class. If you haven't used JdbcTemplate, it is an implementation of the Gang of Four template method design pattern for interacting with databases through the JDBC interface. The code should be pretty self-explanatory, but if you need more information, check out the Spring JdbcTemplate javadoc.

Wiring it together in the application context file

Thus far we have built a Product domain object, a ProductFieldSetMapper that converts a line in the CSV file into an object, and a ProductItemWriter that writes objects to the database. Now we need to configure Spring Batch to wire all of these together. Listing 5 shows the source code for the applicationContext.xml file, which defines our beans.
