Merging lists in Spring Batch jobs to reduce data lookup time


When designing a batch job, a common requirement is to take a list of records and enhance them with data from another source.

There are a couple of ways to achieve this. One way would be to read a record from the main list and then gather the additional information for that record before reading the next record. This gathering would typically be done in an ItemProcessor class.

This is fine when there is a relatively small number of records to process, but for a larger dataset, it may be more efficient to retrieve the data as two (or more) lists and merge them together.

In one example I had recently, a job needed to read 15 million records from a CSV file and enhance each one with some values from corresponding records in a database. The database lookup for each record would have taken about 1 millisecond, so processing the 15 million records would have taken over 2 hours if the job was run using a single thread. Running a single query to retrieve all the additional values and merging the results with the CSV file records reduced that time to about 20 minutes, a reduction of more than 80%. Using multiple threads would have helped with the overall time for either solution, but you get the idea.
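The arithmetic behind that estimate can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and it assumes exactly 1 ms per lookup with no other overhead. On that assumption the sequential lookups alone come to 15,000 seconds (a little over 4 hours), so the "over 2 hours" figure is a conservative one.

```java
import java.time.Duration;

/** Hypothetical helper for estimating lookup cost; not part of the job itself. */
class LookupEstimate {

    /** Total time for one sequential lookup per record. */
    static Duration sequentialLookups(long records, Duration perLookup) {
        return perLookup.multipliedBy(records);
    }

    /** Percentage of time saved when going from 'before' to 'after'. */
    static double reductionPercent(Duration before, Duration after) {
        return 100.0 * (before.toMillis() - after.toMillis()) / before.toMillis();
    }
}
```

For example, `LookupEstimate.sequentialLookups(15_000_000L, Duration.ofMillis(1))` gives a duration of 250 minutes, and comparing that with a 20 minute run gives a reduction of about 92%.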

Let’s see how I performed the merging. This relies on the different lists being in the same order according to whatever key you will be using to match records in the different lists.
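Because everything depends on that ordering, it can be worth failing fast when a source turns out not to be sorted. The following is a minimal sketch of such a guard, not something taken from the job described here: `OrderCheckingIterator` is a hypothetical name, and a plain `java.util.Iterator` stands in for a Spring Batch reader so that the example runs without any dependencies.

```java
import java.util.Iterator;
import java.util.function.Function;

/**
 * Hypothetical guard: passes items through unchanged, but throws as soon as
 * a key is smaller than the one before it.
 */
class OrderCheckingIterator<K extends Comparable<K>, T> implements Iterator<T> {
    private final Iterator<T> delegate;   // stand-in for the real data source
    private final Function<T, K> keyOf;   // extracts the matching key from an item
    private K previousKey;                // null until the first item has been read

    OrderCheckingIterator(Iterator<T> delegate, Function<T, K> keyOf) {
        this.delegate = delegate;
        this.keyOf = keyOf;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public T next() {
        T item = delegate.next();
        K key = keyOf.apply(item);
        // Equal keys are allowed: several records may share one key.
        if (previousKey != null && key.compareTo(previousKey) < 0) {
            throw new IllegalStateException(
                    "Key " + key + " arrived after " + previousKey + "; source is not sorted");
        }
        previousKey = key;
        return item;
    }
}
```

Wrapping each source in a check like this turns a silent mismatch (records that are never matched) into an immediate, visible failure.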

The merging is performed by a custom reader that uses a helper class. The helper reads items from an underlying ItemReader, either collecting the records for the next key or accumulating the records for a particular key. The helper class looks like this:

[prism field=Helper_Class language=java]

Okay, there’s quite a bit of code here. The class gives you methods to read the items for the next key in the data, to read the first item for a particular key, or to read all items for a key. I have found that this covers all of my requirements.
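As a rough illustration of those three operations, here is a minimal sketch. It is not the helper class from this post: `KeyedGroupReader` and its method names are hypothetical, a plain `java.util.Iterator` stands in for the Spring Batch ItemReader, and the source is assumed to be sorted by key and to contain no null items.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: groups items from a key-ordered source using a one-item look-ahead. */
class KeyedGroupReader<K extends Comparable<K>, T> {
    private final Iterator<T> source;     // stand-in for an ItemReader
    private final Function<T, K> keyOf;   // extracts the matching key
    private T lookahead;                  // item read from the source but not yet returned

    KeyedGroupReader(Iterator<T> source, Function<T, K> keyOf) {
        this.source = source;
        this.keyOf = keyOf;
    }

    /** Fills the look-ahead buffer if needed; returns null at end of data. */
    private T peek() {
        if (lookahead == null && source.hasNext()) {
            lookahead = source.next();
        }
        return lookahead;
    }

    /** Reads every item that shares the next key; returns an empty list at end of data. */
    List<T> readNextKey() {
        T item = peek();
        return item == null ? new ArrayList<>() : readAllForKey(keyOf.apply(item));
    }

    /** Skips items with smaller keys, then reads every item matching the given key. */
    List<T> readAllForKey(K key) {
        List<T> items = new ArrayList<>();
        while (peek() != null) {
            int cmp = keyOf.apply(lookahead).compareTo(key);
            if (cmp > 0) {
                break;                    // gone past the key: keep the item buffered
            }
            if (cmp == 0) {
                items.add(lookahead);
            }
            lookahead = null;             // consume the matched or skipped item
        }
        return items;
    }

    /** Reads the first item for the key (or null), discarding the rest of that group. */
    T readFirstForKey(K key) {
        List<T> items = readAllForKey(key);
        return items.isEmpty() ? null : items.get(0);
    }
}
```

The look-ahead buffer is the important design choice: the only way to know a group has ended is to read the first item of the next group, so that item has to be kept for the following call rather than thrown away.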

Let’s look at how you would use it. Imagine a situation where there is a list of PrimaryObj objects that need to be merged with a list of SecondaryObj objects.

To create the concrete classes you’ll need an accumulator class for each model class, like this for PrimaryObj and something similar for SecondaryObj:

[prism field=Concrete_classes language=java]

In this case the key used to match the records is a Long value, but it could be any class that implements Comparable.
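To illustrate a non-Long key, here is a sketch of a composite key. `AccountKey` and its fields are invented for this example. The one real constraint is that `compareTo` must order keys the same way the data sources are sorted; otherwise the merge will skip records it should have matched.

```java
import java.util.Comparator;

/** Hypothetical composite key: ordered by branch code first, then by account number. */
record AccountKey(String branch, long account) implements Comparable<AccountKey> {

    // Must match the ORDER BY (or file sort order) used by every source.
    private static final Comparator<AccountKey> ORDER =
            Comparator.comparing(AccountKey::branch)
                      .thenComparingLong(AccountKey::account);

    @Override
    public int compareTo(AccountKey other) {
        return ORDER.compare(this, other);
    }
}
```

Note that the account number is compared numerically here. If a source were sorted on a text column instead, "10" would sort before "2" and the key class would need to compare strings to match.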

Here’s an example of a reader class that reads a list of PrimaryObj records for the next key value, and also reads any SecondaryObj records for that key. Both lists are added to a MergeWrapper object that is returned by the reader:

[prism field=Merger_wrapper language=java]

The reader just gathers the matching records from both sources together into a wrapper object. It would be possible to combine the lists of records together in the reader, but I prefer to do that in a processor – in my view the reader should just read the data.
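That split between reading and combining can be sketched as follows. This is an illustration under assumptions rather than the reader from this post: the fields on PrimaryObj, SecondaryObj and MergeWrapper are invented, `MergingReader` and `MergeProcessor` are hypothetical names, and plain iterators stand in for the Spring Batch readers. Like `ItemReader.read()`, the `read()` method returns null when the data is exhausted.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model classes; the real ones will have different fields.
record PrimaryObj(long id, String name) {}
record SecondaryObj(long id, String detail) {}
record MergeWrapper(long key, List<PrimaryObj> primaries, List<SecondaryObj> secondaries) {}

/** Reads both key-ordered sources in step and only gathers matching records. */
class MergingReader {
    private final Iterator<PrimaryObj> primary;
    private final Iterator<SecondaryObj> secondary;
    private PrimaryObj nextPrimary;       // look-ahead for the primary source
    private SecondaryObj nextSecondary;   // look-ahead for the secondary source

    MergingReader(Iterator<PrimaryObj> primary, Iterator<SecondaryObj> secondary) {
        this.primary = primary;
        this.secondary = secondary;
        this.nextPrimary = primary.hasNext() ? primary.next() : null;
        this.nextSecondary = secondary.hasNext() ? secondary.next() : null;
    }

    /** Returns the records for the next primary key, or null at end of data. */
    MergeWrapper read() {
        if (nextPrimary == null) {
            return null;
        }
        long key = nextPrimary.id();
        List<PrimaryObj> primaries = new ArrayList<>();
        while (nextPrimary != null && nextPrimary.id() == key) {
            primaries.add(nextPrimary);
            nextPrimary = primary.hasNext() ? primary.next() : null;
        }
        // Skip secondary records with smaller keys; keep those that match.
        List<SecondaryObj> secondaries = new ArrayList<>();
        while (nextSecondary != null && nextSecondary.id() <= key) {
            if (nextSecondary.id() == key) {
                secondaries.add(nextSecondary);
            }
            nextSecondary = secondary.hasNext() ? secondary.next() : null;
        }
        return new MergeWrapper(key, primaries, secondaries);
    }
}

/** The combining logic lives in a separate, processor-style class. */
class MergeProcessor {
    /** Appends the matching secondary details to each primary record's name. */
    static List<String> process(MergeWrapper wrapper) {
        List<String> details = new ArrayList<>();
        for (SecondaryObj s : wrapper.secondaries()) {
            details.add(s.detail());
        }
        List<String> enriched = new ArrayList<>();
        for (PrimaryObj p : wrapper.primaries()) {
            enriched.add(p.name() + " " + details);
        }
        return enriched;
    }
}
```

In this sketch a primary key with no secondary records still produces a wrapper, just with an empty secondary list, so the processor decides what to do about missing data.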

It is possible to handle three or more data sources that need to be merged. You just need to change the reader to gather the records from however many sources you have and store them in the wrapper object.

To conclude

This approach works best if the lists of items to be processed are of a similar size.

If the secondary list is many times the size of the primary list, you will end up reading and ignoring a lot of data. If this is from a database, you may be able to tweak the query to avoid reading unnecessary rows, but if the data is from a flat file, there is little you can do but read through the entire file. In fact, if the secondary data is from a flat file, unless you are able to load that flat file into a database in a separate step in the job, this is the only option as flat files don’t support random access.

Using this approach has been beneficial for me with both databases and flat files. But it can only work if all the data sources are read in the same order and share a common key that can be used to match items. If this is not the case, then you will have to look up the additional data for each record.



Photo by Steve Johnson on Unsplash

By Jeremy Yearron
7 January 2019
Java | The Good Systems Blog
