Earlier this week, I wrote some data pump code to move information between DB2 and Domino, filtering and cleaning it on the way. I wrote the code using JPA 2 so that I’d have less code to write, and so that I could more easily turn it over to someone else some day and not have a lot of questions to answer.
Performance seemed really slow, so I decided to measure just how slow. I grabbed my stopwatch class, hacked it in, replaced the output database with a file, and looked at the results in seconds taken per thousand records read:
JPA 2: 81.54 s/krec 77.82 s/krec 81.89 s/krec 89.60 s/krec
Taking 81 seconds per thousand records might seem pretty good, but it meant my data pump was going to take over 5 hours each night. And that assumed multiple threads so that the writing part of the problem didn’t slow it down any further.
At this point, I could have done the Rails hipster thing, and said “Oh dear, relational databases are slow and useless, let’s use NoSQL”, but instead I decided to investigate further. To find out how much of the overhead was down to JPA 2, I hacked together some raw JDBC code to compare:
JDBC: 0.42 s/krec 0.42 s/krec 0.42 s/krec 0.42 s/krec
So, about 200x faster. Not looking good for JPA 2 and EclipseLink. I turned up the logging level, and took a look at the actual SQL being executed.
The actual column retrieval part of the query looked fine; exactly how I’d written it in the JDBC code, in fact. The problem is that it was wrapped in this:
SELECT * FROM ( SELECT * FROM ( SELECT EL_TEMP.*, ROWNUMBER() OVER() AS EL_ROWNM FROM ( sensible query goes here ) AS EL_TEMP) AS EL_TEMP2 WHERE EL_ROWNM <= ?) AS EL_TEMP3 WHERE EL_ROWNM > ?
This was being done to chop the result table into reasonable slices. I was using a wrapper class which called JPA’s setFirstResult()/setMaxResult() to page through the results, cache them, and hand the objects back to my code. JPA was then generating the nested queries to implement this pagination. Timing the paginated query revealed that it was responsible for the slowdown.
I couldn’t eliminate paging; the dataset was just too big. So I tried altering the code to make the cache and page size much larger. That helped a bit, and allowed me to get the delay down to 4.21s/krec. Still 10x slower than raw JDBC, but 20x faster than it had been.
I looked around for some information about how to best do result pagination with DB2, and found a developerWorks article. I used the techniques from that article to put together a paged query equivalent to the one EclipseLink was generating, ran them both through DB2′s command-line performance measurement tool, and verified that both gave the same output.
The EclipseLink query was 3x faster than the new one.
Next, I decided to see how fast the DB2 command line client could dump the data. It was only slightly faster than the raw JDBC; 0.399 s/krec rather than 0.42.
I reached several conclusions.
Firstly, EclipseLink is pretty good from a performance point of view. Given the problem of issuing paginated queries against a database, chances are EclipseLink will do better than a random non-expert programmer writing raw SQL.
However, that might not be good enough. Pagination can be expensive, so if you’re batch-processing data you might be much better off avoiding ORM entirely. So like most things in software development, ORM is not always the right answer.
I’m also disappointed that there’s still no good way to use JPA to get your query results in Iterator form, so that pagination becomes unnecessary.