Hibernate batch inserts look simple on a slide, then your app crawls the moment you try to push a few hundred thousand rows. I spent a late night nudging a job while watching chatter about new features on Google Plus, and the culprit was not the database. It was the art of the flush.
Why do batch inserts feel slow?
Because the Session is a pack rat. Every new entity goes into the first level cache. No flush means more stuff in memory, more dirty checking, and more time spent thinking than writing. The driver can also send one statement at a time if you do not nudge it.
What actually happens when Hibernate flushes?
On flush, Hibernate figures out insert order, builds JDBC batches if allowed, and pushes them to the driver. With a huge Session, that planning step is heavy. The trick is to flush early and clear often. Keep the Session light, or it will fight back.
Which knobs should I turn first?
Turn on JDBC batching and sorted inserts. Then check stats. Small changes here bring big wins. Here is a minimal set that pays rent.
# hibernate.properties or hibernate.cfg.xml equivalent
hibernate.jdbc.batch_size=50
hibernate.order_inserts=true
hibernate.order_updates=true
hibernate.generate_statistics=true
# Optional for optimistic lock updates
hibernate.jdbc.batch_versioned_data=true
How do I insert a big list without melting memory?
Use a loop and call flush and clear every N rows. This keeps the first level cache small and lets the driver batch the work. You get steady memory and steady throughput.
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
int batchSize = 50; // match hibernate.jdbc.batch_size
int i = 0;
for (MyEntity e : entities) {
    session.save(e);
    if (++i % batchSize == 0) {
        session.flush(); // push the pending inserts as a JDBC batch
        session.clear(); // drop them from the first level cache
    }
}
tx.commit(); // flushes whatever remains
session.close();
Do identity columns ruin batching?
They can. With the IDENTITY strategy Hibernate needs the generated key back after every insert, so it disables JDBC batching and sends inserts one by one. SEQUENCE and TABLE strategies batch much better. If you control the schema, prefer sequences. If you must keep identity, a StatelessSession will not bring batching back, but it does strip the Session overhead from each row.
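If you can switch, the mapping change is small. A minimal sketch, assuming a MY_ENTITY_SEQ sequence exists in the database; all the names here are placeholders:

import javax.persistence.*;

@Entity
public class MyEntity {

    // allocationSize lets Hibernate hand out a block of ids per sequence
    // round trip; matching it to hibernate.jdbc.batch_size works well.
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "my_entity_seq")
    @SequenceGenerator(name = "my_entity_seq",
                       sequenceName = "MY_ENTITY_SEQ", allocationSize = 50)
    private Long id;

    private String name;
}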
Should I use StatelessSession for bulk work?
For pure inserts with no cascades or lazy associations, yes. A StatelessSession keeps no first level cache and does no dirty checking. It is boring and fast for bulk writes. Just remember it bypasses cascades, interceptors, and lifecycle events that a normal Session would fire.
StatelessSession s = sessionFactory.openStatelessSession();
Transaction tx = s.beginTransaction();
for (MyEntity e : entities) {
    s.insert(e);
}
tx.commit();
s.close();
Is my driver batching for real?
Some drivers accept addBatch calls but still send the work row by row. For MySQL you want rewriteBatchedStatements=true in the JDBC URL, or the driver will not rewrite the batch into grouped multi-row inserts. PostgreSQL batches well but still sends multiple statements in a single round trip. Oracle and SQL Server do fine with JDBC batching in most common cases.
jdbc:mysql://localhost:3306/app?useUnicode=true&characterEncoding=utf8&rewriteBatchedStatements=true
What batch size should I pick?
Start with 30 to 50. Move up to 100 if the driver and database keep smiling. Past that point returns usually fade, and some systems even slow down as network buffers fill and server-side parsing costs grow. Keep the flush interval in your loop aligned with hibernate.jdbc.batch_size so the two do not fight.
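If you want more than a rule of thumb, measure. A rough harness sketch; buildFactory and insertAll are hypothetical stand-ins for your configuration code and the flush-and-clear loop above. Each candidate gets its own SessionFactory because hibernate.jdbc.batch_size is fixed when the factory is built.

// Hypothetical sizing run: one SessionFactory per candidate, same data each time.
int[] candidates = {30, 50, 100};
for (int size : candidates) {
    SessionFactory sf = buildFactory(size);   // sets hibernate.jdbc.batch_size=size
    long start = System.currentTimeMillis();
    insertAll(sf, entities, size);            // the loop above, flushing every `size` rows
    long elapsed = System.currentTimeMillis() - start;
    System.out.printf("batch size %d: %d ms%n", size, elapsed);
    sf.close();
}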
How do I keep transactions sane?
Wrap chunks in a transaction and commit after each chunk. If one batch fails, you do not lose the entire job. With one huge commit, redo logs balloon and locks are held longer. With chunked commits you keep the pipeline moving.
int chunk = 1000;
int i = 0;
Transaction tx = session.beginTransaction();
for (MyEntity e : entities) {
    session.save(e);
    if (++i % chunk == 0) {
        session.flush();
        session.clear();
        tx.commit(); // this chunk is safe; a later failure cannot undo it
        tx = session.beginTransaction();
    }
}
// Final partial chunk.
session.flush();
session.clear();
tx.commit();
How do I see what is really going on?
Turn on SQL and bind logging. Check Hibernate statistics. You want to see batch sizes near your setting and low entity count in the Session between flushes.
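The statistics are easiest to read in code. A small sketch, assuming hibernate.generate_statistics=true from the settings earlier; the insert-to-statement ratio is only a rough proxy for the effective batch size:

import org.hibernate.stat.Statistics;

// Requires hibernate.generate_statistics=true.
Statistics stats = sessionFactory.getStatistics();
long inserts = stats.getEntityInsertCount();
long statements = stats.getPrepareStatementCount();
// With healthy batching, far fewer statements are prepared than rows inserted.
System.out.printf("inserts=%d, statements=%d, rows per statement ~ %.1f%n",
        inserts, statements, inserts / (double) Math.max(1, statements));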
# log4j.properties
log4j.logger.org.hibernate.SQL=DEBUG
log4j.logger.org.hibernate.type.descriptor.sql=TRACE
log4j.logger.org.hibernate.engine.jdbc.batch.internal.BatchingBatch=DEBUG
What about insert order and foreign keys?
With hibernate.order_inserts on, Hibernate sorts inserts so parents go first and rows for the same table sit together, which keeps batches intact and keeps foreign keys happy. Keep cascades simple in bulk jobs. If you need to import children, stage the data or insert parents first, then children in a second pass, as sketched below.
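A two-pass import sketch; Parent, Child, and parentIdFor are hypothetical names, and the children carry plain foreign key values rather than live parent references, so the cleared Session cannot trip over detached objects:

// Pass 1: parents only.
int i = 0;
for (Parent p : parents) {
    session.save(p);
    if (++i % 50 == 0) { session.flush(); session.clear(); }
}
session.flush();
session.clear();

// Pass 2: children, wired by id instead of by reference.
i = 0;
for (Child c : children) {
    c.setParentId(parentIdFor(c));   // hypothetical lookup into the staged data
    session.save(c);
    if (++i % 50 == 0) { session.flush(); session.clear(); }
}
session.flush();
session.clear();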
How do I avoid surprises with large payloads?
Watch for triggers and default values; they can quietly change what each insert costs. Keep the entity lean for bulk paths, avoid eager relationships, and prefer plain inserts with simple types. If you must fill a blob, stream it and keep the batch size small for that step.
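For the blob case, Hibernate's LobHelper can stream straight from a file so the bytes never sit in the heap. A sketch, assuming a doc entity with a @Lob-mapped content field; the names and the path are placeholders:

import java.io.File;
import java.io.FileInputStream;
import java.sql.Blob;

File file = new File("/data/payload.bin");
FileInputStream in = new FileInputStream(file);
try {
    // Stream the content instead of materializing a byte array.
    Blob blob = session.getLobHelper().createBlob(in, file.length());
    doc.setContent(blob);   // hypothetical @Lob-mapped field
    session.save(doc);
    session.flush();        // keep blob batches tiny, even size 1
    session.clear();
} finally {
    in.close();
}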
Can I mix reads and writes in the same run?
You can, but mix with care. Reads fill the first level cache. Do bulk reads with a separate Session or scroll results. Then write with a fresh Session so your flush is clean and quick.
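A sketch of that split; SourceRow and transform are hypothetical, and the read side uses a forward-only scroll so rows stream through instead of piling into memory:

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;
import org.hibernate.Transaction;

Session readSession = sessionFactory.openSession();
Session writeSession = sessionFactory.openSession();
Transaction tx = writeSession.beginTransaction();

ScrollableResults rows = readSession.createQuery("from SourceRow")
        .setReadOnly(true)
        .scroll(ScrollMode.FORWARD_ONLY);
int i = 0;
while (rows.next()) {
    SourceRow src = (SourceRow) rows.get(0);
    writeSession.save(transform(src));   // hypothetical mapper
    if (++i % 50 == 0) {
        writeSession.flush();
        writeSession.clear();
        readSession.clear();             // keep the read-side cache small too
    }
}
tx.commit();
rows.close();
readSession.close();
writeSession.close();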
Wrap up
Batch inserts fly when you keep the Session lean, pick a friendly id strategy, nudge the driver, and flush with intent. Start with a small batch size, flush and clear in a loop, and watch the logs. If the job still drags, move to StatelessSession for the bulk step. Simple moves, real speed.