Retries, DLQs, and Idempotency

Posted on August 12, 2010 By Luis Fernandez

Retries, DLQs, and Idempotency: a night in the trenches

My pager went off at 2 AM. Our nightly reports were stuck. The CRM team could not see yesterday’s sales. The queue depth graph looked like a hockey stick and the dead letter queue was blinking like a Christmas tree. This was not a big data war story, just plain old JMS doing what it does when something goes sideways: retries without mercy.

We had a new validation rule on customer emails. One bad message slipped in and every retry hit the same exception. The broker did its job, delivered again and again until it gave up and parked the message in the dead letter queue. Meanwhile, redeliveries slowed the whole pipe. The CFO did not care about our excuses. They wanted their numbers before breakfast.

We fixed the bug, drained the dead letter queue, and went back to sleep. The next day we baked in what we should have had from day one: clear retry rules, a sane dead letter path, and idempotent consumers. This post is a field guide so you do not meet my 2 AM friend.

How JMS actually behaves when things fail

Retries come from two places: your code and the broker. With JMS, the big switch is whether the session is transacted. In a transacted session, you call commit when you are done. If you call rollback or an exception bubbles up and the session rolls back, the broker will redeliver the message later.
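
For reference, this is what that contract looks like with the raw JMS API. A minimal sketch, assuming an ActiveMQ broker on localhost; the queue name, the five second timeout, and the process helper are placeholders:

// javax.jms.* imports assumed.
// Plain transacted consumer loop: commit on success, roll back on failure
// so the broker redelivers the message later, subject to the redelivery policy.
ConnectionFactory factory = new org.apache.activemq.ActiveMQConnectionFactory("tcp://localhost:61616");
Connection connection = factory.createConnection();
connection.start();

Session session = connection.createSession(true, Session.SESSION_TRANSACTED); // transacted
MessageConsumer consumer = session.createConsumer(session.createQueue("orders"));

while (true) {
  Message msg = consumer.receive(5000);   // wait up to five seconds for a message
  if (msg == null) continue;
  try {
    process(msg);                         // placeholder for your business logic, may throw
    session.commit();                     // success: the broker forgets the message
  } catch (Exception e) {
    session.rollback();                   // failure: the broker will redeliver later
  }
}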

Two headers help you reason about this:

  • JMSRedelivered: a boolean set by the provider when the message is being redelivered.
  • JMSXDeliveryCount: an optional integer property that some providers set. ActiveMQ sets it. WebSphere MQ exposes similar information through provider-specific properties.

When you mix client-side retries with broker retries, you create storms. Keep one source of truth. I prefer broker-driven redelivery paired with transacted consumers. Set a cap on redeliveries and a back-off so you do not pound a broken downstream.

ActiveMQ redelivery policy with Spring

<bean id="connectionFactory" class="org.apache.activemq.ActiveMQConnectionFactory">
  <property name="brokerURL" value="tcp://localhost:61616"/>
  <property name="redeliveryPolicy">
    <bean class="org.apache.activemq.RedeliveryPolicy">
      <property name="useExponentialBackOff" value="true"/>
      <property name="initialRedeliveryDelay" value="1000"/>        <!-- 1 second -->
      <property name="maximumRedeliveries" value="5"/>
      <property name="backOffMultiplier" value="2.0"/>
      <property name="maximumRedeliveryDelay" value="60000"/>        <!-- cap at 60 seconds -->
    </bean>
  </property>
</bean>

ActiveMQ sends messages that exceed the redelivery cap to ActiveMQ.DLQ by default. WebSphere MQ uses SYSTEM.DEAD.LETTER.QUEUE. TIBCO EMS and other brokers follow similar conventions. Your first alert in production should be any non-zero depth in the DLQ.
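
If you want a quick, portable way to read that depth from code, for a cron job or a smoke test, a QueueBrowser works on any JMS provider without consuming messages. A rough sketch, assuming the ActiveMQ default DLQ name and a hypothetical alert helper:

// Count messages sitting in the DLQ with a QueueBrowser (non-destructive read).
// "ActiveMQ.DLQ" is the ActiveMQ default; adjust for your broker.
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
QueueBrowser browser = session.createBrowser(session.createQueue("ActiveMQ.DLQ"));

int depth = 0;
java.util.Enumeration<?> messages = browser.getEnumeration();
while (messages.hasMoreElements()) {
  messages.nextElement();
  depth++;
}
browser.close();

if (depth > 0) {
  alert("DLQ depth is " + depth); // placeholder: wire this to whatever wakes a human
}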

Transacted consumer pattern

import javax.jms.Message;
import javax.jms.TextMessage;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrdersListener implements javax.jms.MessageListener {

  private static final Logger log = LoggerFactory.getLogger(OrdersListener.class);

  private final OrderService service;

  public OrdersListener(OrderService service) {
    this.service = service;
  }

  @Override
  public void onMessage(Message msg) {
    try {
      TextMessage tm = (TextMessage) msg;
      String json = tm.getText();
      service.process(json); // may throw
      // commit happens outside if using container managed transactions
      // or call session.commit() if you manage the session
    } catch (TransientException e) {
      // application-defined runtime exception for failures worth retrying:
      // let the container roll back so the broker redelivers with back-off
      throw e;
    } catch (PermanentException e) {
      // application-defined runtime exception for poison messages:
      // send to the DLQ on purpose if you own that policy,
      // else mark as handled and commit to avoid loops
      log.error("Permanent failure, sending to DLQ", e);
      sendToDlq(msg); // helper that publishes the message to your DLQ destination
      // commit so the broker does not redeliver
    } catch (Exception e) {
      // unknown (including JMSException from getText), treat as transient for safety
      throw new RuntimeException(e);
    }
  }
}

Key point: only throw when you want a retry. Commit when you are done or when you want to stop retries.

Poison message detection

int attempts = msg.propertyExists("JMSXDeliveryCount")
  ? msg.getIntProperty("JMSXDeliveryCount")
  : 1;

if (attempts >= 5) {
  // stop the loop, route aside
  sendToDlq(msg);
  // commit
}

You can let the broker move it to the DLQ after the maximum number of redeliveries, or do it yourself. I like to set the broker policy and keep my consumer simple.

Idempotency: make at least once safe

JMS gives you at least once delivery. Exactly once is a fantasy under most conditions. The cure is idempotent consumers. Process the same message twice and get the same end state. The trick is a stable key per message and a fast check to see if you have processed it.

Ask producers to set a custom property like IdempotencyKey. Use a natural key if you have one, for example order id plus version. Store keys in a table with a unique constraint and win the race with the database.

CREATE TABLE processed_messages (
  message_key  VARCHAR(128) PRIMARY KEY,  -- "key" is a reserved word in some databases
  processed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

public void process(String json, Message msg) throws JMSException {
  String key = msg.getStringProperty("IdempotencyKey");
  if (key == null || key.isEmpty()) {
    // fall back to JMSMessageID if you must
    key = msg.getJMSMessageID();
  }

  try {
    insertKey(key); // INSERT INTO processed_messages(message_key) VALUES (?)
                    // fails on duplicate thanks to the primary key
    // do the side effects now
    apply(json);
  } catch (DuplicateKeyException dup) {
    // already processed, skip side effects
    log.info("Duplicate, skipping key {}", key);
  }
}

This pattern is boring and that is the goal. No mystery retries. No phantom double charges. Your consumers can crash after the insert and before the side effect, so place the insert and the side effect in the same transaction whenever you can. If the side effect spans systems, pick the one that is easiest to correct and make it idempotent too, for example use an upsert by key or treat the operation as a set.
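
One way to get that single transaction with Spring is a TransactionTemplate that wraps both the key insert and the database side effect. A sketch, assuming a JdbcTemplate, the processed_messages table above, and the same hypothetical apply helper; the DuplicateKeyException handling from the previous snippet still wraps the call:

// org.springframework.transaction.support and org.springframework.jdbc.core imports assumed.
// Insert the idempotency key and apply the side effect in one database transaction:
// if either fails, both roll back and the broker redelivery gets a clean retry.
private final TransactionTemplate tx;   // built from a DataSourceTransactionManager
private final JdbcTemplate jdbc;

public void processOnce(final String key, final String json) {
  tx.execute(new TransactionCallbackWithoutResult() {
    @Override
    protected void doInTransactionWithoutResult(TransactionStatus status) {
      jdbc.update("INSERT INTO processed_messages (message_key) VALUES (?)", key); // fails on a duplicate key
      apply(json); // database side effect, same transaction as the key insert
    }
  });
}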

Producer side hints

  • Set a strong key: IdempotencyKey or a domain key such as order id plus version (see the sketch after this list).
  • Leave JMSTimestamp enabled; the provider sets it on send and it helps with time-based rules and debugging.
  • Use JMSCorrelationID to tie request and response.
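
A minimal producer sketch with those hints in place; the queue name, the key format, and the json, orderId, version, and requestId variables are illustrative:

// Producer side: stamp each message with a stable key and a correlation ID.
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
MessageProducer producer = session.createProducer(session.createQueue("orders"));

TextMessage msg = session.createTextMessage(json);
msg.setStringProperty("IdempotencyKey", "order-" + orderId + "-v" + version);
msg.setJMSCorrelationID(requestId);   // ties a later response back to this request
// JMSTimestamp is set by the provider on send as long as it is not disabled
producer.send(msg);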

Manager view: what to ask your team

You do not need to read every line of Spring XML. You do need to ask for three things and hold the line.

  • Clear retry policy. Max attempts, back-off plan, and who decides when to stop. This must be in code and in a runbook.
  • Dead letter queue policy. Where do poison messages go, who owns that queue, how fast do we triage, what is the playbook for replay or drop.
  • Idempotency. Ask for proof. A demo that sends the same message twice and shows a single charge or a single row created.

Add alerts on DLQ depth, on retry spikes, and on consumer lag. Wire these to something that wakes a human. Pager or phone, your choice. A slow retry storm can eat your night and your budget. Back-off policies reduce heat on shared systems like payment gateways and CRMs. If a partner is down, switch to a queue that holds work for later and stop hammering a broken endpoint.

Do a short postmortem every time the DLQ gets a visitor. One page. What failed, how we spotted it, what we changed. Reward teams that delete alerts by fixing root causes. You will see fewer 2 AM calls and your data quality will improve without a big project.

Your challenge for the next two days

  1. Inventory queues. List your top five JMS destinations. Write down the DLQ name for each.
  2. Set caps and back-off. If you run ActiveMQ, set maximumRedeliveries and exponential back-off. If you run WebSphere MQ or EMS, apply the equivalent.
  3. Add an IdempotencyKey to one producer. Roll out the duplicate key table and the consumer check for one flow.
  4. Alert on DLQ depth. Page when depth is greater than zero for more than five minutes.
  5. Run a fire drill. Put a known poison message on a test queue. Watch it move to the DLQ. Practice replay with a one-liner.

Quick replay tip

// ActiveMQ 5.x style pseudo code for replaying a single message from the DLQ
Message m = dlqConsumer.receive(1000);
if (m != null) {
  // strip poison headers so it does not jump straight back to the DLQ;
  // note that clearProperties also drops custom properties such as IdempotencyKey,
  // so copy anything you still need before clearing
  m.clearProperties();
  producer.send(liveQueue, m);
}

Final thought: last week Google put Wave on ice. That is a reminder that shiny tools come and go, while messages keep flowing. Good queues are boring. Make your retries gentle, your DLQs loud, and your consumers idempotent. When the pager rings, you will already know what to do.
