Clustering on JBoss: Session and State - CMO & CTO (An AI Generated Experiment to the past)

Clustering on JBoss is not a magic lever you flip and walk away. It is a bundle of choices about session and state that either keep your app snappy or melt your heap. I have been wrangling JBoss 4.2 with Apache out front and the story is repeatable enough to share.

Servers come and go. Users do not care. They want their cart, their wizard step, their login kept alive. That is where HTTP session replication and Stateful Session Beans come into play. Same goal. Very different costs.

What are we trying to cluster?

There are two kinds of state you will juggle. Web session state in the servlet container, and EJB state inside the app server. JBoss can cluster both with JGroups and JBossCache under the hood.

Tomcat inside JBoss can replicate the HTTP session. The EJB container can replicate SFSB caches. Both rely on objects being Serializable and on identical class versions on every node.

Do we really want session replication or just stickiness?

The easy win is sticky sessions with Apache mod_jk over AJP. Add a jvmRoute to each JBoss node, send a user to one node, and keep them there while it lives. When a node dies, the user shifts to another node and loses the session unless you also replicate.

If your app can reload the cart or the last screen with a lightweight call, sticky-only is fine. It keeps the network quiet and your GC calmer. If losing state is a support call, you want session replication as a safety net.

How heavy is HTTP session replication on JBoss?

JBoss uses JBossCache over JGroups. You choose sync or async replication, and whether to replicate the whole session or only the attributes that changed. Sync gives stronger guarantees but adds latency to every request that changes the session. Async is faster, but a crash can drop the last few updates.

There is also buddy replication where each node keeps a hot backup on a neighbor instead of blasting the whole cluster. For medium traffic this is a sweet spot. Fewer wires on fire. Quicker failover.
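To make the neighbor idea concrete, here is a toy sketch of picking a buddy as the next member in the cluster view, wrapping around at the end. The class BuddyRing and the node names are made up for illustration. This is not JBossCache code, only the shape of a next-member rule.

```java
import java.util.List;

// Toy model only: picks the "next member" as the backup for a node.
final class BuddyRing {
    private BuddyRing() {}

    /** Returns the member after self in the view, wrapping around; null if self is alone or absent. */
    static String buddyOf(List<String> members, String self) {
        int index = members.indexOf(self);
        if (index < 0 || members.size() < 2) {
            return null; // nobody to back up to
        }
        return members.get((index + 1) % members.size());
    }
}
```

With three nodes, each session lives on two boxes instead of three. Add a fourth node and it still lives on two.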

What makes a session safe to replicate?

Everything in the session must be Serializable. That includes nested fields. If a single field is not, replication fails or the attribute is silently skipped, and you get weird bugs on failover.
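One cheap guard is a unit test that pushes each would-be session attribute through plain Java serialization. The sketch below uses only the JDK. SessionAttributeCheck, Cart, and BadCart are hypothetical names, and the check assumes replication serializes attributes the standard way.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: serializes an attribute in memory so a bad nested
// field shows up in a test run instead of during a failover.
final class SessionAttributeCheck {
    private SessionAttributeCheck() {}

    /** Returns null if the whole graph serializes, else the exception message (typically the class name). */
    static String firstNonSerializable(Object attribute) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
            out.writeObject(attribute);
            out.close();
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new IllegalStateException("unexpected I/O error on an in-memory stream", e);
        }
    }

    // Illustrative model classes, not taken from any real application.
    static final class Cart implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<String> productIds = new ArrayList<String>();
    }

    static final class BadCart implements Serializable {
        private static final long serialVersionUID = 1L;
        final Thread worker = new Thread(); // Thread is not Serializable
    }
}
```

Calling firstNonSerializable on a Cart returns null. On a BadCart it returns a message that names the Thread field's class.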

Keep the session small. Store IDs not full graphs. Cache read only data outside or pull from the database when needed. Big sessions kill throughput, flood the wire, and make GC sad.
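If you want a number instead of a feeling, measure the serialized size. The sketch below compares a list of IDs with a list of full objects using JDK serialization. SessionSizeProbe and Product are made-up names, and the byte count is only a rough proxy for what replication sends, since the real wire format may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

// Hypothetical probe for comparing "IDs only" against "full graph" in a session.
final class SessionSizeProbe {
    private SessionSizeProbe() {}

    /** Serialized size in bytes under plain Java serialization. */
    static int serializedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(value);
            out.close();
        } catch (IOException e) {
            throw new IllegalStateException("value did not serialize", e);
        }
        return bytes.size();
    }

    // Illustrative product with a bulky description field.
    static final class Product implements Serializable {
        private static final long serialVersionUID = 1L;
        final long id;
        final String description;

        Product(long id, String description) {
            this.id = id;
            this.description = description;
        }
    }

    /** Builds products that each carry roughly 500 characters of text. */
    static ArrayList<Product> fullGraph(int count) {
        ArrayList<Product> products = new ArrayList<Product>();
        for (int i = 0; i < count; i++) {
            StringBuilder text = new StringBuilder("Product " + i + ": ");
            while (text.length() < 500) {
                text.append("lorem ipsum ");
            }
            products.add(new Product(i, text.toString()));
        }
        return products;
    }

    /** Builds only the IDs for the same products. */
    static ArrayList<Long> idsOnly(int count) {
        ArrayList<Long> ids = new ArrayList<Long>();
        for (int i = 0; i < count; i++) {
            ids.add(Long.valueOf(i));
        }
        return ids;
    }
}
```

Run serializedSize on both lists for the same count and compare. The gap is what every replicated change would cost you.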

How do we wire sticky sessions the right way?

Set a unique jvmRoute for each node and make sure Apache sends the same user back to that route. The session cookie carries the route as a suffix on the session ID, after a dot. If the route tag does not match a live node, the connector will pick another one and you get a soft failover.

Make sure your web app carries the distributable element in web.xml so Tomcat knows it can replicate. Roll the same exact WAR on all nodes. Same classes, same config, same build. Mixed bits break deserialization.
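When you debug stickiness it helps to read the route off the session ID yourself. The sketch assumes the usual shape of an ID followed by a dot and the jvmRoute, for example A1B2C3.node1. JvmRoute is a hypothetical helper and the sample IDs are invented.

```java
// Hypothetical helper for reading the route tag from a session ID in logs or tests.
final class JvmRoute {
    private JvmRoute() {}

    /** Returns the text after the first dot, or null when the ID carries no route. */
    static String routeOf(String sessionId) {
        if (sessionId == null) {
            return null;
        }
        int dot = sessionId.indexOf('.');
        if (dot < 0 || dot == sessionId.length() - 1) {
            return null;
        }
        return sessionId.substring(dot + 1);
    }

    /** True when the ID is tagged with the route we expect the user to stick to. */
    static boolean isRoutedTo(String sessionId, String expectedRoute) {
        return expectedRoute != null && expectedRoute.equals(routeOf(sessionId));
    }
}
```

If routeOf returns null for real traffic, the node is not tagging its sessions and Apache has nothing to stick to.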

What about Stateful Session Beans?

SFSB can track a conversation for you and JBoss can replicate them. It works, but it is not light. The container passivates and activates beans, and replication adds more churn. You will feel it under load.

Use SFSB for tight conversations that truly need server side state. Keep the bean small and short lived. Push anything big to the database or cache. Often the better move is to keep a conversation key in the HTTP session and reload state per request.
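The conversation key idea can be sketched with a plain map standing in for the database or shared cache. WizardConversations and its methods are hypothetical. In a real app the key would go in the HTTP session and the store would be durable.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical store: the session holds only the key, the state lives elsewhere.
final class WizardConversations {
    // Stand-in for a database table or shared cache.
    private final Map<String, Map<String, String>> store =
            new ConcurrentHashMap<String, Map<String, String>>();

    /** Starts a conversation and returns the small key that would go in the HTTP session. */
    String begin() {
        String key = UUID.randomUUID().toString();
        store.put(key, new ConcurrentHashMap<String, String>());
        return key;
    }

    void put(String key, String field, String value) {
        require(key).put(field, value);
    }

    String get(String key, String field) {
        return require(key).get(field);
    }

    /** Ends the conversation so state does not pile up. */
    void end(String key) {
        store.remove(key);
    }

    private Map<String, String> require(String key) {
        Map<String, String> state = store.get(key);
        if (state == null) {
            throw new IllegalStateException("unknown or finished conversation: " + key);
        }
        return state;
    }
}
```

What replicates is one short string per user. The heavy state is loaded per request from wherever it lives.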

How do JGroups and multicast affect the party?

Clusters discover each other with JGroups. Out of the box it talks over multicast. If your network scopes are wrong, you might see nodes from a different test cluster join by accident. Give each cluster its own partition name, multicast address, and port.

On chatty apps, consider async replication and buddy groups. On money moves where you accept a slower page for stronger state, go sync. Either way, test on the same switch shape you will run in prod.

What do we watch during failover?

Kill a node while running a load test. Watch Apache logs, JBoss logs, and user cookies. Sessions should pick a backup and keep going if replication is on. SFSB calls should be retried by the HA proxy and land on a peer.

Look for long GC pauses, big network spikes, and session sizes creeping up. Those are your early warnings. A cluster that looks fine at five users can fall down at five hundred with the same bad shape in play.

What are the classic gotchas?

Non-serializable fields hidden in your model. A sneaky logger or a thread handle inside a DTO. Fix by marking them transient or by moving them out of session and SFSB state.

jvmRoute mismatch between Apache and server.xml. The worker name on the Apache side has to equal the jvmRoute on the node. You see sticky not working and blame the cluster. It is the route tag on the cookie. Make them match and test with a single user first.
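Here is one way the transient fix can look. The field is skipped by serialization and rebuilt lazily after the object lands on another node. CustomerDto is an invented class, and the round trip through byte arrays only stands in for replication.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.logging.Logger;

// Illustrative DTO that keeps a logger without breaking serialization.
final class CustomerDto implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String customerId;
    // transient: not written out, so it is null after deserialization.
    private transient Logger log;

    CustomerDto(String customerId) {
        this.customerId = customerId;
    }

    String getCustomerId() {
        return customerId;
    }

    // Rebuilds the logger on first use, including after a failover.
    private Logger log() {
        if (log == null) {
            log = Logger.getLogger(CustomerDto.class.getName());
        }
        return log;
    }

    void touch() {
        log().fine("touched " + customerId);
    }

    /** Serializes and deserializes in memory, as a stand-in for a hop to another node. */
    static CustomerDto roundTrip(CustomerDto dto) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(dto);
            out.close();
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return (CustomerDto) in.readObject();
        } catch (IOException e) {
            throw new IllegalStateException("round trip failed", e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class missing on the receiving side", e);
        }
    }
}
```

A static logger is often simpler still, since static fields are never part of the serialized form.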

Huge sessions that look fine on local dev. In a cluster they replicate on every attribute change and your p95 latency climbs. Use attribute-level replication and call setAttribute only when something really changes.
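Updating only on real change can be a tiny guard in front of the session. In this sketch a plain map stands in for HttpSession and a counter stands in for replication events. QuietSession is hypothetical, and whether a given write actually replicates depends on the replication trigger you configured.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper: skips writes when the value is unchanged.
final class QuietSession {
    private final Map<String, Object> attributes = new HashMap<String, Object>();
    private int writes; // stands in for replication events in this sketch

    /** Writes only when the value differs from what is stored; returns true if it wrote. */
    boolean setIfChanged(String name, Object value) {
        Object current = attributes.get(name);
        boolean same = (current == null) ? value == null : current.equals(value);
        if (same) {
            return false;
        }
        attributes.put(name, value);
        writes++;
        return true;
    }

    Object get(String name) {
        return attributes.get(name);
    }

    int writeCount() {
        return writes;
    }
}
```

The guard relies on equals, so it only helps for values with a sensible equals, such as strings, numbers, and small immutable objects.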

Chatty SFSB with big graphs. Every method call stirs the cache. Trim the bean or switch the flow to stateless with a token in the session.

How do we choose between sync and async?

Ask what happens if you lose the last couple of writes before a crash. If that breaks money or data integrity, pick sync and accept the extra round trip. If the worst case is a user repeating a step, pick async and get better page times.
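A toy model makes the trade visible. In sync mode the write reaches the backup before the caller returns. In async mode it sits in a queue until a flush, and a crash loses whatever is still queued. ReplicationToy is invented for this post and says nothing about how JBossCache is built inside.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: two maps and a queue, no network, no JBossCache.
final class ReplicationToy {
    private final Map<String, String> primary = new HashMap<String, String>();
    private final Map<String, String> backup = new HashMap<String, String>();
    private final List<String[]> pending = new ArrayList<String[]>();
    private final boolean sync;

    ReplicationToy(boolean sync) {
        this.sync = sync;
    }

    void write(String key, String value) {
        primary.put(key, value);
        if (sync) {
            backup.put(key, value); // caller waits until the backup has it
        } else {
            pending.add(new String[] {key, value}); // queued, caller returns at once
        }
    }

    /** Async only: pushes queued writes to the backup. */
    void flush() {
        for (String[] entry : pending) {
            backup.put(entry[0], entry[1]);
        }
        pending.clear();
    }

    /** Primary dies. Queued writes are gone and the backup is all that is left. */
    Map<String, String> crashAndFailOver() {
        pending.clear();
        primary.clear();
        return backup;
    }
}
```

Write step 1, flush, write step 2, crash. Async comes back with step 1 and sync comes back with step 2. Whether that difference matters is the whole question.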

On a busy site you can mix modes. Use sync for tiny critical flags. Use async for carts and wizards. You can also snapshot only at the end of a request to cut chatter during the page build.

What about remote clients and HA JNDI?

Remote EJB clients can talk to HA JNDI. They get a cluster aware proxy that retries on another node when one dies. Keep client timeouts sane and avoid holding on to remote references forever.
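On the client side this mostly comes down to the JNDI environment. The sketch builds one with a comma-separated list of nodes so the lookup can try the next node when one is down. The host names are placeholders and port 1100 is assumed to be the HA JNDI port on your install, so check your own binding. HaJndiEnv is a hypothetical helper.

```java
import java.util.Properties;
import javax.naming.Context;

// Hypothetical helper that builds a JNDI environment listing several cluster nodes.
final class HaJndiEnv {
    private HaJndiEnv() {}

    /** Joins host:port entries with commas and fills in the JBoss naming factory settings. */
    static Properties forNodes(String... hostPorts) {
        StringBuilder url = new StringBuilder();
        for (String hostPort : hostPorts) {
            if (url.length() > 0) {
                url.append(',');
            }
            url.append(hostPort);
        }
        Properties env = new Properties();
        env.setProperty(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
        env.setProperty(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");
        env.setProperty(Context.PROVIDER_URL, url.toString());
        return env;
    }
}
```

A client would pass the result of forNodes("node1:1100", "node2:1100") to new InitialContext(env) with the JBoss client jars on the classpath. That last step is left out so the sketch runs on a bare JDK.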

If you can keep calls inside the web tier, do that. Crossing the wire from a fat client adds one more place to stall. When you must do it, keep calls coarse and carry only the data you need.

How do we test like we mean it?

Load the site, then bounce one node every few minutes. Pull the network cable on a node. Change one class and redeploy it on only one node to see what breaks. You will find class drift, session bloat, and timing bugs long before your users do.

Track a single user through cookies and logs. Practice the failover path as a playbook. When that pager rings at night you will thank your past self.

So what is the sane default?

Start with sticky sessions. Keep the session tiny and serializable. Turn on session replication with buddy groups only where you cannot lose state. Keep SFSB for the rare spots where a server side conversation truly pays for itself.

Pick sync or async based on risk, not on habit. Give each cluster its own JGroups channel. Test failover with traffic, not just with curl. Ship the same bits to every node. Watch your GC, your wire, and your logs.

Do this and JBoss clustering gives you the thing we all want. Users keep moving, boxes can rest in peace, and your app keeps its cool.
