When your content repo starts panting in production, Apache Jackrabbit usually gets blamed first. In truth, scaling Jackrabbit is less about magic flags and more about a few non-negotiable basics. Clustering gets you headroom, but only if the pieces are in the right places.
Before we jump in, context helps. Teams shipping on Sling and AEM are loading up more binaries, more nodes, more everything. Docker just went 1.0 and ops folks are excited, but Jackrabbit still wants boring, predictable foundations. Let us set those.
What does clustering mean in Jackrabbit?
In Jackrabbit classic, a cluster is a set of repository instances that share the same content state and keep in sync through a shared journal. Each node runs its own JVM, serves its own clients, and writes changes that every other node replays from the journal.
Sharing happens at the storage layer, not through chatter between nodes. No gossip ring. No leader election. Just a journal that all nodes read and write in order.
Why bother clustering Jackrabbit?
Two big wins. Availability, since one node can go down without taking the repo with it. And throughput, mainly for reads. You can also deploy without downtime by rotating nodes off and on, which your editors will love during peak hours.
Clustering is not a silver bullet for slow queries or bad content models. It multiplies what you already have. Good storage and healthy queries get better. Bad ones echo across more servers.
Which parts must be shared and which must be local?
Three core pieces matter: the Persistence Manager, the Data Store for binaries, and the Journal. Most production setups use a database-backed Persistence Manager, a shared file system for binaries, and a database-backed Journal.
The Persistence Manager and the Journal should live on the same database server, or at least the same class of database with the correct isolation level. The Data Store must be shared across cluster nodes, or you will duplicate blobs and risk losing binaries.
What is a clean mental model for the data flow?
Write happens on Node A. The Persistence Manager stores bundles and references. The Data Store writes the binary only if missing. Then the Journal records the change. Nodes B and C poll the Journal, replay the change, and update their local search index.
Think of the Journal as the truth for sequencing. If it jams, the cluster lags. If it is quick, nodes stay tight.
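If you like seeing that flow in code, here is a minimal JCR sketch of the write side, assuming you already hold a session against node A. The comments map each step to the piece of the cluster that handles it.

```java
import javax.jcr.Node;
import javax.jcr.Session;

public class WriteOnNodeA {

    // Assumes sessionA is a live JCR session against cluster node A.
    public static void writePage(Session sessionA) throws Exception {
        // The node and property changes become bundles in the
        // Persistence Manager when save() runs.
        Node content = sessionA.getRootNode().addNode("content", "nt:unstructured");
        Node page = content.addNode("press-release", "nt:unstructured");
        page.setProperty("jcr:title", "Quarterly results");

        // save() persists the bundles, appends a record to the Journal,
        // and updates node A's local Lucene index. Nodes B and C pick up
        // the same change on their next Journal sync.
        sessionA.save();
    }
}
```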
Which database is a good fit for the Persistence Manager and the Journal?
Pick a reliable relational database you already run well. PostgreSQL, MySQL, and Oracle are all fine with proper settings. Keep the repo in its own schema. Give it predictable IOPS and low latency. Plan for growth.
Use a transaction isolation level that matches Jackrabbit's expectations. READ COMMITTED is the usual pick. Avoid quirky autocommit tricks. If you see deadlocks on the Bundle tables or Journal tables, check transaction sizes, slow storage, and long-held connections first.
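A quick way to confirm what your driver actually hands out is a plain JDBC check against the repository database. The URL and credentials below are placeholders for your own setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class CheckIsolation {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL and credentials for the repository database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://dbhost:5432/jackrabbit", "jackrabbit", "secret")) {

            // Jackrabbit expects plain READ COMMITTED semantics.
            con.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);

            System.out.println("isolation = " + con.getTransactionIsolation()
                    + " (READ COMMITTED is " + Connection.TRANSACTION_READ_COMMITTED + ")");
            System.out.println("autocommit = " + con.getAutoCommit());
        }
    }
}
```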
Does the Data Store need to be shared?
Yes. The Data Store holds the heavy stuff. Videos, images, PDFs, you name it. A shared FileDataStore on a stable network file system keeps those binaries available to all nodes without duplication.
If the Data Store sits on NFS, go for simple and proven mounts. Keep it close to the app servers. Do not put Lucene indexes on NFS. Only binaries. Stick to a layout that ops can monitor and back up cleanly.
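For a feel of what ends up where, here is a hedged sketch of storing a PDF through the JCR API, assuming a live session and an existing /content node. The stream lands in the shared Data Store, while the database keeps only a small reference.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.Session;

public class StoreBinary {

    // Assumes "session" is a live JCR session and /content already exists.
    public static void storePdf(Session session) throws Exception {
        Node parent = session.getNode("/content");
        Node file = parent.addNode("handbook.pdf", "nt:file");
        Node resource = file.addNode("jcr:content", "nt:resource");
        resource.setProperty("jcr:mimeType", "application/pdf");

        // Stream the binary; Jackrabbit writes it into the shared Data Store
        // and keeps only a small reference in the Persistence Manager.
        try (InputStream in = Files.newInputStream(Paths.get("/tmp/handbook.pdf"))) {
            Binary binary = session.getValueFactory().createBinary(in);
            resource.setProperty("jcr:data", binary);
            binary.dispose();
        }
        session.save();
    }
}
```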
How should I treat the search index in a cluster?
Each cluster node keeps a local Lucene index. Nodes catch up by replaying the Journal. This keeps the index fast and the disk local. Putting the index on shared storage is a quick way to tears.
If indexing falls behind, watch the Journal lag and the indexing queue. Slow disks and noisy neighbors will show up here first. Give each node good local SSDs if you can, or at least a tier that is not busy with other services.
What are the must have cluster settings?
Add a unique clusterId per node. Point all nodes to the same DatabaseJournal config. Set a reasonable syncDelay to balance freshness and database load; the value is in milliseconds, and many teams live in the 2 to 5 second range.
Size caches for your workload. Jackrabbit has several caches for nodes, properties, and paths. More is not always better. Measure the hit rates and tune with intention. Watch garbage collection in the Data Store as well, and schedule it during quiet hours.
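One way to keep a shared repository.xml while giving every node its own identity is the cluster node id system property. The sketch below is an assumption-heavy example: the property name and the paths come from the classic Jackrabbit clustering docs, so verify them against your version before relying on them.

```java
import org.apache.jackrabbit.core.RepositoryImpl;
import org.apache.jackrabbit.core.config.RepositoryConfig;

public class StartClusterNode {
    public static void main(String[] args) throws Exception {
        // Unique per machine. Jackrabbit can also read the id from the
        // Cluster element in repository.xml; the property name below is the
        // one documented for classic Jackrabbit, so double check it against
        // your version.
        System.setProperty("org.apache.jackrabbit.core.cluster.node_id", "node-a");

        // Same repository.xml on every node, local repository home per node.
        RepositoryConfig config = RepositoryConfig.create(
                "/opt/jackrabbit/repository.xml", "/opt/jackrabbit/repo-home");
        RepositoryImpl repository = RepositoryImpl.create(config);

        // ... serve traffic ...

        repository.shutdown();
    }
}
```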
How big does the Journal get and how do I keep it healthy?
The Journal is an append only log. It can grow into millions of entries in busy repos. On restart, a node replays from its last known revision. Long replays slow startup and hide surprises until traffic hits the node.
Use the Journal cleanup tools that ship with your version. Archive old revisions. Keep the table indexed. If you see replay times creeping up, plan a tidy up before it becomes a fire drill.
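A rough health check is easy with plain JDBC, as sketched below. The table and column names assume the default DatabaseJournal schema with an empty schemaObjectPrefix, so adjust them to your config.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JournalSize {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; assumes the default JOURNAL table
        // created by DatabaseJournal with no schemaObjectPrefix.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://dbhost:5432/jackrabbit", "jackrabbit", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT COUNT(*) AS entries, MIN(REVISION_ID) AS oldest, "
                 + "MAX(REVISION_ID) AS newest FROM JOURNAL")) {
            if (rs.next()) {
                System.out.println("journal entries: " + rs.getLong("entries")
                        + ", revisions " + rs.getLong("oldest")
                        + " .. " + rs.getLong("newest"));
            }
        }
    }
}
```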
Can I mix different versions of Jackrabbit in a cluster?
Do not mix versions in the same cluster. Schema changes and subtle behavior differences will bite. Upgrade by taking nodes out, upgrading, then bringing them back, one at a time. Keep configs symmetric.
The same rule applies to custom Persistence Managers and custom SearchIndex classes. Either all nodes use the same versions or you get inconsistent state.
What about session stickiness and load balancers?
Jackrabbit clustering does not require the HTTP layer to be sticky. The repository is behind your app code. If your app relies on web session state that is not replicated, then your load balancer needs to be sticky for the app, not for Jackrabbit itself.
That said, many real stacks use sticky routing for convenience. Just make sure cluster nodes share content through the Journal and shared Data Store, not through the load balancer.
What does a safe rollout look like?
Start with a single node that uses the target database and shared Data Store. Get it stable under load. Add the Journal config and let it run alone. Confirm writes land in the Journal and that restarts replay cleanly.
Clone the node, change the clusterId, point to the same database and Data Store, keep the index local, and bring it up. Watch the logs for revision sync. Put traffic on both. Verify observation events arrive on both.
What are the classic footguns?
- Same clusterId on two nodes. They step on each other. Give each node a unique value.
- Local Data Store on one node. Binaries vanish for others. Make it shared for all nodes.
- Search index on shared storage. Index gets corrupt or slow. Keep it local to each node.
- Clock skew. Odd ordering and event delays. Sync NTP everywhere.
- Mixed JDBC drivers or isolation settings. One node stalls the others. Standardize.
- Overeager antivirus or backup on the Data Store path. Locks and slow reads. Exclude it.
- Database with surprise connection limits. Nodes stall. Size the pool and the DB together.
How do observation and events work across the cluster?
Events produced by a change on one node reach the others after the Journal sync. You get a small delay tied to syncDelay and replay speed. If your app logic depends on instant events, rethink that path or design around eventual delivery.
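Here is a hedged sketch of a listener on node B that logs those cross-node events, assuming a live session. Note that the noLocal flag only filters events generated by the listener's own session; changes replayed from other cluster nodes still arrive either way.

```java
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

public class ClusterEventLogger {

    // Assumes sessionB is a live JCR session on cluster node B.
    public static void register(Session sessionB) throws Exception {
        ObservationManager om = sessionB.getWorkspace().getObservationManager();

        EventListener listener = new EventListener() {
            public void onEvent(EventIterator events) {
                while (events.hasNext()) {
                    Event event = events.nextEvent();
                    try {
                        // Arrives roughly one syncDelay after the write on node A.
                        System.out.println("change seen on node B: " + event.getPath());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        };

        // Listen for node and property changes under /content, including
        // changes replayed from other cluster nodes. The final noLocal=false
        // only controls events from this session, not cluster-external ones.
        om.addEventListener(listener,
                Event.NODE_ADDED | Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
                "/content", true, null, null, false);
    }
}
```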
For monitoring, expose the Jackrabbit JMX beans if you run in an app server that supports it. Track the last seen Journal revision and the indexing backlog. Those two numbers will tell you when a node is falling behind.
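If JMX is not an option, the Journal tables can answer the lag question directly. The sketch below assumes the default DatabaseJournal schema and a version that records each node's consumed revision in a LOCAL_REVISIONS table rather than a local revision file, so treat the table and column names as assumptions to verify.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JournalLag {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details. LOCAL_REVISIONS holds the last
        // revision each cluster node has consumed, GLOBAL_REVISION the
        // latest revision written; both names assume an empty prefix.
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://dbhost:5432/jackrabbit", "jackrabbit", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT l.JOURNAL_ID, g.REVISION_ID - l.REVISION_ID AS behind_by "
                 + "FROM LOCAL_REVISIONS l, GLOBAL_REVISION g")) {
            while (rs.next()) {
                System.out.println(rs.getString("JOURNAL_ID")
                        + " is " + rs.getLong("behind_by") + " revisions behind");
            }
        }
    }
}
```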
What about Jackrabbit Oak and the new story?
Oak is the new store used by fresh AEM releases and it changes the game. With Oak Document storage you lean on MongoDB for clustering. With Tar storage you get speed on a single node. That is a different article. If you run classic Jackrabbit today, the basics in this post still carry you.
How do I pick storage for the Data Store?
FileDataStore on a stable network share is the common pick. It is simple and it works. Keep the path structure default, give it space, and back it up with a strategy that matches your recovery targets.
If you are tempted by clever distributed file systems, test how they behave under rename and stream close patterns. Jackrabbit writes are simple but sensitive to latency spikes and odd locking rules.
What should I monitor day two?
- Journal lag per node.
- Lucene indexing queue size and time to catch up after bursts.
- Database slow queries on Bundle tables and Journal tables.
- Data Store free space and garbage collection duration.
- JVM GC pauses during heavy writes or reindex.
Set alerts where it hurts. A few well chosen metrics beat dashboards full of noise.
What does a good content model buy me in a cluster?
Everything. Deep paths with hot siblings cause write contention. Huge properties hurt caching. Oversized binaries block threads. A clean model with balanced hierarchies and binary streaming keeps load predictable and reduces Journal churn.
If you tune nothing else, tune queries and access patterns. Clustering multiplies good choices.
Compact conclusion
Clustering Jackrabbit is not mysterious. Share the Persistence Manager via a solid database. Share the Data Store on reliable storage. Keep the Journal healthy and give each node a unique identity. Store the Lucene index locally. Watch lag and size caches with intent.
When these basics are in place, scaling Jackrabbit is less drama and more routine. You get higher read throughput, safer rollouts, and a calm path to grow. Start small, keep it boring, and let the cluster work for you.