Versioning and Queries in JCR - CMO & CTO (An AI Generated Experiment to the past)

“Content without memory is gossip. A repository with memory is a promise.”
notes from a late night commit message

If you spend your days near a Java Content Repository you already know two topics keep coming back in standups. Versioning and queries. You either broke a branch of content and need to rewind time, or you wrote a query that looks smart but eats the CPU for breakfast. After a few projects with Jackrabbit and friends, this is my field kit. It is not theory. It is what keeps builds green and releases calm.

A short story about time travel with mix:versionable

We had a content model where pages lived under /content/site, and each page node wrapped a jcr:content node with real properties. We turned on mix:versionable and felt smart. Then one afternoon someone edited a template, saved, and wanted to roll back just that one page. The restore brought back more than they expected, because we forgot that only the nodes with mix:versionable keep a history. The parent without the mixin did not care about time, so child order bounced. That taught us a simple rule. Make the exact nodes you want to restore versionable. Not the ancestors. Not the siblings. The ones you need to rewind during a fire drill.

That same week we met the frozen node. We tried to edit a property on a version snapshot. No luck. A frozen node is a read only picture. To change content you always go back to the live node, checkout, edit, checkin. Snapshots are the consequences, not the origin.

The night the query went rogue

Another day, same sprint. A report page was timing out. The query looked harmless. It searched for all pages with a tag and a publish date, then sorted by date. There was no path restriction. It walked the entire repo. On our laptops it was fine. On staging with a few hundred thousand nodes it started a bonfire. The fix was almost silly. We added an ISDESCENDANTNODE constraint to keep it inside /content/site. We also filtered by node type to skip binaries. It went from minutes to seconds.

I keep that night in my pocket. Every JCR query wants a path fence. Give it one and it behaves. Leave it free and it eats the house.

Deep dive one. Versioning that does not bite

JCR versioning is solid once you wire the basics. The center pieces are mix:versionable, VersionManager, and the version history. Here is the flow I keep as muscle memory:

Add mix:versionable to the node you want to track.
Call checkout, change the node, save, then checkin.
Never modify jcr:baseVersion yourself. That is a link the repo manages.
Restores go through the VersionManager. The older restore methods on Workspace still exist, but JCR 2.0 marks them deprecated.

import javax.jcr.*;
import javax.jcr.version.*;

public class VersioningBasics {

  public void makeVersionable(Session session, String path) throws RepositoryException {
    Node node = session.getNode(path);
    if (!node.isNodeType("mix:versionable")) {
      node.addMixin("mix:versionable");
      session.save();
    }
  }

  public Version checkinChange(Session session, String path) throws RepositoryException {
    VersionManager vm = session.getWorkspace().getVersionManager();
    // Leave the checked in (read only) state so the node can be edited
    if (!vm.isCheckedOut(path)) {
      vm.checkout(path);
    }
    Node node = session.getNode(path);
    node.setProperty("title", "New title at " + System.currentTimeMillis());
    session.save();
    // Create a new version
    return vm.checkin(path);
  }

  public void restorePrevious(Session session, String path) throws RepositoryException {
    VersionManager vm = session.getWorkspace().getVersionManager();
    VersionHistory vh = vm.getVersionHistory(path);
    // Step back one version along the linear predecessor chain
    Version base = vm.getBaseVersion(path);
    Version previous = base.getLinearPredecessor();
    // The root version is only a placeholder; trying to restore it throws a VersionException
    if (previous != null && !previous.isSame(vh.getRootVersion())) {
      vm.restore(previous, true);
    }
  }

  public void listHistory(Session session, String path) throws RepositoryException {
    VersionManager vm = session.getWorkspace().getVersionManager();
    VersionHistory vh = vm.getVersionHistory(path);
    VersionIterator it = vh.getAllVersions();
    while (it.hasNext()) {
      Version v = it.nextVersion();
      Node frozen = v.getFrozenNode();
      // Read snapshot properties from the frozen node
      String frozenTitle = frozen.hasProperty("title") ? frozen.getProperty("title").getString() : "(no title)";
      System.out.println(v.getName() + " -- " + frozenTitle);
    }
  }
}

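Gotchas that hurt once. The root version appears as soon as you save a node with mix:versionable, before any checkin. That one is not a real change, it is a starting point that carries no content. Your first checkin creates the first real version. The boolean you pass to restore is removeExisting. It decides what happens when a restored node has the same identifier as a node somewhere else in the workspace. With true the existing node is removed, with false you get an ItemExistsException. What happens to children is decided by the OnParentVersion setting of each child in the node type definition. Be careful when the node has many children, because children declared with COPY are part of the snapshot and come back with the parent. If you need a label for business users, use the version label feature. It gives a human name to a snapshot.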

For content branches keep this sword in its sheath. You can use merge across workspaces, but that is advanced terrain. If you do, watch for jcr:mergeFailed markers. They point to nodes that need a manual decision.

Deep dive two. JCR queries you can trust

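JCR 2.0 gives you JCR SQL2 and its object model twin, JQOM. XPath from JCR 1.0 is deprecated in the spec but still runs on Jackrabbit. SQL2 is the one to keep by your side. It is expressive, the grammar is fixed in JCR 2.0, and it goes into the Query API as a plain string.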

Typical tasks with SQL2:

Find nodes under a path with a property filter.
Search full text in a subtree.
Order by a date and return a small page of rows.

// Pages under /content/site using a specific template
String sql2 =
  "SELECT * FROM [cq:PageContent] AS c " +
  "WHERE ISDESCENDANTNODE(c, '/content/site') " +
  "AND c.[cq:template] = '/apps/site/templates/article'";

// Binary assets with a tag and a text hit
String assets =
  "SELECT [jcr:path], [jcr:score] FROM [dam:AssetContent] AS a " +
  "WHERE ISDESCENDANTNODE(a, '/content/dam/site') " +
  "AND a.[cq:tags] = 'site:featured' " +
  "AND CONTAINS(a.*, 'camera') " +
  "ORDER BY [jcr:score] DESC";

// Recent pages by date with paging
String recent =
  "SELECT c.* FROM [cq:PageContent] AS c " +
  "WHERE ISDESCENDANTNODE(c, '/content/site') " +
  "AND c.[publishDate] IS NOT NULL " +
  "ORDER BY c.[publishDate] DESC";

QueryManager qm = session.getWorkspace().getQueryManager();
Query q = qm.createQuery(sql2, Query.JCR_SQL2);
q.setLimit(50);
q.setOffset(0);
QueryResult r = q.execute();
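// getRows() returns a RowIterator, which is not Iterable, so no for-each here
RowIterator rows = r.getRows();
while (rows.hasNext()) {
  Row row = rows.nextRow();
  // Read columns or walk to nodes
}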
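The limit and offset above are hard coded. A minimal sketch of how a page index could be turned into those two numbers, assuming a zero-based index and a ceiling on page size. PageWindow and MAX_PAGE_SIZE are made-up names for illustration, not part of the JCR API.

```java
/** Hypothetical value holder: turns a zero-based page index into limit and offset. */
final class PageWindow {

    /** Assumed ceiling so a caller cannot ask for a giant page. Tune for your app. */
    static final int MAX_PAGE_SIZE = 100;

    final long limit;
    final long offset;

    private PageWindow(long limit, long offset) {
        this.limit = limit;
        this.offset = offset;
    }

    static PageWindow of(int pageIndex, int pageSize) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0");
        }
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
        // Clamp first, then derive the offset from the clamped size
        long limit = Math.min(pageSize, MAX_PAGE_SIZE);
        return new PageWindow(limit, (long) pageIndex * limit);
    }
}
```

With a window in hand, the call site would read q.setLimit(w.limit) and q.setOffset(w.offset). Both setters take a long, so no casting is needed.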

XPath still shows up in old code, so here is the same vibe in that style:

// Pages under /content/site by template
String xp = "/jcr:root/content/site//element(*, cq:PageContent)[@cq:template='/apps/site/templates/article']";

// Assets with a tag and text
String xpAssets = "/jcr:root/content/dam/site//element(*, dam:AssetContent)[jcr:contains(., 'camera') and @cq:tags='site:featured']";

Query qx = qm.createQuery(xp, Query.XPATH);
QueryResult xr = qx.execute();

Working habits that pay off:

Always include a path fence with ISDESCENDANTNODE or ISCHILDNODE.
Always narrow by node type. Querying [nt:base] is a red card.
Return only what you read. In SQL2, list the columns you need like [jcr:path] or a specific property. Star looks easy but costs memory.
Apply filters on properties, not on functions of properties. Let the index do the heavy lifting.
Use setLimit and keep pages small. You can stream large results but your app probably does not need that.
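One way to make the first three habits hard to forget is to build the statement through a small helper that refuses to run without a fence. This is a sketch, not a library. FencedSql2 and its methods are hypothetical names, and it only covers a single equality filter. The escaping follows the SQL2 rule that a single quote inside a string literal is doubled.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: every statement it builds has a node type, a path fence and named columns. */
final class FencedSql2 {

    private FencedSql2() {}

    /** Wraps a value as a SQL2 string literal, doubling embedded single quotes. */
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Wraps a JCR name in brackets so names with a colon are valid in SQL2. */
    static String name(String jcrName) {
        if (jcrName.isEmpty() || jcrName.contains("[") || jcrName.contains("]")) {
            throw new IllegalArgumentException("Not a usable JCR name: " + jcrName);
        }
        return "[" + jcrName + "]";
    }

    static String build(String nodeType, String rootPath, List<String> columns,
                        String property, String value) {
        // Refuse the repository root: that is no fence at all
        if (rootPath == null || !rootPath.startsWith("/") || rootPath.equals("/")) {
            throw new IllegalArgumentException("Query needs a path fence below the root: " + rootPath);
        }
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("List the columns you read");
        }
        String cols = columns.stream()
                .map(c -> "n." + name(c))
                .collect(Collectors.joining(", "));
        return "SELECT " + cols
                + " FROM " + name(nodeType) + " AS n"
                + " WHERE ISDESCENDANTNODE(n, " + literal(rootPath) + ")"
                + " AND n." + name(property) + " = " + literal(value);
    }
}
```

The returned string goes into qm.createQuery(statement, Query.JCR_SQL2) like any other. For values that come from users, bind variables on the Query object are another option worth a look.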

On old scripts I still see property names with typos. JCR will not warn you. The query runs and returns nothing. When in doubt, start with a wide query inside a tight path and print out one row to inspect the available property names. Then add filters.

Deep dive three. Indexes, speed, and how not to wake the on call phone

There are two common engines today. Jackrabbit classic with Lucene as a search index, and the newer Oak that aims at the next era. If you are on AEM or CRX you might be on classic right now with an eye on Oak. The lessons are similar. Give the repository what it needs to answer your query without a walk of the entire tree.

Classic Jackrabbit uses a Lucene index behind the scenes. You can tune the SearchIndex in repository.xml and set things like merging and analyzer. For most projects the defaults are fine. What really helps is good content modeling. Keep the properties you filter on at the node you query. Avoid nesting the filter property in a random child. You want an index that can hit a term by path and type quickly.

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${rep.home}/workspaces/${wsp.name}/index"/>
  <param name="supportHighlighting" value="true"/>
  <param name="minMergeDocs" value="1000"/>
  <param name="maxMergeDocs" value="100000"/>
  <param name="mergeFactor" value="10"/>
</SearchIndex>

On Oak the story is more explicit. Queries need matching oak:index nodes. A property filter without a property index can turn into a traversal. You do not want that on prod. Add a property index for fields you query often, and a Lucene index for full text.

// Create a property index for "publishDate" on cq:PageContent
// This is expressed as repository content under /oak:index
/oak:index/publishDateIndex
  jcr:primaryType = "oak:QueryIndexDefinition"
  type = "property"
  propertyNames = ["publishDate"]
  declaringNodeTypes = ["cq:PageContent"]
  reindex = true

// Simple Lucene full text index
/oak:index/lucene
  jcr:primaryType = "oak:QueryIndexDefinition"
  type = "lucene"
  async = "async"

Two field tips for Oak. The async flag means the index updates a bit after the content changes. Your query might not see a fresh edit for a short moment. If you are writing tests, add a small wait or commit again to give the index time. Also, use the explain feature to see which index a query uses. In AEM there is a query tool that prints the plan. If your plan says traversal, change something.
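For the small wait in tests, a fixed sleep is either too long or too short. A polling helper with a deadline is a common alternative. The sketch below is plain Java with no Oak calls. IndexWait is a made-up name, and the condition you pass in is assumed to run your query and report whether the expected row showed up.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: polls a condition until it holds or a timeout passes. */
final class IndexWait {

    private IndexWait() {}

    /**
     * Returns true as soon as the condition holds, false if the timeout
     * elapses first or the thread is interrupted while waiting.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Subtraction keeps the comparison safe if nanoTime wraps around
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test this would wrap the query, something like waitUntil(() -> queryFindsTheNewPage(), 10000, 200), where queryFindsTheNewPage is whatever check your test already has. The timeout should sit comfortably above the async indexing interval of your setup.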

On both engines, make peace with path restrictions. They are the cheapest filter you have. They cut the search down before the property checks run. When you need global searches, keep them on well indexed type families like assets or pages, not on a random mix under the repo root.

Bonus tips from the trenches

Labels for humans. After you checkin a meaningful state, label it. Business folks remember names better than version ids.
Store dates as dates. Use Calendar properties, not strings. Sorting works and index ranges are fast.
Think in node types. Custom types keep your queries clean. Searching [my:Article] reads better than fishing inside a bag of [nt:unstructured].
Observation is your friend. If you must react to changes, listen for events and update a denormalized summary node that answers your queries fast. For example, a count of featured articles per section.
Keep binaries out of the way. Do not query under the DAM if you are looking for pages. Separate trees win.
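The date tip pays off when you write range filters. In SQL2 a date constant is written as CAST('...' AS DATE) with an ISO 8601 timestamp inside. A minimal sketch of building such a filter, assuming the publishDate property from the examples above and UTC timestamps. Sql2Dates is a hypothetical helper name.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: formats instants as SQL2 date literals and builds a range filter. */
final class Sql2Dates {

    // ISO 8601 with milliseconds and a zone designator, rendered in UTC
    private static final DateTimeFormatter ISO_MILLIS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").withZone(ZoneOffset.UTC);

    private Sql2Dates() {}

    static String dateLiteral(Instant instant) {
        return "CAST('" + ISO_MILLIS.format(instant) + "' AS DATE)";
    }

    /** Half open range: from is included, to is excluded. */
    static String publishedBetween(String selector, Instant fromInclusive, Instant toExclusive) {
        return selector + ".[publishDate] >= " + dateLiteral(fromInclusive)
                + " AND " + selector + ".[publishDate] < " + dateLiteral(toExclusive);
    }
}
```

The fragment slots into a WHERE clause after the path fence. This only works as a real range comparison when publishDate is stored as a date property. With strings you are back to comparing text.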

Reflective close

Versioning keeps your story straight. Queries tell the story back to you. When both are simple and explicit, the repo feels like a co worker rather than a riddle. Add mix:versionable only where you need time travel. Checkin often with intent. Write JCR SQL2 with a path fence and a clear type. Give the indexes a chance to shine. It is not glamour work. It is the quiet path to features that do not wake the on call phone.

We are in a good moment for content tech. Jackrabbit is steady, Oak is growing, and the JCR 2.0 spec gives a common tongue. The tools are ready. The rest is craft. Keep your nodes honest, your queries polite, and your versions named like you mean it.
