Creation date: 2016-05-16
Friday night at a media company. Editors are bulk-dropping a new season of trailers into the DAM. The DAM Update Asset model wakes up, the queue grows, and your author instance goes from smooth to syrup. Thumbnails crawl in. Someone tries to publish a page and waits. Another person moves a folder and everything retriggers. Slack pings. You look at /var/workflow and feel the room spin.
We have fresh AEM 6.2 bits on the bench and MongoMK folks debating with TarMK folks. ImageMagick sits there, chewing CPU like a champ. The thing is not that AEM workflows are bad. They are chatty by design. If you do not shape them, they will shape your night. This post walks through how to design AEM workflows that can carry weight without taking your author cluster down with them.
Analysis
AEM workflows ride on two pillars. There is the Granite Workflow engine that persists work items under /var/workflow and there is the Sling Jobs layer that powers async work. Launchers listen to repository events and kick off models. Each step writes state. Each transition writes state. On a calm day, this is fine. On a large import, this is a lot of writes.
Bottlenecks show up fast:
- Heavy steps like DAM renditions and metadata extraction eat CPU and I/O. If they run on the same author instance that handles editors, life gets bumpy.
- Audit noise from long models. Every hop writes more nodes. On Oak this means more merges and compaction work.
- Launcher storms when a move or a reupload wakes a model for thousands of assets.
- Script steps that are easy to write but slow. The ECMAScript step is nice for a demo. In bulk, Java wins.
There are proven moves that help right away:
- Transient workflows. In the model editor, tick Transient for high volume models. This skips storing runtime history and reduces writes. For DAM Update Asset this alone can change the curve.
- Offload heavy steps. Use offloading to send certain topics to worker instances. Let authors edit on author A, push rendition work to worker B and C. Keep the step count slim on the author nodes.
- Short models. Break giant models into small ones. Think of a light triage model that tags and routes, then hands off to a separate heavy model on workers.
- Launcher filters. Build tight conditions. Only start on the MIME types and folders you need. No need to resize PDF thumbnails when you only care about videos.
- Thread pools and queues. Tune Sling job queues for your topics. A smaller but steady parallelism beats a huge burst that starves everything else.
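To make the "steady parallelism" point concrete, here is a minimal, self-contained Java sketch. This is plain `java.util.concurrent`, not the Sling Jobs API; the pool size of 4 and the queue bound are illustrative assumptions that mirror the `queue.maxparallel=4` idea discussed later.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SteadyPool {
    public static void main(String[] args) throws Exception {
        // Assumption: 4 workers and a small bounded queue stand in for a tuned
        // Sling job queue. CallerRunsPolicy gives simple back pressure: when
        // the queue is full, the submitter runs the task itself and slows down
        // instead of letting the backlog grow without limit.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(16),
                new ThreadPoolExecutor.CallerRunsPolicy());

        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 100; i++) {   // simulate a bulk upload burst
            pool.submit(done::incrementAndGet); // pretend this is a rendition job
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(done.get()); // 100: all work done, at a fixed width
    }
}
```

The same shape is what a tuned job queue gives you for free: bounded width, a backlog that applies pressure upstream, and no thread starvation for neighbors.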
Code example: a light workflow step that hands heavy work to a job
This keeps the workflow engine lean and lets a dedicated queue do the heavy lifting.
package com.example.aem.workflow;

import com.adobe.granite.workflow.exec.WorkItem;
import com.adobe.granite.workflow.exec.WorkflowProcess;
import com.adobe.granite.workflow.exec.WorkflowSession;
import com.adobe.granite.workflow.metadata.MetaDataMap;
import org.apache.felix.scr.annotations.*;
import org.apache.sling.event.jobs.JobManager;

import java.util.HashMap;
import java.util.Map;

@Component(label = "Scale friendly Resize", immediate = true, metatype = true)
@Service(WorkflowProcess.class)
@Properties({
    @Property(name = "process.label", value = "Scale friendly Resize")
})
public class ResizeProcess implements WorkflowProcess {

    @Reference
    private JobManager jobManager;

    @Override
    public void execute(WorkItem workItem, WorkflowSession session, MetaDataMap args) {
        String path = workItem.getWorkflowData().getPayload().toString();
        Map<String, Object> props = new HashMap<>();
        props.put("path", path);
        props.put("renditions", "web,thumb,tablet");
        // hand off to a Sling job topic, processed by worker instances
        jobManager.addJob("com/example/asset/resize", props);
    }
}

Pair this with a queue config for the topic com/example/asset/resize on worker nodes.
To go further, consider how you start workflows during bulk loads. During a migration or sync, you probably want to disable launchers and run a model on demand, in batches, with back pressure. This prevents a storm of tiny starts that hammer Oak.
// Groovy script example for the AEM Groovy Console
def wfSession = resourceResolver.adaptTo(com.adobe.granite.workflow.WorkflowSession)
def model = wfSession.getModel("/var/workflow/models/custom-dam-update")
def paths = [
    "/content/dam/videos/season1/trailer1.mp4",
    "/content/dam/videos/season1/trailer2.mp4"
]
paths.eachWithIndex { p, i ->
    def data = wfSession.newWorkflowData("JCR_PATH", p)
    wfSession.startWorkflow(model, data)
    if (i > 0 && i % 50 == 0) {
        sleep 1500 // simple back pressure for big runs
    }
}

Also look at ACS AEM Commons tools for bulk workflow control and purging. They save a lot of time.
Risks
- Transient models drop audit. You trade less disk churn for fewer breadcrumbs. For legal or long-lived business trails, keep a non-transient model or log to an external store.
- Offloading drift. If a worker falls out of the topology, topics can pile up. Watch the Offloading Browser and set alerts.
- Launcher loops. A step that writes under the payload path can retrigger the same launcher. Lock down paths and use a guard property like skipProcessing.
- Queue starvation. A single huge topic can steal all threads if you leave default job queues. Separate topics and give them their own limits.
- Move storms. Moving a large folder can fire launchers for each node. For big refactors, pause the launcher, move, then run a targeted workflow once.
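The launcher-loop risk above is worth a concrete sketch. This is plain Java, not the launcher condition API; the `skipProcessing` flag and the `jcr:content/processing` child path follow the conventions used in this post and are naming assumptions, not AEM built-ins.

```java
import java.util.Map;

public class LauncherGuard {
    /**
     * Decide whether a repository change at `path`, with the payload's
     * properties in `props`, should start the workflow. Two rules from the
     * text: ignore our own bookkeeping writes under the processing child,
     * and honor a guard flag set by a previous run.
     */
    static boolean shouldStart(String path, Map<String, Object> props) {
        if (path.contains("/jcr:content/processing")) {
            return false; // our own writes must not retrigger the launcher
        }
        return !Boolean.TRUE.equals(props.get("skipProcessing"));
    }

    public static void main(String[] args) {
        // our own write under the processing child: no restart
        System.out.println(shouldStart(
                "/content/dam/a.mp4/jcr:content/processing/state", Map.of()));
        // guard flag already set by a previous run: no restart
        System.out.println(shouldStart(
                "/content/dam/a.mp4", Map.of("skipProcessing", true)));
        // fresh asset: start the workflow
        System.out.println(shouldStart("/content/dam/a.mp4", Map.of()));
    }
}
```

In real launchers you express the same logic declaratively with path globs and conditions, but the decision table is the same.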
Decision checklist
- Is this model asset heavy or page centric? Different shape and pressure points.
- Can this model be transient without breaking audit needs?
- Which steps are CPU bound and deserve offloading to workers?
- What MIME types and folders should the launcher watch, and which should it ignore?
- Do we need to throttle starts during imports or sync jobs?
- Do we have separate queues for heavy topics with sane max parallel?
- Is there a purge policy for old workflow instances and payload temp data?
- Are publish and dispatcher flush steps isolated from DAM heavy work?
- Are script steps that run often rewritten in Java for speed and safety?
- Do we have alerts on queue growth and blocked jobs?
Config sample: Sling Job Queue for a heavy topic
Create an OSGi factory config of org.apache.sling.event.jobs.QueueConfiguration on worker nodes:
# com.example.resize.queue.cfg
queue.name=resize-worker
queue.topics=[com/example/asset/resize]
queue.type=TOPIC_ROUND_ROBIN
queue.maxparallel=4
queue.keepJobs=true
queue.retrydelay=15000
queue.retries=3

Start small on maxparallel and bump based on CPU and I/O headroom.
Action items
- Map your models. For each one, list steps, expected volume per day, and biggest cost. If you do not know, run a small batch and watch CPU, I/O, and queue depth.
- Make DAM Update Asset transient if your org can live without per asset audit there. Keep a separate light model for legal cases that truly need history.
- Split heavy work. Move image resize, video transcode, and metadata grind into steps that publish jobs to worker topics. Keep author bound steps short.
- Set up workers. Bring up one or two AEM worker nodes with offloading enabled. Pin heavy topics there. Keep editors off those boxes.
- Tune queues. Create separate queue configs for heavy topics. Give publish flush its own lane so editors are not blocked.
- Harden launchers. Add MIME type rules and path rules. For example, ignore the /content/dam/tech temp folders entirely.
- Control imports. During bulk loads, disable the launcher, ingest assets, then start the workflow in batches with a script. Re-enable the launcher after.
- Purge on a schedule. Add a weekly job to purge completed workflows older than a window. Keep /var/workflow lean.
- Move only with plan. For big folder moves, pause launchers, move, then run a single repair model to rebuild references.
- Rewrite slow script steps in Java. Measure before and after. Keep scripts for rare paths and prototypes.
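The "purge on a schedule" item boils down to a window calculation. Here is a hedged, self-contained sketch of the selection logic only, not the AEM purge API; the instance names and timestamps are made up, standing in for nodes under /var/workflow/instances.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PurgeWindow {
    // Select completed instances older than `days` days: these are the
    // ones a weekly purge job would delete to keep /var/workflow lean.
    static List<String> toPurge(Map<String, Instant> completedAt, int days, Instant now) {
        Instant cutoff = now.minus(Duration.ofDays(days));
        return completedAt.entrySet().stream()
                .filter(e -> e.getValue().isBefore(cutoff))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2016-05-16T00:00:00Z");
        Map<String, Instant> done = new HashMap<>();
        done.put("wf-old", Instant.parse("2016-04-01T00:00:00Z"));    // past the window
        done.put("wf-recent", Instant.parse("2016-05-15T00:00:00Z")); // keep for now
        System.out.println(toPurge(done, 7, now)); // [wf-old]
    }
}
```

Whatever runs the purge (a scheduled job, or the ACS AEM Commons tooling mentioned earlier), keeping the window explicit and reviewable like this makes the retention policy easy to audit.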
One last tip. When a step has to touch the payload and write properties, write under a dedicated child like jcr:content/processing. Then set your launcher to ignore that child path. That tiny change saves loops.
AEM 6.2 brings nice touches like transient models and better offloading out of the box. Pair that with a clean split between author and workers, tight launchers, and sensible queues, and your Friday night can be boring in the best way.