CMO & CTO

Closing the Bridge Between Marketing and Technology, By Luis Fernandez


Sizing and Eviction: Keeping Caches Healthy

Posted on August 14, 2010 By Luis Fernandez

Caches get sick. They cough up outages, drop hit ratios, and eat more heap than they should. Last night in the war room someone asked me, “Is EHCache broken or did we break it?” I sipped coffee and said, “Neither. The cache is just the mirror. It reflects what we feed it and what we ask of it.” That started a good chat that I want to write down while it is fresh.

Teammate: We doubled traffic after the last push and latency went up. Hit rate looks fine for a while then tanks. GC goes wild.

Me: Sounds like sizing and eviction. If the cache were a fridge, we stuffed it with party food and forgot to pick what to toss first. Let’s make it healthy again.

Evidence section

I like numbers more than gut feelings. We ran a quick load test on a product detail service backed by EHCache. Here is the setup: Java 6 on a mid-range box, Tomcat, Spring 3, Hibernate 3 with its second-level cache off for this test, and EHCache fronting a DAO with 250 thousand product rows. Read-heavy workload with small bursts of writes during an import job.

  • Baseline without cache: median 95 ms, p95 310 ms, throughput 450 rps at 75 percent CPU
  • EHCache defaultish config with tiny heap store and disk overflow: hit rate starts at 92 percent then falls to 68 percent after 12 minutes, p95 bounces between 140 ms and 800 ms, long GC pauses over 600 ms
  • Right sized heap store with LFU eviction and no disk overflow: steady 96 to 97 percent hit rate, p95 120 to 160 ms, GC minor and quick

Two patterns showed up.

  • Churn kills cache health. A heap store that is too small causes constant evictions. Then the next request pulls the evicted entry again. The hit rate chart looks like teeth.
  • Disk overflow hides pain. It keeps the app alive when memory is tight, but it adds IO and serialization cost. With bursty traffic it becomes a cliff. You can hear the disk sigh.

A simple rule helped us reason about the size: approximate the working set. If your site serves 20 thousand hot keys during your peak hour and entries are about 3 KB each, that is around 60 MB for values plus overhead. In Java objects are not free. Double it for comfort and GC slack. So, 120 to 160 MB for that cache feels right. If you run multiple caches, split the pie and leave heap for the rest of the app.
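That back-of-the-envelope math is easy to keep honest with a tiny helper. A minimal sketch; the key count and entry size are the assumptions you plug in from your own traffic:

```java
// Back-of-the-envelope cache sizing: raw working-set bytes, doubled
// to leave room for Java object overhead and GC slack. Inputs are
// your own estimates, not measured values.
public class CacheSizing {

    static long estimateBytes(long hotKeys, long bytesPerEntry) {
        long workingSet = hotKeys * bytesPerEntry; // raw value bytes
        return workingSet * 2;                     // overhead + slack
    }

    public static void main(String[] args) {
        long estimate = estimateBytes(20000L, 3 * 1024L); // 20k keys, ~3 KB each
        System.out.println(estimate / (1024L * 1024L) + " MB"); // ~117 MB
    }
}
```

If you run several caches, run the numbers per cache and check that the sum still leaves heap for the rest of the app.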

Also watch time to live and time to idle. A long TTL keeps data around but increases the chance of stale reads. A short TTL reduces staleness but increases misses. In the test above, TTL at 10 minutes with TTI at 2 minutes kept hot items warm without locking the old stuff in memory.

Implementation notes

Here is a sane starting point for EHCache that has served us well on a few Java apps this week. Tweak the numbers to your workload. The keys are sizing for the working set, picking the right eviction policy, and controlling disk usage.

<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="ehcache.xsd"
         updateCheck="false"
         monitoring="autodetect"
         dynamicConfig="true">

  <diskStore path="/var/tmp/ehcache"/>

  <defaultCache
      maxElementsInMemory="50000"
      eternal="false"
      timeToIdleSeconds="120"
      timeToLiveSeconds="600"
      overflowToDisk="false"
      diskPersistent="false"
      diskExpiryThreadIntervalSeconds="60"
      memoryStoreEvictionPolicy="LFU" />

  <cache name="productCache"
      maxElementsInMemory="80000"
      eternal="false"
      timeToIdleSeconds="180"
      timeToLiveSeconds="900"
      overflowToDisk="false"
      memoryStoreEvictionPolicy="LFU" />

  <cache name="categoryCache"
      maxElementsInMemory="1000"
      eternal="true"
      overflowToDisk="false"
      memoryStoreEvictionPolicy="LRU" />

</ehcache>

A few notes on the choices above:

  • LFU for spiky catalogs. Least frequently used keeps the long term hot keys. LRU works fine for simple browsing flows. FIFO is rare in my world.
  • No overflow to disk for reads on the hot path. If you need disk, consider a separate cache for long tail items that do not sit in the hot path.
  • Shorter TTI than TTL to let stale items fade if they are not touched, while keeping a reasonable ceiling on lifetime.
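To feel the difference between the policies, plain Java can fake an LRU in a few lines using LinkedHashMap in access order. A toy sketch, not EHCache internals; an LFU would track a hit count per key and evict the lowest instead:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache: LinkedHashMap in access order evicts the least
// recently used entry once the cap is exceeded. EHCache's LRU policy
// behaves like this conceptually; LFU evicts by hit count instead.
public class ToyLru<K, V> extends LinkedHashMap<K, V> {
    private final int cap;

    public ToyLru(int cap) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.cap = cap;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > cap;    // drop the least recently used entry
    }
}
```

Touching a key moves it to the back of the queue, so a recently read entry survives the next eviction even if it was inserted first.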

If you wire EHCache by hand, the Java side is tiny:

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

CacheManager manager = CacheManager.create();
Cache productCache = manager.getCache("productCache");

// Hand-rolled read-through: check the cache, fall back to the DAO on a miss.
Element el = productCache.get("sku:" + sku);
if (el == null) {
    Product p = dao.loadProduct(sku);
    productCache.put(new Element("sku:" + sku, p));
    return p;
}
return (Product) el.getObjectValue();

With Spring, you can use the EHCache manager bean and let your services call it cleanly.

<bean id="ehCacheManager" class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean">
  <property name="configLocation" value="classpath:ehcache.xml"/>
</bean>

<bean id="productCache" class="org.springframework.cache.ehcache.EhCacheFactoryBean">
  <property name="cacheManager" ref="ehCacheManager"/>
  <property name="cacheName" value="productCache"/>
</bean>

For sizing, expose JMX and watch live numbers. It takes the guesswork out.

// Register the EHCache MBeans once at startup, then browse them in
// jconsole or VisualVM with the MBeans plugin:
//   ManagementService.registerMBeans(manager,
//       ManagementFactory.getPlatformMBeanServer(), true, true, true, true);
// The ObjectName looks like:
//   net.sf.ehcache:type=Cache,CacheManager=__DEFAULT__,name=productCache
// Check attributes: HitRate, MemoryStoreObjectCount, InMemoryHits, EvictionCount

My quick sizing loop:

  • Start with maxElementsInMemory equal to the estimated working set
  • Load test to peak and look at EvictionCount. If evictions grow under steady load, raise the cap until evictions flatten
  • Keep an eye on Old Gen on GC logs. If full GCs arrive during steady traffic, you probably went too far

Risks

Caches feel simple until they bite. The common bites I see:

  • Stale reads. If your data has user-facing money fields or stock, do not rely on long TTLs. Use a short TTL with a refresh path or explicit invalidation on writes
  • Stampede. One popular key expires and a hundred threads rush to rebuild it. Use a soft-lock pattern or stash a short-lived placeholder while one thread refreshes
  • Big objects. Serializing fat graphs of entities makes disk overflow and replication heavy. Cache view models or trimmed DTOs
  • Uneven keys. A few hot keys can dominate memory. LFU helps, but sometimes you need a split cache or even a small local prefetch map in the service
  • GC pressure. Too large a heap store pushes you into long collections. Short pauses beat long pauses under traffic. Smaller but steady is better for most sites
  • Disk store surprises. On shared boxes the disk path gets slow or fills up. If you must use disk, put it on a fast local path and monitor space and IO wait
  • Cluster expectations. Replication across nodes can help read hit rates, but it carries network and serialization cost. Keep replicated caches small and focused
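The stampede item deserves a sketch. This is not an EHCache API, just the bare pattern in plain java.util.concurrent, with class and method names of my own choosing: the first thread to miss installs a FutureTask and does the real load, and every concurrent miss for the same key waits on that task instead of hammering the database.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.FutureTask;

// Per-key rebuild guard: the first caller for a missing key installs a
// FutureTask and runs the load; concurrent callers find the existing
// task and block on it, so the backing store sees one load per key.
public class StampedeGuard<K, V> {
    private final ConcurrentMap<K, FutureTask<V>> inFlight =
            new ConcurrentHashMap<K, FutureTask<V>>();

    public V load(K key, Callable<V> loader) throws Exception {
        FutureTask<V> task = new FutureTask<V>(loader);
        FutureTask<V> existing = inFlight.putIfAbsent(key, task);
        if (existing == null) {
            try {
                task.run();           // we won the race: do the real load
                return task.get();
            } finally {
                inFlight.remove(key); // let future misses trigger a reload
            }
        }
        return existing.get();        // another thread is already loading
    }
}
```

In real code the winner would also put the result back into the cache before removing the in-flight entry; the sketch only shows the one-loader-per-key part.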

Also, do not forget invalidations on writes. In Hibernate land, mix EHCache with second-level caching only after you understand where the truth lives. If the database is the source of truth, a clean remove-on-write keeps the story straight. If you need near real-time cache coherence across nodes, take a look at Terracotta with EHCache, which has been getting more attention since the acquisition. It suits some use cases but still needs careful sizing and testing.

Graceful exit

There is a simple rhythm for healthy caches. Measure. Right size to the working set. Pick eviction that matches your traffic. Keep disk for non critical paths. Watch staleness. Tuning EHCache is less about magic flags and more about matching your app shape to memory and time.

If you are short on time, start with this practical checklist:

  • Estimate working set in entries and bytes
  • Set maxElementsInMemory using that estimate with buffer
  • Pick LFU for catalogs and LRU for browsing flows
  • Use TTI shorter than TTL to let cold keys expire naturally
  • Keep overflowToDisk off on hot caches and on only for cold long tail
  • Load test and watch hit rate, evictions, and GC side by side

We talk a lot about frameworks and tooling this summer. Between new Spring bits, Rails still buzzing, and more sites going social, it is easy to forget small knobs like cache sizing. Those knobs pay rent every day. If your cache feels feverish, breathe, grab the metrics, and nudge the size and eviction until the charts calm down. Your users will feel the difference before your pager does.

If you try this and see weirdness, drop the shape of your keys and entry sizes, and I will share a few patterns that worked for us on news feeds, product pages, and account dashboards.

Filed under: General Software, Software Engineering
