Search is eating more of our apps every week. Index sizes grow, queries pile up, and the moment you add faceting, snippets, and sorting by freshness, your single Lucene box starts to sweat. Solr makes a lot of this smoother, and Lucene 3 is a beast in the best sense, but there is a point where one machine is not enough. That is when two old friends step in with a grin: sharding and caching. If you are building on Lucene or Solr right now, or kicking the tires on the very new Elasticsearch wave, this is your cheat sheet from the trenches on scaling search with sharding and caching.
Definitions
Sharding: split one logical index into multiple smaller indexes that live on different machines. Each shard holds a slice of the documents. Queries fan out to all shards, each shard returns top hits, and the coordinator merges and ranks the final result.
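To make the fan out concrete, here is a minimal sketch using Lucene 3's ParallelMultiSearcher, which plays the coordinator role over a set of shard searchers. The /idx/shard paths, field names, and query are placeholders; in real life each shard would sit behind the network on its own box:

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // One searcher per shard; here they are local directories for brevity.
        Searchable[] shards = new Searchable[4];
        for (int i = 0; i < shards.length; i++) {
            shards[i] = new IndexSearcher(
                IndexReader.open(FSDirectory.open(new File("/idx/shard" + i))));
        }
        // The coordinator fans the query out to every shard in parallel
        // and merges the per shard top hits into one ranked list.
        ParallelMultiSearcher coordinator = new ParallelMultiSearcher(shards);
        TopDocs top = coordinator.search(new TermQuery(new Term("title", "lucene")), 10);
        for (ScoreDoc hit : top.scoreDocs) {
            System.out.println(coordinator.doc(hit.doc).get("id") + " score=" + hit.score);
        }
        coordinator.close();
    }
}
```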
Replicas: extra copies of a shard to handle read load and failover. Writes go to the primary and then out to replicas. More replicas means more read throughput and safer nights.
Routing: the rule that decides which shard gets a document or a query. The simple choice is to hash an id. A smarter choice in some cases is to route by tenant or category to keep related data together.
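A hash router fits in a few lines of plain Java. The method names here are mine, not from any library; the bit mask guards against the one hashCode value whose absolute value is still negative:

```java
public class Router {
    // Mask off the sign bit so a hashCode of Integer.MIN_VALUE
    // cannot produce a negative shard number.
    static int shardFor(String docId, int numShards) {
        return (docId.hashCode() & 0x7fffffff) % numShards;
    }

    // Tenant routing: hash the tenant id instead of the document id
    // so all of a tenant's documents land on the same shard.
    static int shardForTenant(String tenantId, int numShards) {
        return shardFor(tenantId, numShards);
    }

    public static void main(String[] args) {
        System.out.println(shardFor("product-12345", 4)); // same id, same shard, always
    }
}
```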
Caching: keep hot data close so repeated work is cheap. In Solr and Lucene you will meet a few types. A query result cache saves the top results for a query string. A filter cache saves bitsets for common filters like category or user permission. A document cache keeps recently fetched stored fields. The operating system page cache helps too since Lucene is very friendly to it. For the network edge, memcached still earns its keep for whole page or fragment caching.
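In raw Lucene 3 the filter cache idea is one wrapper away; in Solr the same effect comes from sending filters as fq parameters so they land in the filterCache. A minimal Lucene sketch, with the field and term invented for the example:

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class Filters {
    // Build the filter once and reuse it across queries; the cached
    // bitset is kept per index reader, so reopening the reader recomputes it.
    static final Filter BOOKS_ONLY = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("category", "books"))));

    static TopDocs searchBooks(IndexSearcher searcher, Query userQuery) throws IOException {
        // First call per reader computes the bitset; every later call reuses it.
        return searcher.search(userQuery, BOOKS_ONLY, 10);
    }
}
```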
Segment merging: Lucene writes new segments and later merges them in the background. Merges can spike IO and memory. Your sharding plan should leave headroom for merge bursts.
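If you are on Lucene 3.1 or later, the merge policy is where you buy that headroom. A sketch with LogByteSizeMergePolicy; the numbers are assumptions to tune against your own IO graphs, not recommendations:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeTuning {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
            Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(10);   // merge fewer segments at once: smaller IO bursts
        mp.setMaxMergeMB(2048);  // segments past this size stop being merge candidates
        cfg.setMergePolicy(mp);
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/idx/shard0")), cfg);
        writer.close();
    }
}
```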
Examples
Store search with a growing catalog. You have about ten million products, each with twenty fields and a couple of facets. A single well tuned box gives you about two hundred queries per second with median latency under one hundred milliseconds while the ninety fifth percentile creeps past five hundred milliseconds during merge storms. Split into four shards with two replicas each. Route by product id. Add a generous filter cache for brand and price buckets. Hit rate climbs, the coordinator merges four partial tops, one per shard since replicas serve copies rather than extra slices, and your slow tail gets tamed without exotic gear.
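With Solr this split is mostly a query parameter away. A SolrJ sketch against the era's CommonsHttpSolrServer; the host names are placeholders, and the fq filters are exactly what feeds the filter cache mentioned above:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StoreSearch {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://search1:8983/solr");
        SolrQuery q = new SolrQuery("wireless mouse");
        // Fan the query out to all four shards; the receiving node merges the partials.
        q.set("shards", "search1:8983/solr,search2:8983/solr,"
                      + "search3:8983/solr,search4:8983/solr");
        q.addFilterQuery("brand:logitech");   // fq terms land in the filterCache
        q.addFilterQuery("price:[10 TO 50]");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound() + " hits");
    }
}
```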
News site with hot queries. Traffic spikes on breaking topics. The same three queries repeat all day. A query result cache with a short time to live and a filter cache for sections clean up most of the load. Then add two replicas per shard so cache hits spread out. You may not need more shards yet. Save sharding for the day when the index no longer fits comfortably in RAM plus page cache.
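Solr's own query result cache lives until the next commit rather than by the clock, so the short time to live trick usually means an edge cache in front of search. A sketch assuming the spymemcached client; runSearch is a placeholder for your real query and render step:

```java
import net.spy.memcached.MemcachedClient;

public class EdgeCache {
    static String cachedSearch(MemcachedClient cache, String queryString)
            throws Exception {
        String key = "results:" + queryString.hashCode();
        String hit = (String) cache.get(key);
        if (hit != null) return hit;              // the hot breaking-news path
        String rendered = runSearch(queryString); // placeholder: query Solr, render HTML
        cache.set(key, 30, rendered);             // 30 second TTL absorbs the spike
        return rendered;
    }

    static String runSearch(String q) {
        return "..."; // placeholder for the real search call
    }
}
```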
Multi tenant app. Thousands of small tenants, each with modest data. Route by tenant id so a single shard can answer a tenant search without touching others. Queries become lighter, caches become sharper because the same filters and query shapes repeat within a tenant. If one tenant grows out of proportion, you can rebalance by moving that tenant to its own shard group.
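The rebalance trick is easiest with a small override table sitting in front of the hash. A sketch; the class and method names are mine:

```java
import java.util.HashMap;
import java.util.Map;

public class TenantRouter {
    private final int numShards;
    // Tenants that outgrew their hash slot get pinned to a dedicated shard group.
    private final Map<String, Integer> overrides = new HashMap<String, Integer>();

    public TenantRouter(int numShards) {
        this.numShards = numShards;
    }

    public int shardFor(String tenantId) {
        Integer pinned = overrides.get(tenantId);
        if (pinned != null) {
            return pinned; // rebalanced tenant, moved off the hash ring
        }
        return (tenantId.hashCode() & 0x7fffffff) % numShards;
    }

    public void pin(String tenantId, int shard) {
        overrides.put(tenantId, shard); // then reindex that tenant onto its new home
    }
}
```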
Counterexamples
Tiny index spread too thin. If your index is small enough to sit warm in memory on one box, sharding just adds network hops and merge cost at the coordinator. Latency goes up, not down. Keep it simple until you outgrow that comfort.
Global sort that needs the full picture. Sorting by a global signal like recency across all documents can hurt when sharded. Each shard sends its top results, but the true global top can hide below the local tops. You need a deeper fetch from each shard or a second pass, which adds work. Better to precompute a time decay score or narrow the query first with filters that match user intent.
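One way to precompute that decay in Lucene 3 is a numeric field written at index time, so each shard can sort on a purely local value. A sketch; the field name and seven day half life are assumptions, and the stored score goes stale as the clock moves, so you refresh it when you reindex:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;

public class Decay {
    // Halve the weight every seven days; tune the half life to your traffic.
    static double decayedScore(double baseScore, long publishedMillis, long nowMillis) {
        double ageDays = (nowMillis - publishedMillis) / 86400000.0;
        return baseScore * Math.pow(0.5, ageDays / 7.0);
    }

    // Index the decayed value so a per shard sort needs no global picture.
    static void addDecayScore(Document doc, double baseScore, long publishedMillis) {
        doc.add(new NumericField("decay_score").setDoubleValue(
            decayedScore(baseScore, publishedMillis, System.currentTimeMillis())));
    }
}
```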
Facet accuracy with sparse data. If facets are sparse and you need exact counts, sharding can produce noisy partials. Per shard caches do not help much because the match sets are small and varied. In that case, a single shard with more RAM and a tuned filter cache might beat a cluster.
Decision rubric
- Start with caching. Turn on query result cache and filter cache. Track hit rate. If hit rate climbs above sixty percent and your tail latency still hurts, consider sharding.
- Watch memory. Aim for hot postings and term dictionaries to live in RAM plus page cache. When index size grows past what one box can keep warm with comfort, split into shards.
- Target tail latency. Shard when the ninety fifth percentile blows past your budget during merges or peak load even after cache tuning and IO tweaks.
- Balance shard count and merge cost. Fewer large shards mean less coordination overhead but heavier merges. More small shards reduce merge spikes but add fan out and merge work at query time. Most teams find a sweet spot around four to eight shards per logical index at first.
- Plan replicas for read load. If you need more reads, add replicas before adding shards. Replicas are simpler and boost cache hit rate.
- Pick routing that matches queries. Hash by id for general use. Route by tenant for multi tenant apps. Route by category if users filter by category almost every time.
- Leave merge headroom. Keep CPU and IO at sixty to seventy percent at peak, or merges will tip you over.
- Measure fan out. Track query fan out count, partial time per shard, and coordinator merge time. If coordinator time dominates, you have too many shards or too little work per shard. A timing sketch follows this list.
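Here is a self contained sketch of that measurement, with a made up ShardClient interface standing in for whatever client talks to your shards, and a simple sort standing in for the real rank merge:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOutTimer {
    // Hypothetical minimal shard client: returns its top document ids for a query.
    interface ShardClient {
        List<String> topIds(String query, int k) throws Exception;
    }

    public static List<String> search(List<ShardClient> shards, final String query)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        final long[] partialNanos = new long[shards.size()];
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (int i = 0; i < shards.size(); i++) {
            final ShardClient shard = shards.get(i);
            final int idx = i;
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() throws Exception {
                    long t0 = System.nanoTime();
                    List<String> top = shard.topIds(query, 10);
                    partialNanos[idx] = System.nanoTime() - t0; // partial time per shard
                    return top;
                }
            }));
        }
        List<String> all = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            all.addAll(f.get());
        }
        long mergeStart = System.nanoTime();
        Collections.sort(all); // stand-in for the real rank merge
        long mergeNanos = System.nanoTime() - mergeStart;
        System.out.println("fan out " + shards.size() + " shards, partials ns "
            + Arrays.toString(partialNanos) + ", merge ns " + mergeNanos);
        pool.shutdown();
        return all;
    }
}
```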
Lessons learned
Sharding is not a prize, it is a tool. The best path is usually cache first, shard second, replicas along the way. Keep routing simple until you hit a clear mismatch with your query patterns. When you shard, size shards so hot data stays warm. Watch your slow tail like a hawk, since users feel that more than the shiny median.
Lucene gives you speed and Solr gives you a grown up home for it. Caching gives you free wins when traffic repeats itself, and it always does. Sharding gives you room to grow when memory says stop. Mix both and your search keeps pace with your app without hand waving or late night pager duty.
If you remember one thing, make it this: cache what repeats, shard what cannot fit, replicate what must not fail. That is the simple loop that keeps query throughput high, keeps latency calm, and keeps your team shipping features instead of chasing spikes.