CMO & CTO

Closing the Bridge Between Marketing and Technology, By Luis Fernandez


Index Tuning and Analyzers in Lucene

Posted on December 23, 2009 By Luis Fernandez

Search feels fast when the index is honest about what it stores and what it throws away. Lucene gives you all the knobs, but your analyzer and write path decide whether you get speed, recall, or a pile of tiny segments crying for help.

Analyzer choice is a product decision more than a library decision, and Lucene makes that very clear. StandardAnalyzer with Version.LUCENE_29 is a safe default, but it is not a fit for every field you have. IDs, SKUs, and email addresses want KeywordAnalyzer or WhitespaceAnalyzer, while long-form text might want SnowballAnalyzer for stemming or a custom chain with LowerCaseFilter, StopFilter, and PorterStemFilter. For multilingual content, consider per-field analyzers so titles in English and tags in Portuguese do not fight each other; new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29)) lets you plug in a different analyzer per field name. If your users paste accented text, add ASCIIFoldingFilter so café matches cafe without a special query parser. Most issues I see in the wild are not about query syntax at all; they come from a mismatch between how you index and how you search. Always reuse the same analyzer family on both sides of the pipe, for example QueryParser qp = new QueryParser(Version.LUCENE_29, "body", analyzer).
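To see that index-versus-search mismatch before it bites, dump what each analyzer actually emits for a sample value. A minimal sketch against the 2.9 API; the class name, field name, and sample SKU are made up for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerCompare {

    // Collect the tokens an analyzer produces for one field value.
    static List<String> tokens(Analyzer a, String field, String text) throws IOException {
        TokenStream ts = a.tokenStream(field, new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        List<String> out = new ArrayList<String>();
        while (ts.incrementToken()) {
            out.add(term.term());
        }
        ts.close();
        return out;
    }

    public static void main(String[] args) throws IOException {
        String sku = "AB-1234-X"; // illustrative value
        // StandardAnalyzer lowercases and may split the value, so an exact lookup can fail.
        System.out.println(tokens(new StandardAnalyzer(Version.LUCENE_29), "sku", sku));
        // KeywordAnalyzer keeps the whole value as a single, case-preserved token.
        System.out.println(tokens(new KeywordAnalyzer(), "sku", sku));
    }
}
```

Run it once per field you care about and the right analyzer choice usually becomes obvious.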

Index tuning is about tradeoffs you can measure, not one magic setting. If you feed Lucene one document at a time with tiny commits, you will create lots of segments and pay the merge bill later, so batch when you can and let memory breathe with writer.setRAMBufferSizeMB(64.0), or more if the box can handle it. Keep segments larger by lowering merge churn with a higher merge factor, for example writer.setMergeFactor(15) if your write pattern is steady; yes, it is marked as old, but it still works today with the default log merge policy. Skip compound files during bulk loads with writer.setUseCompoundFile(false) for speed, and turn them back on in production if open-file count matters to you. Near-real-time readers landed recently and they are worth a look if you need quick search after a write without a full reopen; try IndexReader r = writer.getReader() and you will see fresh results while keeping throughput decent. Keep in mind that deletions live on until merges reclaim them, so your disk footprint can look larger than your document count suggests, which is normal for an inverted index doing its job. If you profile, watch segment count, pending merges, and the ratio of time in addDocument versus time in commit, then adjust the RAM buffer and merge factor as a pair rather than yanking just one knob.
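Those bulk-load settings fit in one helper. A sketch against the 2.9 IndexWriter API; the method names and the 64 MB / merge factor 15 values are this post's starting points, not universal truths, so measure on your own box:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class BulkLoadTuning {

    public static IndexWriter openForBulkLoad(Directory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                true, // create a fresh index; flip to false to append
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(64.0);   // flush by RAM used, not by document count
        writer.setMergeFactor(15);         // fewer, larger merges with the default log merge policy
        writer.setUseCompoundFile(false);  // faster bulk writes at the cost of more files on disk
        return writer;
    }

    // Near real time: see recent writes without a full reader reopen.
    public static IndexReader freshReader(IndexWriter writer) throws Exception {
        return writer.getReader();
    }
}
```

Pair this with batched addDocument calls and one commit at the end of the load, then re-enable compound files before shipping if file count matters.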

Fields are contracts, and small flags change both memory and scoring. Use Field.Index.NOT_ANALYZED for exact keys and Field.Index.ANALYZED for text; if you need to query a value both ways, index it into two fields. For exact-match fields like IDs or categories, call field.setOmitNorms(true) to save RAM and remove length normalization from scoring, which kills noisy boosts on short fields; on the flip side, keep norms on the body so tf-idf stays meaningful. If you plan to highlight snippets or build fast explain views, store term vectors with positions and offsets via Field.TermVector.WITH_POSITIONS_OFFSETS so you do not reanalyze text later, which keeps your highlighter from becoming the new bottleneck. Set your parser to a sane default with qp.setDefaultOperator(QueryParser.Operator.AND) if your product favors precision, and handle synonyms by expanding during indexing with a SynonymFilter so queries stay simple; if you expand at search time, you pay for that work on every query. Here is a tiny custom analyzer chain that covers the usual needs without getting fancy: TokenStream ts = new PorterStemFilter(new StopFilter(true, new LowerCaseTokenizer(reader), StopAnalyzer.ENGLISH_STOP_WORDS_SET)), which gives you lowercasing, stop words, and stemming in one shot. To keep per-field control simple, wire it up like this: PerFieldAnalyzerWrapper pfa = new PerFieldAnalyzerWrapper(defaultAnalyzer); pfa.addAnalyzer("id", new KeywordAnalyzer()); pfa.addAnalyzer("tags", new WhitespaceAnalyzer()).
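That chain reads better packaged as a reusable Analyzer subclass you can hand to both the writer and the query parser. A minimal sketch using the exact filters quoted above; BodyAnalyzer is a made-up name:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

// Lowercase, drop English stop words, then Porter-stem: the body-text chain from the text.
public class BodyAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new LowerCaseTokenizer(reader);
        ts = new StopFilter(true, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new PorterStemFilter(ts);
    }
}
```

Drop an instance into the PerFieldAnalyzerWrapper as the default, keep KeywordAnalyzer on the id field, and both sides of the pipe stay in sync by construction.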

Here is a sample write path that leans on these ideas in one place, plain and simple: Analyzer analyzer = pfa; IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED); writer.setRAMBufferSizeMB(64); Document doc = new Document(); Field id = new Field("id", idValue, Field.Store.YES, Field.Index.NOT_ANALYZED); id.setOmitNorms(true); doc.add(id); doc.add(new Field("title", titleText, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS)); doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.ANALYZED)); writer.addDocument(doc); writer.commit(). At query time, keep it predictable: QueryParser qp = new QueryParser(Version.LUCENE_29, "body", pfa); qp.setDefaultOperator(QueryParser.Operator.AND); Query q = qp.parse(userInput); TopDocs td = searcher.search(q, 20). If results look odd, inspect the analyzer output with a quick utility that prints tokens so you can see what really went into the index. When the index and the query share the same view of the text, tf-idf does the heavy lifting and your tweaks become small, not desperate. That is the sweet spot.
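That token-printing utility can be as small as this. A sketch against the 2.9 attribute API; TokenDump is a made-up name:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Prints each token an analyzer emits for a field, with character offsets,
// so you can compare what went into the index with what the query parser sees.
public class TokenDump {
    public static void dump(Analyzer analyzer, String field, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term() + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        ts.close();
    }
}
```

Run it with the same analyzer and field name you index with; if the printed tokens do not match what your query parser produces for the same text, you have found the mismatch.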

Great search feels simple because someone made clear choices early.

Categories: Software Architecture, Software Engineering


©2025 CMO & CTO | WordPress Theme by SuperbThemes