2 minute read

In 2015, search infrastructure was undergoing a quiet revolution.

For decades, relational databases had been the go-to solution for storing and querying data. They were dependable, transactional, and came with a robust ecosystem. But once you tried to do anything remotely like free-text search or relevance scoring at scale, you hit a wall.

This was the world before Elasticsearch started becoming mainstream in the developer consciousness — a world of LIKE '%foo%', full table scans, and frustrating performance regressions. Elasticsearch wasn’t just a new database — it was a completely different mental model for working with data.

And at the heart of it was a deceptively simple concept: TF-IDF, scaled to the distributed world.


Let’s be clear: relational databases are excellent at structured querying. They are ACID-compliant, highly optimized for joins, and time-tested for transactional consistency.

But they are not designed for relevance-based ranking.

Here’s why:

  • Text fields are opaque blobs to the RDBMS optimizer
  • There’s no concept of term frequency or inverse document frequency
  • Joins and LIKE clauses don’t scale for multi-word queries
  • Ranking relevance means scoring — and SQL doesn’t natively support that in a meaningful, performant way

So when we tried to implement a basic document search engine on top of PostgreSQL or Oracle back in the day, we ended up hacking our way through full scans, materialized views, or custom ranking tables. It was a mess.

Enter Elasticsearch, built on top of Apache Lucene, bringing real-time, distributed, scalable search to the masses.


TF-IDF in Theory — And Why It’s Expensive

TF-IDF stands for Term Frequency - Inverse Document Frequency. It’s a statistical measure of how relevant a term is to a document in a corpus.

Formulaically:

TF(t, d) = Number of times term t appears in document d
IDF(t) = log(N / DF), where N = total documents, DF = documents with term t

The final score is often:

Score(t, d) = TF(t, d) * IDF(t)

Simple idea. Powerful results.

But when your corpus has millions of documents across hundreds of nodes, calculating and maintaining IDF scores becomes computationally heavy.

In an RDBMS world, you’d be forced to compute aggregates across the full corpus to calculate DF, and then normalize each score on query-time. It’s painfully inefficient.

Lucene — and by extension, Elasticsearch — solved this beautifully.


Lucene Segments: Immutable, Indexed, and Optimized

Lucene doesn’t store documents the way a database does. It stores inverted indexes, mapping terms to document IDs, and it organizes these indexes into segments.

Each segment is a self-contained index. It contains:

  • A term dictionary
  • Posting lists
  • Stored fields
  • Norms for scoring

These segments are immutable, meaning once written, they are never modified. New documents go into new segments. Over time, Lucene performs background merges, combining smaller segments into larger ones, discarding deleted documents, and optimizing search performance.

This immutability makes indexing fast, safe, and parallelizable.

And here’s the kicker: IDF is computed per segment.

So rather than waiting to calculate global statistics across all documents, Lucene can:

  • Score documents locally using per-segment stats
  • Merge and reconcile scores over time
  • Cache frequently accessed terms

This design made TF-IDF computation practical and fast even on massive indexes.


Elasticsearch: Distributed Search as a First-Class Primitive

Elasticsearch extended Lucene with:

  • Distributed sharding (each shard is a Lucene index)
  • Coordinated querying across shards
  • Percolation (reverse queries)
  • Real-time indexing and near real-time search
  • Custom analyzers for language-aware tokenization

When you fire a query in Elasticsearch:

  1. It hits the coordinating node
  2. Query is fanned out to relevant shards
  3. Each shard computes local relevance scores (TF-IDF)
  4. Scores are normalized and merged by the coordinator
  5. Final ranked results are returned

This was very different from RDBMS query planning.

Instead of pushing logic to a central planner, Elasticsearch embraced decentralization, pushing query execution to the data and letting the system combine results in a score-aware fashion.

Even better — the JSON-based query DSL allowed expressive queries like:

{
  "query": {
    "match": {
      "title": "scalable tf idf search"
    }
  }
}

No joins. No schema gymnastics. Just search — fast, distributed, ranked.


Why It Scaled

Elasticsearch made TF-IDF scale for several reasons:

  • Segment-level isolation avoided global locking or stats gathering
  • Immutability enabled safe, concurrent reads
  • Sharding made horizontal scaling straightforward
  • Lucene optimizations like skip lists and norms compressed the index and accelerated lookup
  • Near real-time indexing made writes fast, while still keeping results fresh

And crucially, it let developers think differently.

Rather than modeling everything as normalized tables, we began thinking in documents, relevance, and eventually consistent views.


What We Learned (And Still Holds True)

  1. Search is not a database feature — it’s a different beast.
    Relevance-based retrieval is not a join or filter — it’s about scoring, ranking, and approximations.

  2. Immutability unlocks performance.
    Lucene’s segment model shows how write-once-read-many can outperform traditional mutable stores for certain workloads.

  3. Distributed doesn’t mean complex.
    Elasticsearch abstracted away much of the pain, offering a simple REST API over a deeply powerful engine.

  4. Inverted indexes beat full scans every time.
    For text-heavy datasets, nothing comes close.

  5. Score != truth, but score = relevance.
    In systems where human search is the interface (e.g., e-commerce, support), how you rank matters as much as what you return.


If You’re Curious…

  • Play with Elasticsearch 1.x or 2.x — start small, see how shards and segments work
  • Explore Lucene internals — understanding how terms, postings, and norms work pays huge dividends
  • Read Doug Cutting’s early Lucene papers for insight into its design
  • Watch the evolution toward BM25, which eventually replaced TF-IDF in later Elasticsearch versions

“Lucene didn’t just index text — it changed how we think about data.”

And in 2015, it opened the door for a whole new generation of developers to build fast, scalable, relevance-driven systems, well beyond the reach of traditional databases.

Comments