Serving search over HTTP

Run tatami serve over a directory of search segments: a concurrent lock-free broker with admission control, a per-request deadline, and a smart segment cache sized to hold the working set.

tatami serve puts an HTTP server in front of a broker so a directory of search segments answers queries over the network. It is built to handle thousands of concurrent queries on one process while keeping the latency budget and bounding the memory.

Start the server

Point it at a directory of .tatami search segments:

tatami serve ./segments --addr :8080

opening 254 shards under ./segments
serving 20246 docs across 254 shards on :8080 (cache=64, max-in-flight=256)

It globs the top-level *.tatami files in sorted order so the shard ids stay stable across restarts, builds a routing index across them, and opens a broker that keeps only a bounded working set of segments resident.

Query it

GET /search takes a query q and an optional result count k:

curl 'localhost:8080/search?q=open+source+software&k=10'

{
  "query": "open source software",
  "k": 10,
  "total": 10,
  "took_ms": 1.4,
  "stats": { "candidates": 251, "visited": 83, "threshold": 7.21 },
  "results": [
    { "doc_id": "4f1e...", "url": "https://example.com/a", "title": "An example", "snippet": "open source ...", "score": 9.83 }
  ]
}

took_ms is the wall time the broker spent, the number the ten-millisecond target is read against, and stats shows how few of the candidate shards the answer actually touched. Two more endpoints round it out: GET /healthz is a plain 200 liveness probe a load balancer polls, and GET /stats reports the broker shape and the serving counters.

curl localhost:8080/stats

{
  "shards": 254, "docs": 20246, "cache_len": 64,
  "max_in_flight": 256, "in_flight": 3,
  "total": 10432, "rejected": 0, "timed_out": 0, "canceled": 0, "failed": 0
}

The broker answers without a lock

net/http already gives one goroutine per request, and the broker is safe to call from all of them at once: it routes, prunes, and scores against a reference-counted concurrent segment cache, so two queries that touch different shards never wait on each other. A query is never serialised behind another query's work, so the tail latency under load is the cost of one query, not the sum of the queue ahead of it. A concurrent answer is identical to the single-threaded one: the same shards, the same scores, the same ranking.

A smart cache keeps the working set warm

The latency depends on one thing above all: the segments a query needs are resident. A warm query runs the posting walk and the forward-column read from memory; a cold one pays to decode a whole inverted region first. The working set is the union of the shards the queries visit during routing, which is larger than the set that produces a top result, and a cache below it thrashes. Size --cache to hold the working set:

tatami serve ./segments --cache 254

Sized to the working set, a cycled query mix that thrashed at a smaller cap runs an order of magnitude faster. The cache cap also bounds the memory: the open-file and decoded-index footprint is the cap, not the shard count, so one process serves a shard count it could never hold open at once.

Admission control and a deadline bound a burst

Goroutine-per-request is unbounded by default: a flood of clients spawns a goroutine and an in-flight query each, and the memory grows with the flood. Two limits fix a ceiling that does not move with load.

Admission control. A counting semaphore caps the queries running at once. An arrival past the cap gets a 503 immediately rather than queuing without bound, so the CPU and memory a burst can claim are bounded by --max-in-flight, not by the arrival rate. The slot is held until the search actually finishes, so the cap bounds work and not just connections.
A per-request deadline. A query that stalls on a cold-shard read returns 504 after --timeout rather than tying up a slot indefinitely. The deadline is generous next to the sub-millisecond warm path, so it fires only on a real stall.

Together with the cache cap, these bound both the per-query memory and the resident memory, so a flood degrades into sheds and timeouts rather than into unbounded growth.

Tune it

Flag	Default	When to change it
`--cache`	`64`	Raise it to hold the working set the queries touch, the single biggest latency lever
`--max-in-flight`	`256`	Lower it to protect a small box, raise it on a big one with headroom
`--timeout`	`2s`	Lower it to fail a stalled query faster, raise it if cold shards are genuinely slow
`--max-k`	`100`	Cap how large a result set a single request may force
`--default-k`	`10`	The result count when a request omits `k`

Latency under load

On a real shard split into 254 shards with the working set warm, driven at one in-flight query per core:

Class	p50	p90	p99	Throughput
Single keyword	~140 us	~340 us	~1.4 ms	over 31,000 qps
Multi-term phrase	~1.0 ms	~6 ms	~28 ms	~3,200 qps

Single-keyword serving, the class the ten-millisecond target is stated against, holds a p99 well under the budget at over thirty thousand queries per second. Multi-term phrases are a heavier multi-list traversal; their median stays well inside the budget and their tail is bounded by admission and the deadline rather than left to run away. Through five thousand concurrent queries the resident segment count holds at the cache cap, and the ranking stays exact.

Scale past one process

One Server drives one broker over the shards one machine can hold. To scale past that, run several servers behind a load balancer, or front several brokers with an aggregator so one query reaches a whole fleet of shards.

Where to go next

For the broker, the routing index, search-only segments, and the aggregator tier, see distributed serving at scale.
For every flag and endpoint, see the CLI reference.