tatami

Managing a collection

Stitch many tatami files into one logical dataset with a manifest, prune whole files before opening them, look a key up across the set, and compact the log.

A single .tatami file is a shard. A crawl is thousands of shards, and you want to treat them as one dataset: scan across all of them, look a key up without knowing which file holds it, and add or retire files over time. A tatami.manifest is the catalog that makes that work.

How the manifest works

The manifest is an append-only log in a directory next to the files. Each record adds, removes, or re-tiers a member, and carries a rollup of that file's key range and per-column zone statistics. Because the log is append-only, a write costs one fsync of the new record. A torn write at the tail is discarded on read, so a reader always sees the last consistent prefix. The rollup is what lets a query prune at the file level: a scan can rule out a whole file from its zone summary before opening it.
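To make the torn-tail rule concrete, here is a minimal sketch of an append-only log reader that keeps the last consistent prefix. The record framing (a length, a CRC32, then the payload) and the names appendRecord and readPrefix are assumptions for illustration only; the actual tatami manifest encoding is not described here and may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// appendRecord frames payload as [len uint32][crc32 uint32][payload] and
// appends it to buf. This framing is hypothetical, not tatami's real format.
func appendRecord(buf, payload []byte) []byte {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	buf = append(buf, hdr[:]...)
	return append(buf, payload...)
}

// readPrefix returns every record up to the first one that is truncated or
// fails its checksum. Everything from that point on is treated as a torn tail.
func readPrefix(buf []byte) [][]byte {
	var out [][]byte
	for len(buf) >= 8 {
		n := int(binary.LittleEndian.Uint32(buf[0:4]))
		sum := binary.LittleEndian.Uint32(buf[4:8])
		if len(buf)-8 < n {
			break // payload was cut short
		}
		payload := buf[8 : 8+n]
		if crc32.ChecksumIEEE(payload) != sum {
			break // payload is damaged
		}
		out = append(out, payload)
		buf = buf[8+n:]
	}
	return out
}

func main() {
	var buf []byte
	buf = appendRecord(buf, []byte("add shard0.tatami"))
	buf = appendRecord(buf, []byte("add shard1.tatami"))

	// Simulate a crash partway through writing a third record.
	torn := appendRecord(buf, []byte("remove shard0.tatami"))
	torn = torn[:len(torn)-5]

	for _, rec := range readPrefix(torn) {
		fmt.Println(string(rec))
	}
}
```

In this sketch the reader never needs a separate recovery step: the incomplete third record simply fails the length check, and the two complete records before it are returned.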

Add files

tatami collection add ./corpus shard0.tatami shard1.tatami shard2.tatami

This catalogs each file into the collection rooted at ./corpus. The files can live anywhere the path can reach; the manifest records how to find them and what they can be pruned on.

List members

tatami collection list ./corpus
members: 3 live
  shard0.tatami  key [https://a ... https://m]  zones: status, crawl_date
  shard1.tatami  key [https://m ... https://z]  zones: status, crawl_date
  shard2.tatami  (no sort key)                  zones: status, content_type

The listing shows each live member, its key span if it has a sort key, and the columns the manifest can prune on. A converted shard with no sort key still prunes by its zone summary.
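The zone-level pruning described above can be sketched as a min/max check per file. The zone and member types, the mayContainEq and prune helpers, and the status ranges in main are all hypothetical and chosen for illustration; they are not the tatami API or real manifest data.

```go
package main

import "fmt"

// zone is a hypothetical min/max summary of one int32 column in one file.
type zone struct{ min, max int32 }

// member is a hypothetical manifest entry: a file name plus per-column zones.
type member struct {
	name  string
	zones map[string]zone
}

// mayContainEq reports whether a file could hold a row where col == v.
// A file with no zone for col cannot be ruled out, so it must be opened.
func mayContainEq(m member, col string, v int32) bool {
	z, ok := m.zones[col]
	if !ok {
		return true
	}
	return z.min <= v && v <= z.max
}

// prune returns the names of the files that survive the zone check.
func prune(ms []member, col string, v int32) []string {
	var keep []string
	for _, m := range ms {
		if mayContainEq(m, col, v) {
			keep = append(keep, m.name)
		}
	}
	return keep
}

func main() {
	// Made-up status ranges, purely for the example.
	ms := []member{
		{"shard0.tatami", map[string]zone{"status": {200, 301}}},
		{"shard1.tatami", map[string]zone{"status": {200, 404}}},
		{"shard2.tatami", map[string]zone{"status": {500, 503}}},
	}
	keep := prune(ms, "status", 404)
	fmt.Printf("would open %d of %d files: %v\n", len(keep), len(ms), keep)
}
```

The check is conservative by design: it can only say "definitely not here" or "maybe here", so a file without a summary for the queried column is always kept.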

Scan and look up from Go

The Go API opens a collection over a directory and reuses the same predicate evaluator as a single file, lifted to the file level.

// Open the collection rooted at ./corpus.
c, err := tatami.OpenCollection("./corpus")
if err != nil {
	log.Fatal(err)
}

// Scan with a status == 404 predicate; files the manifest rules out are not opened.
scan, err := c.Scan(tatami.Eq("status", int32(404)), "url", "status")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("scanned %d of %d files\n", scan.FilesScanned, scan.FilesTotal)

// Look a key up without knowing which member holds it.
hit, ok, opened, err := c.Lookup("https://example.com/page")
if err != nil {
	log.Fatal(err)
}
if ok {
	fmt.Printf("found in %s, opened %d file(s)\n", hit.Member, opened)
}

Scan reports how many files it had to open out of the total. Skipping whole files this way is the file-level analogue of group pruning. Lookup on a set of disjoint sorted files opens exactly one file: the one whose key range contains the key.
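Why a lookup over disjoint sorted files touches a single file can be shown with a binary search over the key ranges. This is a sketch under stated assumptions: the keyRange type and candidate function are hypothetical, the ranges are assumed to be sorted and non-overlapping, and the example spans are invented rather than taken from a real manifest.

```go
package main

import (
	"fmt"
	"sort"
)

// keyRange is a hypothetical manifest rollup: the inclusive key span of one file.
type keyRange struct {
	name     string
	min, max string
}

// candidate returns the single file whose range could contain key, assuming
// rs is sorted by min and the ranges do not overlap. ok is false when no
// range covers key, in which case no file needs to be opened at all.
func candidate(rs []keyRange, key string) (name string, ok bool) {
	// Find the first range whose upper bound is not below key.
	i := sort.Search(len(rs), func(i int) bool { return rs[i].max >= key })
	if i < len(rs) && rs[i].min <= key {
		return rs[i].name, true
	}
	return "", false
}

func main() {
	rs := []keyRange{
		{"shard0.tatami", "https://a", "https://l"},
		{"shard1.tatami", "https://m", "https://z"},
	}
	if name, ok := candidate(rs, "https://example.com/page"); ok {
		fmt.Println("open only:", name)
	}
}
```

If ranges were allowed to overlap, more than one file could qualify and the lookup would have to open each of them, which is why the one-file guarantee depends on the files being disjoint.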

Merge and compact

Two operations keep a collection healthy as it grows.

The collection compact command rolls the append-only log into a fresh one containing only the live set, dropping the accumulated add-and-remove churn. It writes the new manifest to a temp file and renames it into place, so the swap is atomic.

tatami collection compact ./corpus
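The temp-file-then-rename pattern behind that atomic swap can be sketched with the Go standard library. This is an illustration, not tatami's implementation: replaceAtomically and demo are hypothetical helpers, and the manifest.log file name and its text records are invented for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceAtomically writes data to a temp file in the same directory as path,
// syncs it to disk, then renames it over path. Keeping the temp file in the
// same directory keeps the rename on one filesystem.
func replaceAtomically(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), "manifest-*.tmp")
	if err != nil {
		return err
	}
	// Clean up on failure; after a successful rename this fails harmlessly.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo compacts a toy log in a scratch directory and returns the result.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "corpus")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "manifest.log") // hypothetical file name
	churn := []byte("add a\nremove a\nadd b\n")
	if err := os.WriteFile(path, churn, 0o644); err != nil {
		return "", err
	}
	// Only "b" is still live, so the compacted log holds one record.
	if err := replaceAtomically(path, []byte("add b\n")); err != nil {
		return "", err
	}
	out, err := os.ReadFile(path)
	return string(out), err
}

func main() {
	out, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

On POSIX systems a rename within one filesystem replaces the target atomically, so a concurrent reader sees either the old log or the new one. Behavior on other platforms can differ, and a production version would likely also sync the parent directory so the rename itself survives a crash.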

A data-level Merge (from the Go API) decodes several member files and re-encodes them into one, then swaps the inputs out and the output in as a single manifest batch, so a reader never sees a half-merged state. This is how small files fold into larger ones over time.

Where to go next