Robustness

Good retrieval quality is the easy part. The hard part of a context layer is running it every day: a sync killed halfway, two processes touching the same store, an index cleared down to a stub, one bad file taking out a whole run. Those are state, lifecycle, and concurrency problems — orthogonal to how good your search is — and they're where most "thin search engine" tools quietly break under real traffic.

MFS is engineered so these are structural non-issues rather than bugs you hit and patch one by one. It rests on two ideas from the Design philosophy — the upstream source is the truth, and every operation is idempotent — and almost everything below follows from them.

| A concrete situation | What MFS does |
| --- | --- |
| Indexing is killed at file 8,000 of 20,000 | Work is per object and committed as it goes; the next mfs add picks up around file 8,001, not from scratch. |
| Your embedding provider runs out of quota halfway and the rest fail | The failed objects are recorded and the cursor isn't moved past them; top up the quota and re-run mfs add, which finishes only what's left. |
| You start searching seconds after mfs add, before indexing finishes | Indexing is async and incremental: a chunk is searchable the moment its object lands, and the most useful files are done first, so partial answers are useful right away. |
| You cancel a sync (or remove the connector) with thousands of tasks still queued | Cancel and remove also cancel the queued tasks, not just the stored rows, so the queue doesn't keep grinding through work you stopped. |
| You run the same mfs add twice, at once or back to back | At most one sync runs per connector (plus one queued), and every write is idempotent, so nothing is processed or stored twice. |
| Some other tool's background watcher locks the local vector database | Only the server ever opens the vector store; every client goes over HTTP, so there's no file for a second process to lock. |
| You re-index just the 50 files that changed | Deletions happen only on a full-set scan; a partial run never removes the 19,950 files it didn't list. |
| One file with a broken encoding throws | Each object succeeds or fails on its own; the bad one is marked failed and skipped, the rest finish. |
| You switch git branches back and forth | Unchanged content hits the cache by content hash, so nothing is re-embedded; only genuinely changed files cost anything. |
| You rename a directory, so every path under it changes | The chunks are re-keyed and their vectors reused: no re-embedding, even though every path moved. |
| An ignore rule or API key set for one source leaks into another | Each connector's config is isolated rows in the database, not shared process state. |
| You switch the embedding model | The index is derived: mfs add --force-index rebuilds it from source; the config lives with the connector. |

The rest of this page is how.

One server owns the state

MFS is a thin client over one stateful server. The server alone holds the connection to Milvus and owns all state; every client — CLI, SDK, agent skill — talks to it over HTTP. So multiple CLIs or agent sessions never contend over a database file or leak state into one another. The classic embedded-database failure — one process locks the vector file and another can't even open it — has nowhere to occur, because only the server ever touches it.

 other tools — each process opens the store itself:
   CLI     ─┐
   watcher ─┼─► milvus.db      ✗ a second opener hits a lock
   search  ─┘

 MFS — only the server opens the store:
   CLI ──┐
   SDK ──┼── HTTP /v1 ──► mfs-server ──► Milvus
   skill ┘                 (sole owner)

A ConnectorJob has a real lifecycle

A sync runs as a ConnectorJob in a database-backed queue, not as fire-and-forget work. A uniqueness rule keeps at most one running and one queued sync per connector, so hitting mfs add twice doesn't double-scan or collide — the second is told the sync is already running. Sync is explicit: you trigger it. There's no hidden background scheduler racing itself or re-scanning behind your back. Cancelling a job — or removing the connector — also cancels its queued ObjectTasks, so a stopped job actually stops instead of draining a queue you no longer want.

 mfs add ──► ConnectorJob (running)      ← at most 1 running + 1 queued per connector
                 │ enumerate
   ObjectTask queue:  [t1][t2][t3] … [tN]
                 │ workers pull one at a time
   cancel / remove ──► remaining tasks → cancelled     (the queue actually empties)
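
One way to picture the uniqueness rule is as a partial unique index on the job table. The sketch below is illustrative only and uses SQLite from the Python standard library; the table, column, and function names (`jobs`, `submit`, `cancel`) are hypothetical and not MFS's actual schema or API. It also assumes the second request takes the queued slot and a third is turned away, which is one reading of "one running and one queued".

```python
import sqlite3


def open_queue() -> sqlite3.Connection:
    """Create a toy job table (hypothetical schema, not MFS's)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, connector TEXT, state TEXT)"
    )
    # Partial unique index: per connector, at most one 'running' row and
    # at most one 'queued' row. Finished or cancelled rows are not constrained.
    db.execute(
        "CREATE UNIQUE INDEX one_active_per_state ON jobs (connector, state) "
        "WHERE state IN ('running', 'queued')"
    )
    return db


def submit(db: sqlite3.Connection, connector: str) -> str:
    """Try the running slot, then the queued slot; otherwise reject."""
    for state in ("running", "queued"):
        try:
            with db:  # commits on success, rolls back on error
                db.execute(
                    "INSERT INTO jobs (connector, state) VALUES (?, ?)",
                    (connector, state),
                )
            return state
        except sqlite3.IntegrityError:
            continue  # slot already taken, try the next one
    return "rejected"


def cancel(db: sqlite3.Connection, connector: str) -> int:
    """Cancel every active job for a connector; return how many rows changed."""
    with db:
        cur = db.execute(
            "UPDATE jobs SET state = 'cancelled' "
            "WHERE connector = ? AND state IN ('running', 'queued')",
            (connector,),
        )
    return cur.rowcount
```

Because the database enforces the constraint, two clients racing to submit the same sync cannot both win the same slot, and cancelling frees both slots at once.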

Crash-safe and resumable

Work is per-object and atomic, and each object's result is committed as it finishes. If the process is killed mid-sync, or the embedding provider runs out of quota and the remaining objects fail, nothing is left half-committed: the failed and unfinished objects keep their state, and the next mfs add picks up only what's left.

Recovery collapses to "just run it again", because every write is idempotent. A chunk's primary key is its content address — sha1(namespace + connector + object + chunk kind + locator) — so writing it is a delete-then-insert that any worker, retry, or concurrent run produces identically: a re-run overwrites instead of duplicating, and two sources can never collide on the same key. That's why there's no job retry command, no resume-cursor state machine, no "how far did I get" bookkeeping — crashed means re-run, same result.

 committed as they finish              crash / quota runs out
 [t1✓][t2✓] … [t8000✓]            ╳   [t8001][t8002] … [tN]
                                       └── next `mfs add` resumes here

 re-running any task overwrites by content-addressed id — it never duplicates
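
A minimal sketch of why a content-addressed key makes writes idempotent. The five key fields come from the description above; how they are joined before hashing is an assumption (a NUL separator is used here so adjacent fields cannot blur into each other), and the dict-backed `store` is a hypothetical stand-in for the vector store.

```python
import hashlib


def chunk_id(namespace: str, connector: str, obj: str,
             chunk_kind: str, locator: str) -> str:
    """Derive a deterministic key from the chunk's identity fields."""
    # Assumption: fields are joined with NUL; the real encoding may differ.
    parts = (namespace, connector, obj, chunk_kind, locator)
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


def write_chunk(store: dict, key: str, payload: dict) -> None:
    """Delete-then-insert: repeating the write leaves exactly one row."""
    store.pop(key, None)   # delete any earlier version
    store[key] = payload   # insert the current one
```

Any worker, retry, or re-run that handles the same chunk computes the same key, so the second write replaces the first instead of adding a duplicate. Including the connector in the key keeps two sources with identical paths from colliding.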

Incremental, and careful about deletions

Re-syncing is incremental: per-object fingerprints (and connector cursors) mean only what actually changed is re-converted and re-embedded; everything unchanged is served from cache. Deletion is deliberately conservative — an incremental run never infers deletions. Only a full-set scan, which has truly enumerated the whole source, diffs it and removes what's genuinely gone. A partial or batched ingest therefore can't accidentally delete the records it didn't list this time.

 incremental sync   source yields only what changed ─► update those
                    (nothing yielded = "unchanged", NOT "deleted")

 full-set scan      enumerate everything ─► diff vs index ─► drop the missing
                    └─ the only path that ever removes anything
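
The two paths can be sketched as two small functions over a dict-backed index. This is an illustration of the rule, not MFS code; `incremental_sync` and `full_scan` are hypothetical names, and the dicts map object paths to fingerprints.

```python
def incremental_sync(index: dict, changed: dict) -> None:
    """Apply only what the source reported as changed.

    Objects missing from `changed` are treated as unchanged, never as deleted.
    """
    index.update(changed)


def full_scan(index: dict, source: dict) -> set:
    """Diff a complete enumeration against the index and drop what is gone.

    This is the only function in the sketch that removes anything.
    """
    gone = set(index) - set(source)
    for key in gone:
        del index[key]
    index.update(source)
    return gone
```

Re-indexing 50 changed files goes through the first function, so the other entries are untouched no matter how short the input list is.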

Progressive availability, important things first

Indexing doesn't block search. mfs add queues the work and returns; objects are processed in the background, and each chunk becomes searchable the moment its object lands — so you can start searching while a large sync is still running, and ls --json shows which objects are indexed, partial, or not_indexed yet.

The order isn't arbitrary. The file connector ranks objects so the highest-signal ones go first — entrypoints like README.md and CLAUDE.md, then core source under src/ / lib/, ahead of tests and generated output. The useful answers tend to be searchable early, long before the last file is done.

 priority:  README ▸ src/ ▸ lib/ ▸ tests/ ▸ dist/     (high-signal first)
 queue:     [█████████ high ──────────────────► low ]
                 │ workers embed
            Index ──grows──►  search hits whatever has landed already
            (ls --json shows each object: indexed / partial / not_indexed)
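
A rough sketch of a high-signal-first ordering, assuming a simple tier per top-level directory. The tier numbers, the middle tier for everything else, and the function names are assumptions made for illustration; the file connector's real ranking may weigh more signals.

```python
from pathlib import PurePosixPath

ENTRYPOINTS = {"README.md", "CLAUDE.md"}
# Lower number = processed earlier. Unlisted directories fall in tier 2.
DIR_TIERS = {"src": 1, "lib": 1, "tests": 3, "dist": 4}


def priority(path: str) -> int:
    """Return the processing tier for a repository-relative path."""
    p = PurePosixPath(path)
    if p.name in ENTRYPOINTS:
        return 0
    top = p.parts[0] if len(p.parts) > 1 else ""
    return DIR_TIERS.get(top, 2)


def order(paths: list) -> list:
    """Sort by tier, then by path so the order is deterministic."""
    return sorted(paths, key=lambda p: (priority(p), p))
```

Workers that pull from the front of this ordering embed the README and core source before tests and build output, which is what makes early partial results useful.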

Reuse instead of recompute

The expensive steps — uploading bytes and calling models — are guarded by two independent caches, so common edits cost almost nothing:

  • Bandwidth. In a client/server setup the file connector keeps a per-path manifest (file_state), so a re-sync uploads only the bytes that changed. A renamed or moved file is matched to its existing entry, so moving a 1 GB file uploads nothing.
  • Model spend. The transformation cache memoizes every embedding, VLM, and summary call, keyed by the content hash and the model. Identical content plus identical model is a hit — across objects, across connectors, even across a collection rebuild or a model rollback.

Two everyday cases fall straight out of this. Switching git branches changes a few files' content but not the rest, so only the genuinely changed files are re-embedded. Renaming a directory moves every path beneath it, but the content is identical — MFS re-keys the affected chunks and reuses their vectors rather than re-embedding the whole subtree the way a path-keyed index would.

 rename dir/old → dir/new    paths change, bytes identical
     chunks re-keyed to new path  +  vectors REUSED        → 0 re-embed

 git switch branch           most files identical
     content-hash hit in the transformation cache          → 0 re-embed
     (only the genuinely changed files cost anything)
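
The model-spend cache can be pictured as memoization keyed by content hash and model name. The class below is a minimal sketch under that assumption; `TransformCache`, the choice of SHA-256, and the `embed` callable are hypothetical stand-ins for whatever MFS uses internally.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


class TransformCache:
    """Memoize a model call by (content hash, model). Paths are not in the key."""

    def __init__(self, embed: Callable[[bytes, str], List[float]]):
        self._embed = embed
        self._store: Dict[Tuple[str, str], List[float]] = {}
        self.calls = 0  # how many times the model was actually invoked

    def get(self, content: bytes, model: str) -> List[float]:
        key = (hashlib.sha256(content).hexdigest(), model)
        if key not in self._store:
            self.calls += 1
            self._store[key] = self._embed(content, model)
        return self._store[key]
```

Since the path never enters the key, a renamed directory or a branch switch that leaves bytes unchanged produces only cache hits. Changing the model produces a new key, and switching back hits the old entries again.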

Failures stay contained

A worker pool drains jobs with timeouts and a circuit breaker. A single malformed object is marked failed and skipped rather than crashing the run; the breaker aborts a job only when failures pile up, surfaced as a clear code instead of a silent hang. A job that gets stuck times out and resets instead of wedging the queue forever.

 [t1✓][t2 ✗ failed + skipped][t3✓][t4✓] …    ← one bad object, the rest finish
        many ✗ in a row ─► circuit breaker ─► job aborts with a clear code
        stuck task ─► heartbeat timeout ─► reset
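
A simplified sketch of per-object failure containment with a consecutive-failure breaker. The threshold, the status strings, and `run_job` itself are assumptions for illustration; the heartbeat timeout is left out to keep the example short.

```python
from typing import Callable, Dict, Iterable


def run_job(tasks: Iterable[str], process: Callable[[str], None],
            max_consecutive_failures: int = 3) -> Dict[str, object]:
    """Process tasks one by one; a bad task is recorded and skipped.

    The job aborts only when failures pile up back to back.
    """
    done, failed = [], []
    streak = 0  # consecutive failures so far
    for task in tasks:
        try:
            process(task)
        except Exception:
            failed.append(task)
            streak += 1
            if streak >= max_consecutive_failures:
                # Breaker trips: stop with an explicit status, not a hang.
                return {"status": "aborted", "done": done, "failed": failed}
            continue
        done.append(task)
        streak = 0  # any success resets the breaker
    return {"status": "completed", "done": done, "failed": failed}
```

A single file with a broken encoding lands in `failed` while the rest complete; a dead provider trips the breaker after a few tasks instead of failing thousands of them one by one.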

Isolation between sources

Per-connector configuration is stored as data in the database, not as mutable state shared inside a long-lived process. Two projects or sources can't leak settings — ignore rules, file extensions, credentials — into each other.

 connector A ─ its config rows ─┐
 connector B ─ its config rows ─┼─► Metadata DB    (each keeps its own rows)
                                no shared in-process state
   → A's ignore rules / keys never bleed into B
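
As a small illustration of config-as-rows, the sketch below stores each setting under a (connector, key) primary key in SQLite. The `connector_config` table and the helper names are hypothetical, not MFS's real metadata schema.

```python
import json
import sqlite3


def open_meta() -> sqlite3.Connection:
    """Create a toy metadata table keyed by (connector, key)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE connector_config ("
        "connector TEXT, key TEXT, value TEXT, "
        "PRIMARY KEY (connector, key))"
    )
    return db


def set_config(db: sqlite3.Connection, connector: str, key: str, value) -> None:
    """Upsert one setting for one connector."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO connector_config VALUES (?, ?, ?)",
            (connector, key, json.dumps(value)),
        )


def get_config(db: sqlite3.Connection, connector: str) -> dict:
    """Read only the rows that belong to this connector."""
    rows = db.execute(
        "SELECT key, value FROM connector_config WHERE connector = ?",
        (connector,),
    )
    return {k: json.loads(v) for k, v in rows}
```

Every read is scoped by connector, so there is no shared object in memory that one source could mutate on behalf of another.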

These properties are what let the same MFS move from a laptop to production by configuration alone: point the vector backend at a managed Milvus/Zilliz cluster and metadata at Postgres, and the same crash-safe, concurrency-safe, idempotent design handles large corpora and real traffic. See Architecture for the components and Deployment for the topologies.