
Schema design

MFS persists state across four stores: a metadata database for bookkeeping (it also holds the work queue), a separate transformation-cache database for memoized model outputs, the Milvus collection for searchable vectors, and an object store (the local filesystem by default) for artifact bytes. The two relational databases are SQLite locally and Postgres at scale. This page walks through each one. The vocabulary (connector, object, ConnectorJob, ObjectTask, chunk) is defined in Architecture.

The metadata database

One relational database holds everything MFS knows about your sources and the work it's doing on them. The tables:

| Table | One row per | Holds |
|---|---|---|
| connectors | registered source | root URI, type, status, the config JSON and its hash, the credential_ref. Unique on (namespace_id, root_uri). |
| objects | object under a connector | object_uri, parent_path, media type, fingerprint (its change token), indexable, search_status, chunk_count. Keyed by (connector_id, object_uri). |
| connector_jobs | ConnectorJob (one sync) | status, heartbeat, the object counts (total / succeeded / failed / cancelled), error. |
| object_tasks | ObjectTask (one object's work) | change_kind, status, priority, attempts, last_error. This is the durable queue. |
| connector_state | per-connector key/value | sync cursors and other small connector-owned state. Keyed by (connector_id, key). |
| file_state | a file in an uploaded tree | size, mtime_ns, inode, sha1, status, renamed_from. This is the upload manifest that makes re-syncs send only what changed. |
| artifact_cache | a derived blob | a pointer to the bytes (storage_path) plus a source_key; the bytes themselves live on disk. See Caching. |
| watch_grants | a granted watch path | bookkeeping for watch scopes. |
| schema_version | (single value) | the single integer that lets the server fail fast if a database predates the current schema. |

Two design points worth calling out:

  • The queue is a table. object_tasks is the work queue — workers claim rows by (priority, status), so there's no Redis or Celery to run. And connector_jobs carries two partial-unique indexes (at most one running and one preparing/queued job per connector), which is what makes a double mfs add safe at the database level.
  • It's only knowledge, never the truth. Every row here describes an upstream source or derived work; none of it is the source data itself. Drop the whole database and a re-sync rebuilds it.
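Both ideas in the first bullet can be tried out in a few lines of SQLite. The sketch below is illustrative only, not the MFS schema. The index names, the reduced column sets, the task status values ('pending', 'running'), and the rule that a lower priority number is claimed first are all assumptions made for the example.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with simplified, hypothetical tables."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/COMMIT statements below control transactions.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE connector_jobs (
            id INTEGER PRIMARY KEY,
            connector_id INTEGER NOT NULL,
            status TEXT NOT NULL
        );
        -- At most one running job per connector.
        CREATE UNIQUE INDEX one_running_job
            ON connector_jobs (connector_id) WHERE status = 'running';
        -- At most one preparing/queued job per connector.
        CREATE UNIQUE INDEX one_pending_job
            ON connector_jobs (connector_id)
            WHERE status IN ('preparing', 'queued');

        CREATE TABLE object_tasks (
            id INTEGER PRIMARY KEY,
            object_uri TEXT NOT NULL,
            priority INTEGER NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            attempts INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn


def try_create_job(conn: sqlite3.Connection, connector_id: int, status: str) -> bool:
    """Insert a job; return False if a partial-unique index rejects it."""
    try:
        conn.execute(
            "INSERT INTO connector_jobs (connector_id, status) VALUES (?, ?)",
            (connector_id, status),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def claim_next_task(conn: sqlite3.Connection):
    """Claim the pending task with the lowest priority number, or None."""
    # BEGIN IMMEDIATE takes the write lock up front, so two workers cannot
    # both select the same row before one of them marks it as running.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, object_uri FROM object_tasks "
            "WHERE status = 'pending' ORDER BY priority, id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE object_tasks SET status = 'running', "
                "attempts = attempts + 1 WHERE id = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With indexes like these, a second enqueue for the same connector fails with a constraint error rather than creating a duplicate job, so the check needs no application-level locking.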

The transformation-cache database

Model outputs — embeddings, image (VLM) descriptions, summaries — are memoized in their own database, separate from the metadata DB (a standalone SQLite file by default, or a shared Postgres). Keeping it apart is deliberate: it's high-churn, best-effort data whose loss only costs recompute, so it never sits on the metadata DB's transactional path. It has one table:

| Table | One row per | Holds |
|---|---|---|
| transformation_cache | one memoized model call | cache_key (primary key), kind (embedding / vlm / summary), input_hash, provider, model, model_version, output_bytes, hit_count, last_hit_at. |

The cache_key folds the input hash together with the provider, model, and version, so a hit is only ever returned for the exact same input and model. See Caching for how it's used.
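One way such a key could be derived is sketched below. The function names, the choice of SHA-256, and the length-prefixed encoding are assumptions for illustration; the actual derivation in MFS may differ.

```python
import hashlib


def input_hash(payload: bytes) -> str:
    """Hash the raw model input (hypothetical helper; SHA-256 is assumed)."""
    return hashlib.sha256(payload).hexdigest()


def cache_key(payload: bytes, provider: str, model: str, model_version: str) -> str:
    """Fold the input hash with provider, model, and version into one key."""
    digest = hashlib.sha256()
    for part in (input_hash(payload), provider, model, model_version):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot
        # produce the same byte stream and collide.
        digest.update(len(data).to_bytes(4, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Because every component feeds the digest, changing only the model version yields a different key, and the old entry is simply never hit again.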

The Milvus collection

All searchable chunks live in Milvus. By default they share one collection, whose name bakes in the schema version and the embedding dimension (mfs_chunks__v{N}_d{dim}), so changing the schema or swapping the embedding model targets a fresh collection instead of corrupting the current one.

| Field | Type | What it is |
|---|---|---|
| chunk_id | VARCHAR (primary key) | The chunk's content address: sha1(namespace + connector + object + chunk_kind + locator). Makes every write an idempotent upsert. |
| connector_uri | VARCHAR (partition key) | Which connector the chunk belongs to. |
| object_uri | VARCHAR | Which object it came from; the URI you feed to cat. |
| locator | JSON | Where inside the object: {"lines":[42,78]}, {"id":123}, a thread key. |
| content | VARCHAR | The text that was embedded and is BM25-analyzed for keyword search. |
| dense_vec | FLOAT_VECTOR | The embedding (semantic search). |
| sparse_vec | SPARSE_FLOAT_VECTOR | The BM25 vector, generated by Milvus from content, so writers never supply it. |
| chunk_kind | VARCHAR | body, row_text, thread_aggregate, directory_summary, … |
| metadata | JSON | The connector's metadata_fields (status, author, …) for filtering and display. |
| namespace_id | VARCHAR | The namespace, for multi-tenant isolation. |
| indexed_at | INT64 | When the chunk was written. |
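The content-addressed chunk_id is what makes re-indexing idempotent: the same chunk always maps to the same primary key. A minimal sketch of the idea follows. The field separator and the canonical JSON encoding of the locator are assumptions, since the page only states which fields go into the hash.

```python
import hashlib
import json


def chunk_id(namespace_id: str, connector_uri: str, object_uri: str,
             chunk_kind: str, locator: dict) -> str:
    """Derive a stable SHA-1 content address for a chunk (illustrative)."""
    # Serialize the locator canonically so that key order does not change
    # the resulting id (assumed; the real encoding is not documented here).
    locator_text = json.dumps(locator, sort_keys=True, separators=(",", ":"))
    # A unit-separator character keeps adjacent fields from running together.
    joined = "\x1f".join(
        [namespace_id, connector_uri, object_uri, chunk_kind, locator_text]
    )
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Hypothetical values, for illustration only.
example_id = chunk_id("default", "file:///notes", "file:///notes/todo.md",
                      "body", {"lines": [1, 20]})
```

Re-running the indexer over an unchanged object produces the same ids, so the write overwrites the existing rows instead of duplicating them.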

Indexes: dense_vec uses AUTOINDEX with cosine distance, sparse_vec a sparse-inverted BM25 index, and namespace_id / object_uri / chunk_kind get scalar inverted indexes for fast filtering (skipped on Milvus Lite, which falls back to a scan).

Connectors, namespaces, and the partition key

Three things scope a chunk, at three levels:

  • Connector is the partition key (connector_uri). Each connector's chunks sit in their own partition, so a scoped query (mfs search "…" postgres://prod) only touches that connector's partition, while --all fans across them.
  • Namespace is the tenant boundary. By default (collection_strategy = "shared") all namespaces share the one collection and are separated by the namespace_id field; set collection_strategy = "per_namespace" and each namespace gets its own collection (mfs_chunks__{namespace}__v{N}_d{dim}) for hard isolation. Today MFS runs a single default namespace with zero config; per-user / per-workspace namespaces are a v0.5+ direction.
  • Schema and model version are baked into the collection name, so an incompatible change — a new embedding dimension, a schema bump — lands in a fresh collection instead of mixing with the old one.
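The two naming patterns above can be written as one small function. This is a sketch of the naming rule only: the function name and signature are hypothetical, and the version and dimension values in the usage lines are made up.

```python
from typing import Optional


def collection_name(schema_version: int, dim: int,
                    collection_strategy: str = "shared",
                    namespace: Optional[str] = None) -> str:
    """Build a collection name from the patterns described above."""
    if collection_strategy == "shared":
        # All namespaces share one collection; namespace_id separates rows.
        return f"mfs_chunks__v{schema_version}_d{dim}"
    if collection_strategy == "per_namespace":
        if not namespace:
            raise ValueError("per_namespace requires a namespace")
        return f"mfs_chunks__{namespace}__v{schema_version}_d{dim}"
    raise ValueError(f"unknown collection_strategy: {collection_strategy!r}")


# Hypothetical schema version and embedding dimensions.
old = collection_name(3, 768)
new = collection_name(3, 1024)  # different dimension, different collection
```

Since the dimension is part of the name, vectors from two embedding models never end up in the same collection.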

How the fields map to your data

Beyond that scoping, the fields are just your data, addressed at three levels of granularity:

connector_uri   postgres://prod          ← which source
   object_uri   …/public/tickets/rows.jsonl   ← which object (file / row set / thread)
      locator   {"id": 12345}            ← which slice inside it
      content   "Login broken after SSO migration …"   ← the text that got embedded
   chunk_kind   row_text                 ← what kind of chunk it is
     metadata   {"status":"open","priority":"high"}   ← connector side-fields

A search hit hands back object_uri + locator, which is exactly what cat --locator needs to reopen the exact row, line range, or thread — the index points at your source, it doesn't replace it.
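To make the round trip concrete, here is a rough sketch of how a locator could be resolved against an object's text. It is not how cat is implemented. It assumes line ranges are 1-indexed and inclusive, and that an {"id": …} locator matches the id field of a JSONL row; the function name is hypothetical.

```python
import json


def read_slice(text: str, locator: dict) -> str:
    """Return the part of an object's text that a locator points at."""
    if "lines" in locator:
        start, end = locator["lines"]
        # Assumed convention: 1-indexed, inclusive line range.
        return "\n".join(text.splitlines()[start - 1:end])
    if "id" in locator:
        # Assumed convention: the object is JSONL and rows carry an "id".
        for line in text.splitlines():
            if line.strip() and json.loads(line).get("id") == locator["id"]:
                return line
        raise KeyError(f"no row with id {locator['id']!r}")
    raise ValueError(f"unsupported locator: {locator!r}")
```

The index stores only the pointer; the text itself is read from the source again at the moment you ask for it.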