Caching
The two expensive parts of ingest are converting a source (turning a PDF or Office doc into text) and calling models for embeddings, image (VLM) descriptions, and summaries. MFS keeps two caches so it almost never pays for the same work twice. They sit at different points and are keyed differently, so it's worth knowing which holds what.
| | Artifact cache | Transformation cache |
|---|---|---|
| Holds | converted object bytes: Markdown from a PDF, a structured head preview | model-call outputs: embeddings, image (VLM) descriptions, summaries |
| Keyed by | object + kind | content hash + model |
| Reused | per object, to serve reads | across objects, connectors, and namespaces |
| Bytes live in | the filesystem / object store (a row in artifact_cache points to them) | its own table (transformation_cache) |
| If you lose it | re-converted on next read | recomputed on next sync |
Neither is a source of truth — both are derived, and losing either costs recompute, not correctness.
Artifact cache
When MFS converts an object (a PDF or Office doc → Markdown) or grabs a head
preview of a structured object, it stores the result as an artifact so the
next cat / head / tail reads the converted bytes instead of going back to
the connector and converting again. The kinds in use today are converted_md
(converted document text) and head_cache (a structured-object preview). An
image's VLM description is not an artifact — it's a model output, so it lives in
the transformation cache below.
The bytes live on the filesystem (or object store); the
artifact_cache table is just the index —
keyed by (namespace_id, object_uri, artifact_kind), with a storage_path and a
source_key. That source_key is a freshness token combining the source content
with the converter's identity: on the next sync, if it still matches, the artifact
is reused as-is; if the file changed or the converter was upgraded, it's rebuilt.
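The freshness check described above can be sketched roughly as follows. This is a minimal illustration, not MFS's actual code: the `ArtifactRow` class, the `make_source_key` and `lookup_artifact` helpers, the dictionary standing in for the artifact_cache table, and the choice of SHA-256 to combine the two inputs are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (namespace_id, object_uri, artifact_kind), mirroring the index key described above
IndexKey = Tuple[str, str, str]


@dataclass
class ArtifactRow:
    storage_path: str  # where the converted bytes live
    source_key: str    # freshness token stored when the artifact was built


def make_source_key(content_hash: str, converter_id: str) -> str:
    """Hypothetical token: changes if the source content or the converter changes."""
    return hashlib.sha256(f"{content_hash}\x00{converter_id}".encode("utf-8")).hexdigest()


def lookup_artifact(
    index: Dict[IndexKey, ArtifactRow], key: IndexKey, current_source_key: str
) -> Optional[str]:
    """Return the stored path if the artifact is still fresh, else None (rebuild needed)."""
    row = index.get(key)
    if row is not None and row.source_key == current_source_key:
        return row.storage_path
    return None
```

With this shape, an unchanged file and converter return the stored path, while a new content hash or a new converter identity returns `None`, which is the signal to convert again and overwrite the row.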
So the artifact cache is per object: it's what makes browsing a big PDF or a converted doc fast, and it's the shared input the Job Lane folds when it summarizes a directory — the children are read from here, not re-converted.
Transformation cache
Embeddings, image (VLM) descriptions, and summaries are the calls that cost real money and latency. The transformation cache memoizes each one, keyed by content and model rather than by object, so identical input never gets sent to a provider twice. (This is where an image's description lives — it's a model output, not a converted artifact.)
The key is cache_key = sha1(input_hash + kind + provider + model + version +
config), where input_hash is the hash of the raw input text or bytes. Because
the model identity is part of the key, a hit is valid wherever and whenever the
same content shows up: in a different object, under a different connector, after
the Milvus collection is rebuilt, or after the embedding model is rolled back to
a prior version (rows computed under that version still match).
Each row also records last_hit_at, so the least recently used rows can be evicted first (LRU).
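As a rough sketch of how such a key and LRU eviction could fit together, assuming details the text does not specify: the field separator, the JSON encoding of config, the use of SHA-1 for input_hash, and the in-memory `TransformationCache` class (a stand-in for the transformation_cache table) are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def hash_input(data: bytes) -> str:
    """Hash of the raw input text or bytes (SHA-1 is an assumption here)."""
    return hashlib.sha1(data).hexdigest()


def cache_key(input_hash: str, kind: str, provider: str, model: str,
              version: str, config: Dict[str, Any]) -> str:
    # Join with a separator so adjacent fields cannot run together; the exact
    # serialisation MFS uses is not given in the text, so this is illustrative.
    parts = [input_hash, kind, provider, model, version,
             json.dumps(config, sort_keys=True)]
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()


class TransformationCache:
    """Toy in-memory cache with last_hit_at-based LRU eviction."""

    def __init__(self, max_rows: int) -> None:
        self.max_rows = max_rows
        self.rows: Dict[str, Dict[str, Any]] = {}

    def get(self, key: str, now: float) -> Optional[Any]:
        row = self.rows.get(key)
        if row is None:
            return None
        row["last_hit_at"] = now  # record the hit so eviction can see it
        return row["output"]

    def put(self, key: str, output: Any, now: float) -> None:
        self.rows[key] = {"output": output, "last_hit_at": now}
        while len(self.rows) > self.max_rows:
            # Evict the row whose last hit is oldest
            oldest = min(self.rows, key=lambda k: self.rows[k]["last_hit_at"])
            del self.rows[oldest]
```

Note that the object URI never enters `cache_key`, so the same bytes reached through two different connectors produce the same key, while changing only the model produces a different one.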
Two more properties matter:
- Single-flight. If several tasks miss on the same key at once, a lock lets exactly one of them make the provider call; the rest wait and reuse the result, instead of all firing the same expensive request.
- Cross-lane. The Object Lane and the Job Lane share this cache, so the Job Lane summarizing an image draws on the same VLM result the Object Lane already computed — never a second call for the same input.
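The single-flight behaviour can be illustrated with a small thread-based sketch. It assumes an in-process cache guarded by per-key locks; the `SingleFlightCache` class and its `get_or_compute` method are hypothetical names, and how MFS actually takes its lock (for example, in the database) is not specified in the text.

```python
import threading
from typing import Any, Callable, Dict


class SingleFlightCache:
    """Toy cache: concurrent misses on one key trigger a single compute call."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        with self._guard:
            if key in self._results:
                return self._results[key]  # plain cache hit
            lock = self._key_locks.setdefault(key, threading.Lock())

        with lock:  # only one caller per key gets past this point at a time
            with self._guard:
                if key in self._results:
                    # Another caller finished while we waited; reuse its result
                    return self._results[key]
            value = compute()  # the expensive provider call
            with self._guard:
                self._results[key] = value
            return value
```

Because every caller goes through the same `get_or_compute`, two lanes sharing one instance would also share results, which is the cross-lane property in miniature. If `compute` raises, the per-key lock is released and the next waiter retries.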
This is what makes the everyday cases in Robustness cheap: switching git branches re-embeds only the files whose content actually changed, and rebuilding the index from scratch still hits the cache for every chunk whose content and model are unchanged.
For the bandwidth side of the story — uploading only changed bytes — see the
file_state manifest in Schema design.