Local files (file)¶
The file connector indexes a local directory tree — a code repo, a docs
folder, a dump of PDFs. It's the connector you'll reach for first, and the only
one with no credentials and no optional dependencies: it's always available.
You point it at a directory and every file underneath becomes a searchable,
browsable object at a stable URI. Re-run mfs add to re-sync after the files
change.
How MFS sees it¶
Every file keeps its real name and extension under the connector root. On the
same host, the server identity for a path is file://local/abs/path; an uploaded
tree (see upload mode) is keyed by the client id instead:
file://<client_id><abs-root>/....
file://local/home/alice/project/
├── README.md          document   → converted, embedded, searchable
├── src/
│   └── engine.py      code       → embedded, searchable
├── docs/spec.pdf      document   → converted, embedded, searchable
├── data/users.csv     text_blob  → browse + grep, NOT semantically searchable
└── build/app.bin      binary     → browse / export only
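As a rough illustration of the two URI shapes described above, the following Python sketch builds a URI from an absolute path. The helper name `object_uri` and the example client id `laptop-01` are hypothetical, and the sketch assumes no percent-encoding is applied to the path; the real connector may differ.

```python
from pathlib import PurePosixPath
from typing import Optional


def object_uri(abs_path: str, client_id: Optional[str] = None) -> str:
    """Build a file:// URI in the two shapes described in the text.

    Hypothetical helper: same-host paths use the literal authority "local",
    uploaded trees use the client id in its place.
    """
    path = PurePosixPath(abs_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {abs_path!r}")
    authority = client_id if client_id is not None else "local"
    # The absolute path already starts with "/", so it follows the authority directly.
    return f"file://{authority}{path}"


print(object_uri("/home/alice/project/src/engine.py"))
# file://local/home/alice/project/src/engine.py
print(object_uri("/home/alice/project/README.md", client_id="laptop-01"))
# file://laptop-01/home/alice/project/README.md
```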
What gets indexed¶
The connector classifies each file by extension, and the class decides whether it becomes part of semantic search:
| Class | Extensions | In semantic search? | You can still… |
|---|---|---|---|
| code | .py .js .ts .tsx .go .rs .java .c .cpp .rb .php .sql .sh … | Yes, embedded as code chunks | cat, grep |
| document | .md .txt .rst .pdf .docx .pptx .xlsx .html | Yes, converted to text, then embedded | cat, grep |
| image | .png .jpg .gif .webp .svg … | Only when VLM descriptions are enabled | export |
| text_blob | .json .csv .tsv .yaml .toml .ini .log .jsonl .ndjson | No, not embedded | cat, grep |
| binary | everything else | No | cat (raw), export |
The distinction that surprises people: structured text like .json, .csv, and
.yaml is browseable and greppable but not in semantic search. mfs search
won't rank a CSV row; mfs grep and mfs cat will still find and read it. If you
want a CSV to be semantically searchable, ingest it through a database connector
instead, where rows become first-class records.
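The classification rule above can be sketched in a few lines of Python. This is only an illustration: the names `classify` and `in_semantic_search` are hypothetical, the extension lists contain only the entries the table spells out (the table itself truncates some of them), and case-insensitive matching is an assumption.

```python
from pathlib import PurePosixPath

# Only the extensions listed explicitly in the table; the real lists are longer.
EXTENSIONS = {
    "code": ".py .js .ts .tsx .go .rs .java .c .cpp .rb .php .sql .sh",
    "document": ".md .txt .rst .pdf .docx .pptx .xlsx .html",
    "image": ".png .jpg .gif .webp .svg",
    "text_blob": ".json .csv .tsv .yaml .toml .ini .log .jsonl .ndjson",
}
CLASS_BY_EXTENSION = {
    ext: file_class for file_class, exts in EXTENSIONS.items() for ext in exts.split()
}


def classify(filename: str) -> str:
    """Return the class for a file; anything unrecognized is treated as binary."""
    suffix = PurePosixPath(filename).suffix.lower()  # assumption: case-insensitive
    return CLASS_BY_EXTENSION.get(suffix, "binary")


def in_semantic_search(file_class: str, vlm_enabled: bool = False) -> bool:
    """Mirror the 'In semantic search?' column of the table."""
    if file_class in ("code", "document"):
        return True
    if file_class == "image":
        return vlm_enabled
    return False


for name in ("README.md", "src/engine.py", "data/users.csv", "build/app.bin"):
    file_class = classify(name)
    print(name, file_class, in_semantic_search(file_class))
```

Under these assumptions, `data/users.csv` comes out as `text_blob` and is excluded from semantic search, which matches the behavior described above.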
Configuration¶
The simple case needs no TOML at all: point mfs add <path> at the directory.
Optional TOML tunes the edges — cap file size, or mark a subtree browse-only:
max_file_bytes = 5_000_000
[[objects]]
match = "/data/exports"
indexable = false # stays browseable, never embedded
indexable = false keeps an object in the tree for ls/cat but skips the
embedding step.
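To make the effect of an `[[objects]]` rule concrete, here is a minimal Python sketch of how such a rule might be resolved for a given path. It assumes that `match` is a path prefix relative to the connector root and that the first matching rule wins; neither is confirmed by the text above, and `is_indexable` is a hypothetical name. The `config` dict is the shape a TOML parser would return for the fragment shown.

```python
from pathlib import PurePosixPath

# Parsed form of the TOML fragment above.
config = {
    "max_file_bytes": 5_000_000,
    "objects": [{"match": "/data/exports", "indexable": False}],
}


def is_indexable(rel_path: str, config: dict) -> bool:
    """Return False when a rule with indexable = false covers rel_path.

    Assumptions: `match` is a path prefix under the connector root, and the
    first matching rule decides. Paths with no matching rule stay indexable.
    """
    path = PurePosixPath(rel_path)
    for rule in config.get("objects", []):
        prefix = PurePosixPath(rule["match"])
        # Match the directory itself or anything nested beneath it.
        if path == prefix or prefix in path.parents:
            return rule.get("indexable", True)
    return True


print(is_indexable("/data/exports/2024/dump.md", config))  # False
print(is_indexable("/docs/spec.pdf", config))               # True
```

Comparing whole path components rather than raw strings keeps a sibling such as `/data/exports-old` from being caught by the `/data/exports` rule.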
What's left out of the tree¶
The connector honors ignore rules so you don't index noise:
- Built-in defaults drop node_modules/, .git/, __pycache__/, *.pyc, and similar generated paths.
- A .gitignore or .mfsignore at the root extends the ignore set.
- Files over max_file_bytes are skipped.
Symlinks are resolved inside the connector root; any path that escapes the root
(../secret) is rejected outright.
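The ignore rules and the root-escape check can be sketched as follows. This is a simplified illustration, not the connector's implementation: `is_ignored` and `resolve_inside_root` are hypothetical names, the pattern matching covers only directory names and filename globs rather than full .gitignore semantics, and the default list contains only the entries named above.

```python
import fnmatch
from pathlib import Path, PurePosixPath

# Only the defaults named in the text; the built-in list is longer.
DEFAULT_IGNORES = ("node_modules/", ".git/", "__pycache__/", "*.pyc")


def is_ignored(rel_path: str, patterns=DEFAULT_IGNORES) -> bool:
    """Simplified matcher: 'name/' ignores a directory, anything else globs the filename."""
    parts = PurePosixPath(rel_path).parts
    if not parts:
        return False
    directories, filename = parts[:-1], parts[-1]
    for pattern in patterns:
        if pattern.endswith("/"):
            if pattern.rstrip("/") in directories:
                return True
        elif fnmatch.fnmatchcase(filename, pattern):
            return True
    return False


def resolve_inside_root(root: Path, candidate: str) -> Path:
    """Resolve symlinks and '..' segments, then reject anything outside root."""
    root = root.resolve()
    resolved = (root / candidate).resolve()
    if resolved != root and root not in resolved.parents:
        raise ValueError(f"{candidate!r} escapes the connector root")
    return resolved


print(is_ignored("node_modules/left-pad/index.js"))  # True
print(is_ignored("src/engine.py"))                   # False
```

Resolving before comparing matters: a purely textual prefix check would miss a symlink inside the root that points outside it.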
Sync and freshness¶
Re-running mfs add <path> re-syncs. The connector has no remote cursor — it does
a full scan and diffs against indexed state, so files added, changed, and
deleted are all reflected on the next sync (delete_detection = full_scan).
Unchanged files aren't re-embedded; the transformation cache already holds their
vectors. grep runs as a pushdown directly over the files rather than over the
index, so it works even before indexing finishes.
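The full-scan diff described above can be pictured with a short Python sketch. The text does not say how the connector detects a changed file, so comparing SHA-256 content hashes is an assumption here, and `scan` and `diff` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def scan(root: Path) -> Dict[str, str]:
    """Map each file's root-relative path to a SHA-256 hash of its contents."""
    state = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            state[path.relative_to(root).as_posix()] = digest
    return state


def diff(
    indexed: Dict[str, str], scanned: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Compare the previous indexed state with a fresh scan.

    Returns (added, changed, deleted). Paths present in both with equal
    hashes appear in none of the lists, so they would not be re-embedded.
    """
    added = sorted(scanned.keys() - indexed.keys())
    deleted = sorted(indexed.keys() - scanned.keys())
    changed = sorted(
        p for p in scanned.keys() & indexed.keys() if scanned[p] != indexed[p]
    )
    return added, changed, deleted


indexed = {"README.md": "h1", "src/engine.py": "h2", "old.txt": "h3"}
scanned = {"README.md": "h1", "src/engine.py": "h9", "new.rs": "h4"}
print(diff(indexed, scanned))
# (['new.rs'], ['src/engine.py'], ['old.txt'])
```

Because deletions fall out of the set difference between the two states, no remote cursor or change feed is needed, at the cost of walking the whole tree on every sync.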