Design Philosophy
MFS is built as infrastructure for agents: a thin file-search layer over ordinary workspaces, not a replacement filesystem and not a closed knowledge product.
The design can be reduced to five principles.
1. Search and Browse Are Both Necessary
MFS gives agents two complementary ways to get information:
- flat global search with mfs search and mfs grep
- progressive local browsing with mfs ls, mfs tree, and mfs cat
They do different jobs. Search answers "where might the answer be?" Browse answers "what is around this result, and what should I read next?"
This is close to how people use the web: first use a prebuilt index to find candidate pages, then read previews, then open the specific page and navigate inside it. The index matters because at large scale a blind browse is too slow. The browse step matters because one search result rarely contains all the local context an agent needs.
It is also close to how people use a library: search the catalog first, inspect the right shelf next, then read the exact pages.
MFS keeps both paths available:
mfs search "how does token revocation work" ./docs --mode hybrid
mfs grep "ERR_TOKEN_EXPIRED" ./docs
mfs tree --peek -L 2 ./docs
mfs cat --skim ./docs/auth.md
mfs cat -n 80:140 ./docs/auth.md
MFS searches body chunks directly, not only summaries. This preserves exact anchors such as error codes, function names, config keys, feature flags, table labels, and transcript phrases. Optional generated summaries add another retrieval surface, but they do not replace the original chunks.
2. Browsing Needs a "Look Once" Layer
Traditional shell tools leave a gap:
- cat is too detailed and can waste tokens on whole files
- ls and tree show names but too little content
- grep shows local matches but not the overall shape
Agents need a fast way to look at a file or directory and understand its rough structure before deciding where to spend context.
Most useful local data already has a tree shape:
| Data | Natural structure |
|---|---|
| Markdown | headings and paragraphs |
| Code | modules, classes, functions, symbols |
| JSON | nested keys and values |
| JSONL | rows, then keys inside each row |
| CSV | headers, rows, cells |
| Directory | child files and subdirectories |
To "look once", MFS renders the trunk before the leaves. It exposes the same
information-density controls across cat, ls, and tree:
| Control | Meaning |
|---|---|
| -W | width: characters per node, value, paragraph, or summary |
| -H | height: number of headings, rows, entries, or children |
| -D | depth: nested levels to expand |
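As a rough illustration of how width, height, and depth limits can shape a "look once" view, the sketch below renders nested JSON-like data as a clipped outline. The function names `clip` and `outline` and the output layout are hypothetical and chosen for this example; they are not MFS's actual renderer.

```python
import json


def clip(text, width):
    """Cap a string at `width` characters, marking the cut with '...'."""
    if len(text) <= width:
        return text
    return text[: max(width - 3, 0)] + "..."


def outline(value, width=40, height=3, depth=2, indent=0):
    """Render nested JSON-like data as indented outline lines.

    width  caps characters per scalar value,
    height caps children shown per container,
    depth  caps how many nested levels are expanded.
    (Illustrative only; the names and layout are assumptions.)
    """
    pad = "  " * indent
    if isinstance(value, dict):
        items = list(value.items())
    elif isinstance(value, list):
        items = [(f"[{i}]", v) for i, v in enumerate(value)]
    else:
        return [pad + clip(json.dumps(value), width)]

    lines = []
    for key, child in items[:height]:
        if isinstance(child, (dict, list)):
            if depth <= 1:
                # Depth budget spent: summarize instead of expanding.
                kind = type(child).__name__
                lines.append(f"{pad}{key}: <{kind}, {len(child)} items>")
            else:
                lines.append(f"{pad}{key}:")
                lines.extend(outline(child, width, height, depth - 1, indent + 1))
        else:
            lines.append(f"{pad}{key}: {clip(json.dumps(child), width)}")
    if len(items) > height:
        lines.append(f"{pad}... ({len(items) - height} more)")
    return lines
```

Calling `"\n".join(outline(data, width=20, height=5, depth=1))` gives a one-level view of `data`; raising `depth` expands more of the tree without changing the other two limits.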
The presets are shortcuts:
- --peek: structure only
- --skim: compact overview
- --deep: richer context
Markdown is a good example. It already carries semantic structure in headings and first paragraphs, so MFS can show a useful outline without asking an LLM to summarize the whole file.
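To make the Markdown case concrete, here is a minimal sketch of an outline built only from headings and the first line of text under each one, with no LLM involved. The function `markdown_outline`, its parameters, and its handling of fenced code are assumptions for illustration, not the behavior of mfs cat.

```python
import re

# ATX headings only ("# Title"); Setext headings are ignored in this sketch.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def clip(text, width):
    """Cap a string at `width` characters, marking the cut with '...'."""
    if len(text) <= width:
        return text
    return text[: max(width - 3, 0)].rstrip() + "..."


def markdown_outline(text, width=60, height=10, depth=3):
    """Return headings plus one preview line each (hypothetical helper).

    width  caps characters per heading or preview line,
    height caps the number of headings shown,
    depth  caps the heading level that is included.
    """
    out = []
    shown = 0
    want_preview = False
    preview_indent = ""
    in_fence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence  # skip fenced code entirely
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            want_preview = False
            level = len(match.group(1))
            if level > depth:
                continue
            if shown >= height:
                break
            out.append("  " * (level - 1) + clip(match.group(2), width))
            shown += 1
            want_preview = True
            preview_indent = "  " * level
        elif want_preview and stripped:
            # First non-empty line after a heading acts as its preview.
            out.append(preview_indent + clip(stripped, width))
            want_preview = False
    return out
```

A small `height` with a shallow `depth` approximates a structure-first view; larger values move toward a richer one.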
LLM summaries are still available as an optional enrichment path. They are not required for normal browsing, but they can help broad, macro-level queries land on the right file or directory.
During progressive browsing, agents should not rely only on MFS browse commands.
mfs cat, mfs ls, and mfs tree are designed to work alongside native Linux
tools such as grep, find, sed, awk, and shell pipelines.
3. Files Stay the Source of Truth
The user's files are the durable state. Milvus stores a derived index.
That keeps the system predictable:
- deleting ~/.mfs/ does not delete user knowledge
- mfs add . can rebuild the index from the actual files
- file deletion should remove derived index records
- Git, editors, grep, cat, and find continue to work normally
- project folders do not receive generated sidecar files
MFS state lives under ~/.mfs/ by default: config, queue, Milvus Lite database,
worker status, logs, and converted PDF/DOCX cache. The indexed project
directory stays clean.
Because files are the source of truth, the index must follow file changes. When files are added, edited, removed, or converted from PDF/DOCX, MFS needs to detect what changed, update the affected chunks, recompute embeddings where needed, and refresh directory summaries.
real files change
-> detect diff
-> update derived index
-> keep search and browse aligned with disk
4. Sync and Queueing Stay Lightweight
Synchronization exists because the index is derived state. If a user edits a file, deletes a note, appends a transcript, or converts a PDF/DOCX into cached Markdown, the Milvus rows must eventually match the real files again.
MFS makes synchronization explicit. The user or agent updates the index with mfs add . when they want the indexed view to catch up, or starts a watch loop with mfs add . --watch when the workflow needs continuous updates.
The sync path has three jobs:
flowchart TB
files[Real files<br/>source of truth] --> scan[Scan with ignore rules<br/>and file type policy]
scan --> diff[Diff against indexed<br/>source + file_hash]
diff --> remove[Delete rows<br/>for removed files]
diff --> changed[New or changed files]
changed --> chunks[Chunk changed content]
chunks --> queue[Queue lightweight chunk refs]
queue --> worker[Worker embeds and upserts]
worker --> dirs[Refresh affected<br/>directory summaries]
dirs --> usable[Progressively usable index]
The important part is that MFS compares two views:
left: current disk state
all files that exist now, after ignore rules and size limits
right: indexed state in Milvus
source path, file_hash, chunk metadata, summary rows
diff:
added -> chunk and embed
modified -> rechunk, compare chunk hashes, embed changed chunks
deleted -> remove old rows from Milvus
unchanged -> skip
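The two-view comparison above can be sketched in a few lines. This is an illustration under stated assumptions: both views are modeled as plain dicts from source path to file hash, SHA-256 stands in for whatever hash MFS uses, and the names `text_hash`, `diff_state`, and `chunks_to_embed` are hypothetical.

```python
import hashlib


def text_hash(text):
    """Hash text content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_state(disk, indexed):
    """Compare current disk state with indexed state.

    Both arguments map source path -> file hash. The result names the
    four outcomes from the diagram above.
    """
    return {
        "added": sorted(p for p in disk if p not in indexed),
        "modified": sorted(
            p for p in disk if p in indexed and disk[p] != indexed[p]
        ),
        "deleted": sorted(p for p in indexed if p not in disk),
        "unchanged": sorted(
            p for p in disk if p in indexed and disk[p] == indexed[p]
        ),
    }


def chunks_to_embed(new_chunks, indexed_chunk_hashes):
    """For a modified file, keep only chunks whose hash is not indexed yet."""
    return [c for c in new_chunks if text_hash(c) not in indexed_chunk_hashes]
```

Comparing chunk hashes for modified files means an edit to one paragraph only costs one new embedding, rather than re-embedding the whole file.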
The queue exists because embedding work should not block every default mfs add
call. It is intentionally much smaller than a broker system. MFS is a CLI tool,
so it avoids Redis, RabbitMQ, and a permanent daemon:
- queue: ~/.mfs/queue.json
- lock: filelock around queue writes
- worker: detached Python process
- progress: ~/.mfs/status.json
- logs: ~/.mfs/worker.log
Queue entries store references and metadata, not large raw chunk bodies. That
keeps the queue cheap to rewrite and avoids duplicating the user's corpus in
~/.mfs/. When the worker needs text, it reconstructs the chunk from the
current file and the queued range/hash metadata. If the file changed while work
was waiting, the next mfs add reconciles the derived index with the new file
state.
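A minimal sketch of the reference-only idea follows. It assumes chunks are addressed by 1-based inclusive line ranges and hashed with SHA-256; the field names `source`, `start`, `end`, and `chunk_hash` and the functions `make_ref` and `resolve_ref` are hypothetical, not the actual queue.json schema.

```python
import hashlib
import json
from pathlib import Path


def sha(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def read_range(path, start, end):
    """Return lines start..end (1-based, inclusive) joined with newlines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1 : end])


def make_ref(path, start, end):
    """Build a queue entry that points at a chunk without copying its body."""
    body = read_range(path, start, end)
    return {
        "source": str(path),
        "start": start,
        "end": end,
        "chunk_hash": sha(body),
    }


def resolve_ref(ref):
    """Rebuild the chunk text from the current file.

    Returns None when the file is gone or the range no longer matches the
    queued hash, so a stale entry is skipped instead of embedding wrong text.
    """
    path = Path(ref["source"])
    if not path.exists():
        return None
    body = read_range(path, ref["start"], ref["end"])
    return body if sha(body) == ref["chunk_hash"] else None


def dump_queue(refs):
    """Serialize entries; size depends on the entry count, not chunk length."""
    return json.dumps(refs, indent=2)
```

In this sketch a stale reference is simply dropped by the worker, and the next sync pass is what brings the index back in line with the edited file.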
The worker exits after the queue is empty. If a machine stops mid-index, MFS
does not try to provide database-grade job durability; the recovery path is to
rerun mfs add or use mfs add --force. That tradeoff keeps the system easy to
install and easy to delete: the authoritative data is still the user's files.
For large corpora, the index should become useful before every file is done.
MFS therefore prioritizes likely high-value files first: entry files like
README.md and SKILL.md, package metadata, source roots, documentation roots,
recently changed files, and then lower-value generated or fixture paths. The
final index is the same; the early search experience is better.
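One way to express that ordering is a tier function plus a sort. The sketch below is an assumption-heavy illustration: README.md and SKILL.md come from the text above, but the metadata file names, directory names, tier numbers, and the helper names `tier` and `prioritize` are all hypothetical.

```python
from pathlib import PurePosixPath

# Illustrative name sets; only README.md and SKILL.md come from the text.
ENTRY_NAMES = {"README.md", "SKILL.md"}
METADATA_NAMES = {"pyproject.toml", "package.json", "Cargo.toml", "go.mod"}
SOURCE_ROOTS = {"src", "lib"}
DOC_ROOTS = {"docs", "doc"}
LOW_VALUE_DIRS = {"fixtures", "generated", "node_modules", "dist", "build"}


def tier(path):
    """Return a priority tier for a relative path; lower is indexed sooner."""
    p = PurePosixPath(path)
    dirs = p.parts[:-1]
    if set(dirs) & LOW_VALUE_DIRS:
        return 5  # generated or fixture paths go last
    if p.name in ENTRY_NAMES:
        return 0
    if p.name in METADATA_NAMES:
        return 1
    if dirs and dirs[0] in SOURCE_ROOTS:
        return 2
    if dirs and dirs[0] in DOC_ROOTS:
        return 3
    return 4  # everything else, ordered by recency below


def prioritize(paths, mtimes):
    """Order paths by tier, then most recently modified, then name.

    mtimes maps path -> modification time; missing entries sort as oldest.
    """
    return sorted(paths, key=lambda p: (tier(p), -mtimes.get(p, 0.0), p))
```

Because only the order changes, every path is still processed; the set of indexed files at the end is identical to an unsorted pass.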
This supports several sync styles:
| Scenario | Sync style |
|---|---|
| one-time project indexing | mfs add . |
| stronger rebuild pass | mfs add . --force |
| active memory/log append workflow | mfs add . --watch |
| frequent project edits | mfs add . --watch --interval 60s |
5. Everything Should Become Searchable
AI workspaces already contain many kinds of knowledge: memory files, raw transcripts, codebases, product documents, customer notes, PDFs, tables, runbooks, and SKILL trees. In enterprise environments this information will only grow. It should be searchable infrastructure, not a pile of disconnected files that every agent has to rediscover from scratch.
MFS starts with the file types agents use today: Markdown, source code, text, PDF, DOCX, JSON, JSONL, CSV, directories, and images with generated descriptions. The goal is not only "search text files"; it is to make useful workspace knowledge addressable by agents.
Different formats enter the system in different ways:
- Markdown, code, text, PDF, and DOCX can be embedded as body chunks.
- PDF and DOCX are converted to Markdown first and cached, so these awkward document formats can be handled directly instead of requiring manual exports.
- JSON, JSONL, CSV, YAML, TOML, HTML, and logs are readable through structured browse views and searchable with mfs grep.
- Images can become searchable through mfs add --describe, which stores a VLM text description in the same collection.
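The list above amounts to a routing decision per file type, which can be sketched as a lookup table. The route labels, the extension lists (especially which suffixes count as code or images), the "unknown" fallback, and the function name `route` are assumptions for illustration and do not describe MFS internals.

```python
from pathlib import PurePosixPath

# Hypothetical route labels, one per bullet in the list above.
EMBED = "embed_body_chunks"
CONVERT = "convert_to_markdown_then_embed"
STRUCTURED = "structured_browse_and_grep"
DESCRIBE = "describe_with_vlm"

ROUTES = {}
for ext in (".md", ".txt", ".py", ".ts", ".go", ".rs", ".java"):
    ROUTES[ext] = EMBED
for ext in (".pdf", ".docx"):
    ROUTES[ext] = CONVERT
for ext in (".json", ".jsonl", ".csv", ".yaml", ".yml", ".toml", ".html", ".log"):
    ROUTES[ext] = STRUCTURED
for ext in (".png", ".jpg", ".jpeg"):
    ROUTES[ext] = DESCRIBE


def route(path):
    """Pick an ingestion route from the file suffix (case-insensitive)."""
    return ROUTES.get(PurePosixPath(path).suffix.lower(), "unknown")
```

Keeping the mapping in one table makes it easy to see which formats share a path and where a new format would slot in.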
The long-term direction is broader: every useful local artifact should have a searchable representation. Images are the current multimodal path through VLM descriptions; future formats could include video, audio, music, screenshots, and other media. The principle is the same whether the source is text, a transcript, a table, a PDF, an image, or future media types: keep the original file as truth, derive the searchable surface, and let the agent search broadly before browsing precisely.