# Architecture
MFS is a small CLI layer around local files and Milvus. It has two major data paths:
- ingest: turn files into indexed chunks, summaries, and metadata
- retrieve: combine indexed search with live filesystem browsing
## System Position

```mermaid
flowchart TB
    agent[Shell-based agents<br/>Codex, Claude Code, OpenCode, custom tools]
    skill[MFS Skill<br/>search, browse, verify workflow]
    cli[MFS CLI<br/>add, search, grep, ls, tree, cat]
    files[Local files<br/>memory, skills, transcripts, code, docs]
    state[MFS state<br/>~/.mfs/config.toml<br/>queue.json, status.json, converted cache]
    milvus[Milvus / Zilliz Cloud<br/>dense vectors, BM25, metadata filters]

    agent --> skill
    skill --> cli
    agent --> cli
    cli --> files
    cli --> state
    cli --> milvus
    milvus --> cli
```
MFS does not mount a filesystem and does not require an always-on service. The
project directory stays clean; derived state lives under `~/.mfs/` by default.
## Command Surface

```
        ┌─────────────── mfs ────────────────┐
        │                                    │
ingest  │  add      status     remove        │
        │                                    │
search  │  search   grep                     │
        │                                    │
browse  │  ls       tree       cat           │
        │                                    │
config  │  config   path/show/get/set/init   │
        └────────────────────────────────────┘
```
Search commands read indexed state. Browse commands read the live filesystem and can be used before indexing.
## Ingest Path

`mfs add <path...>` scans files, detects changes, builds chunk tasks, and either
processes them in the foreground or hands them to a short-lived worker.
```mermaid
flowchart LR
    input[paths] --> scan[Scanner<br/>ignore rules, extension policy, size limit]
    scan --> diff[Diff<br/>disk files vs indexed sources]
    diff --> added[added / modified]
    diff --> deleted[deleted]
    deleted --> drop[delete rows by source]
    added --> convert{convertible?}
    convert -- PDF/DOCX --> md[Markdown converter<br/>cached in ~/.mfs/converted]
    convert -- text/code/markdown --> chunk[Chunker]
    md --> chunk
    chunk --> tasks[QueueTask records]
    tasks --> mode{--sync?}
    mode -- yes --> inline[inline batch processing]
    mode -- no --> queue[~/.mfs/queue.json]
    queue --> worker[detached worker]
    inline --> embed[Embedding provider]
    worker --> embed
    embed --> store[Milvus upsert]
    store --> dirs[rebuild directory summaries]
```
Key points:

- `mtime` is used as a fast hint; the file hash is the content check. `--force`
  skips the mtime shortcut and recomputes hashes.
- Modified files only re-queue chunks that changed; unchanged chunk IDs are
  preserved.
- PDF and DOCX files are converted to Markdown before chunking or summarization.
- `--summarize` and `--describe` add enrichment tasks; both are opt-in.
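The change-detection step can be sketched as follows. Here `indexed` stands in for the `{source, file_hash}` records read back from Milvus, and the record shape and function name are illustrative, not MFS's actual internals:

```python
import hashlib
import os

def needs_reindex(path, indexed, force=False):
    """Decide whether a file must be re-chunked and re-embedded.

    `indexed` maps source path -> {"mtime": float, "hash": str}
    (a hypothetical shape for the records read back from Milvus).
    """
    record = indexed.get(path)
    if record is None:
        return True  # never indexed before
    if not force and os.path.getmtime(path) == record["mtime"]:
        return False  # fast path: mtime unchanged, skip hashing entirely
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest != record["hash"]  # content hash is the real check
```

Note that `--force` only disables the mtime shortcut; a file whose hash still matches is not re-embedded.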
## Queue and Worker

The async path is intentionally simple.
```
mfs add .
  ├─ scan, diff, chunk
  ├─ write lightweight QueueTask records to ~/.mfs/queue.json
  ├─ start worker if one is not running
  └─ return to caller

worker
  ├─ dequeue batch
  ├─ restore chunk text from source file when task_type=embed_ref
  ├─ call LLM/VLM only for opt-in summary/description tasks
  ├─ embed texts in batches
  ├─ upsert rows into Milvus
  ├─ update ~/.mfs/status.json
  ├─ rebuild touched directory summaries
  └─ exit when queue is empty
```
The queue is not a durable broker. It stores references and metadata instead of
large raw chunk bodies. If the machine stops mid-index, run `mfs add .` again or
use `mfs add . --force`.
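A QueueTask record in this scheme might look like the sketch below. The field names are hypothetical; only `task_type=embed_ref` and the references-not-bodies design are taken from the description above:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class QueueTask:
    # Hypothetical fields; the real schema lives in the MFS source.
    task_type: str    # "embed_ref" for body chunks; summary/description tasks are opt-in
    source: str       # file to restore the chunk text from
    chunk_index: int  # which chunk of the file this task covers
    start_line: int
    end_line: int
    file_hash: str    # lets the worker detect a file that changed after enqueue

# One small JSON record per task keeps queue.json light: the worker
# re-reads the chunk text from the source file instead of the queue.
task = QueueTask("embed_ref", "docs/guide.md", 3, 120, 158, "9f2c")
line = json.dumps(asdict(task))
```

Because the queue holds only references, a crashed run loses nothing that the files themselves cannot reproduce.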
Priority ordering is applied before async enqueue:
```mermaid
flowchart TB
    root[Entry files<br/>README, SKILL, CLAUDE, INDEX] --> pkg[Package and build metadata]
    pkg --> src[Source roots<br/>src, lib, app, services]
    src --> docs[Docs and references<br/>docs, guides, manuals]
    docs --> examples[Examples and notebooks]
    examples --> tests[Tests, fixtures, generated, vendor]
```
This does not change the final index. It only makes large corpora useful earlier while embedding is still running.
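One way to implement that tiering is a sort key computed from the path. The tier boundaries below mirror the diagram; the matching rules themselves are illustrative, not MFS's actual ones:

```python
from pathlib import Path

def priority(path: str) -> int:
    """Sort key for enqueue order: lower values are embedded first."""
    p = Path(path)
    dirs = {part.lower() for part in p.parts[:-1]}  # directory names only
    if p.stem.upper() in {"README", "SKILL", "CLAUDE", "INDEX"}:
        return 0  # entry files
    if p.name in {"pyproject.toml", "package.json", "Cargo.toml", "go.mod"}:
        return 1  # package and build metadata
    if dirs & {"src", "lib", "app", "services"}:
        return 2  # source roots
    if dirs & {"docs", "guides", "manuals"}:
        return 3  # docs and references
    if dirs & {"tests", "fixtures", "generated", "vendor"}:
        return 5  # last: tests, fixtures, generated, vendor
    return 4  # everything else, including examples and notebooks
```

The enqueue step would then sort pending tasks by `priority(task.source)` before writing the queue, so entry files surface in search first.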
## Milvus Collection Model

All searchable records share one collection.
```
mfs_chunks
  id             primary key, deterministic chunk id
  source         original file or directory path
  parent_dir     parent directory path
  chunk_index    body chunk index, -1 for generated enrichment, 0 for dirs
  start_line     source start line
  end_line       source end line
  chunk_text     searchable text, analyzer-enabled for BM25
  dense_vector   embedding vector
  sparse_vector  BM25 sparse vector generated by Milvus
  content_type   markdown, code, text, llm_summary, vlm_description, directory
  file_hash      source file hash
  is_dir         directory summary marker
  embed_status   complete / pending
  metadata       JSON details such as headings, language, stale state
  account_id     tenant or namespace label
```
```mermaid
flowchart LR
    body[Body chunks<br/>chunk_index >= 0] --> collection[(Milvus collection)]
    summary[LLM summaries<br/>chunk_index = -1] --> collection
    image[VLM image descriptions<br/>chunk_index = -1] --> collection
    dir[Directory summaries<br/>is_dir = true] --> collection
    collection --> dense[dense vector index]
    collection --> sparse[BM25 sparse index]
    collection --> scalar[source/path/content filters]
```
Directory summaries are records too. They let broad queries return a directory
when the directory is the right navigation target, and they feed the `mfs ls` /
`mfs tree` previews.
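The deterministic `id` is what makes re-indexing idempotent: an unchanged chunk upserts the same row instead of creating a duplicate. A minimal sketch of one possible scheme (the actual derivation in MFS may differ):

```python
import hashlib

def chunk_id(source: str, chunk_index: int, account_id: str = "default") -> str:
    """Derive a stable id from (account, source, index), so the same
    chunk always maps to the same Milvus primary key."""
    key = f"{account_id}:{source}:{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]
```

Under any scheme like this, `mfs add` run twice on unchanged files writes the same ids, and Milvus upsert semantics keep the collection duplicate-free.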
## Retrieval Path

`mfs search` and `mfs grep` use different routes.
```mermaid
flowchart TB
    query[query] --> mode{search mode}
    mode -- semantic --> emb[embed query]
    emb --> dense[dense vector search]
    mode -- keyword --> bm25[BM25 keyword search]
    mode -- hybrid --> both[dense + BM25]
    both --> rrf[RRF fusion]
    dense --> post[post-process stale paths]
    bm25 --> post
    rrf --> post
    post --> hits[ranked hits<br/>source, lines, content, score, metadata]
```
Search modes:
| Mode | Route | Best for |
|---|---|---|
| `hybrid` | dense + BM25 + reciprocal rank fusion | default agent search |
| `semantic` | dense vector only | paraphrased or conceptual queries |
| `keyword` | BM25 only | identifiers, exact terms, error codes |
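Reciprocal rank fusion itself is small enough to sketch. This version assumes each route returns ids in rank order; `k=60` is the conventional RRF constant, not necessarily what MFS uses:

```python
def rrf_fuse(dense_ids, bm25_ids, k=60):
    """Fuse two rank-ordered id lists with reciprocal rank fusion:
    score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranked in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ids found by only one route still qualify.
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["a", "b", "c"], ["b", "d"])  # "b" wins: top-2 in both lists
```

Rank-based fusion sidesteps the fact that cosine distances and BM25 scores live on incomparable scales.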
`mfs grep` is exact search. It first uses indexed BM25 to find likely indexed
files, then verifies matches by reading file lines. For unindexed files under a
scoped directory, it can fall back to system grep.
```mermaid
flowchart LR
    pattern[pattern] --> prefilter[Milvus BM25 prefilter]
    prefilter --> indexed[read indexed files and regex match lines]
    pattern --> unindexed[system grep over unindexed scoped files]
    indexed --> matches[grep matches with context lines]
    unindexed --> matches
```
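The verification stage matters because BM25 tokenization can both miss and over-match an exact pattern. A sketch of that second stage, with a `read_lines` hook made injectable for testing (a hypothetical helper; the real implementation lives in the MFS source):

```python
import re

def verify_matches(pattern, candidate_files, context=2, read_lines=None):
    """Confirm BM25-suggested files by regex-matching real file lines.

    Returns one record per matching line, with surrounding context lines.
    """
    rx = re.compile(pattern)
    if read_lines is None:
        def read_lines(path):
            with open(path, encoding="utf-8", errors="replace") as f:
                return f.read().splitlines()
    matches = []
    for path in candidate_files:
        lines = read_lines(path)
        for i, line in enumerate(lines):
            if rx.search(line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                matches.append({"source": path, "line": i + 1,
                                "context": lines[lo:hi]})
    return matches
```

Every match reported to the agent has therefore been confirmed against the file on disk, never against possibly-stale index text.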
## Browse Path

Browse commands read files directly. They are not just wrappers around search.
```mermaid
flowchart TB
    path[path] --> command{command}
    command -- ls --> children[directory children]
    command -- tree --> recursive[recursive directory tree]
    command -- cat --> file[file content]
    children --> density[W/H/D density renderer]
    recursive --> density
    file --> parser{file shape}
    parser -- Markdown --> headings[heading tree]
    parser -- Code --> symbols[symbols and excerpts]
    parser -- JSON/JSONL --> keys[keys, rows, nested values]
    parser -- CSV --> table[headers and sample rows]
    parser -- PDF/DOCX --> converted[converted Markdown]
    headings --> density
    symbols --> density
    keys --> density
    table --> density
    converted --> density
    density --> output[text or JSON]
```
The density controls are shared:
| Control | Meaning |
|---|---|
| `-W` | width: characters per node, value, paragraph, or summary |
| `-H` | height: number of headings, rows, entries, or children |
| `-D` | depth: nested levels to expand |
This is the "look once" layer between `ls` and a full-file `cat`.
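Applied to a heading tree, the three budgets compose like the sketch below; the `{"text", "children"}` node shape is illustrative, not MFS's actual parser output:

```python
def densify(node, W=80, H=5, D=2):
    """Apply the -W/-H/-D budget to a nested outline node of the
    (hypothetical) shape {"text": str, "children": [node, ...]}."""
    out = {"text": node["text"][:W]}          # -W caps characters per entry
    if D > 0 and node.get("children"):
        kids = node["children"][:H]           # -H caps entries per level
        out["children"] = [densify(k, W, H, D - 1) for k in kids]
        dropped = len(node["children"]) - len(kids)
        if dropped:
            out["more"] = dropped             # signal what was truncated
    return out
```

The same budget logic can render directory children, CSV rows, or JSON keys; only the node shape differs per parser.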
## Sync and State

```mermaid
sequenceDiagram
    participant U as user or agent
    participant CLI as mfs add
    participant FS as filesystem
    participant M as Milvus
    participant Q as queue.json
    participant W as worker

    U->>CLI: mfs add .
    CLI->>FS: scan current files
    CLI->>M: read indexed {source, file_hash}
    CLI->>CLI: compute added / modified / deleted
    CLI->>M: delete stale rows
    CLI->>Q: enqueue changed chunk refs
    CLI->>W: start if needed
    W->>Q: dequeue batch
    W->>FS: restore chunk text
    W->>M: embed and upsert
    W->>M: rebuild directory summaries
    W->>Q: exit when empty
```
Default state layout:
```
~/.mfs/
  config.toml
  milvus.db
  queue.json
  queue.json.lock
  status.json
  worker.log
  converted/
    <hash>.md
```
The collection can always be rebuilt from the files; the files, not the index, remain the durable source.