Architecture

MFS is a small CLI layer around local files and Milvus. It has two major data paths:

  • ingest: turn files into indexed chunks, summaries, and metadata
  • retrieve: combine indexed search with live filesystem browsing

System Position

flowchart TB
  agent[Shell-based agents<br/>Codex, Claude Code, OpenCode, custom tools]
  skill[MFS Skill<br/>search, browse, verify workflow]
  cli[MFS CLI<br/>add, search, grep, ls, tree, cat]
  files[Local files<br/>memory, skills, transcripts, code, docs]
  state[MFS state<br/>~/.mfs/config.toml<br/>queue.json, status.json, converted cache]
  milvus[Milvus / Zilliz Cloud<br/>dense vectors, BM25, metadata filters]

  agent --> skill
  skill --> cli
  agent --> cli
  cli --> files
  cli --> state
  cli --> milvus
  milvus --> cli

MFS does not mount a filesystem and does not require an always-on service. The project directory stays clean; derived state lives under ~/.mfs/ by default.

Command Surface

                ┌─────────────── mfs ───────────────┐
                │                                    │
        ingest  │  add      status      remove       │
                │                                    │
        search  │  search   grep                    │
                │                                    │
        browse  │  ls       tree       cat           │
                │                                    │
        config  │  config path/show/get/set/init     │
                └────────────────────────────────────┘

Search commands read indexed state. Browse commands read the live filesystem and can be used before indexing.

Ingest Path

mfs add <path...> scans files, detects changes, builds chunk tasks, and either processes them in the foreground or hands them to a short-lived worker.

flowchart LR
  input[paths] --> scan[Scanner<br/>ignore rules, extension policy, size limit]
  scan --> diff[Diff<br/>disk files vs indexed sources]
  diff --> added[added / modified]
  diff --> deleted[deleted]
  deleted --> drop[delete rows by source]
  added --> convert{convertible?}
  convert -- PDF/DOCX --> md[Markdown converter<br/>cached in ~/.mfs/converted]
  convert -- text/code/markdown --> chunk[Chunker]
  md --> chunk
  chunk --> tasks[QueueTask records]
  tasks --> mode{--sync?}
  mode -- yes --> inline[inline batch processing]
  mode -- no --> queue[~/.mfs/queue.json]
  queue --> worker[detached worker]
  inline --> embed[Embedding provider]
  worker --> embed
  embed --> store[Milvus upsert]
  store --> dirs[rebuild directory summaries]

Key points:

  • mtime is used as a fast hint; file hash is the content check.
  • --force skips the mtime shortcut and recomputes hashes.
  • Modified files re-queue only the chunks that changed; unchanged chunk IDs are preserved.
  • PDF and DOCX are converted to Markdown before chunking or summarization.
  • --summarize and --describe add enrichment tasks; they are opt-in.
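The change-detection rules above can be sketched roughly as follows. This is an illustration under assumptions, not the MFS implementation: `IndexedSource`, `hash_file`, and `diff_sources` are hypothetical names, SHA-256 is an assumed hash, and the sketch assumes the last-seen mtime is stored next to the file hash.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class IndexedSource:
    """Hypothetical view of what the index remembers about one file."""
    file_hash: str
    mtime: float  # assumption: the last-seen mtime is kept with the hash


def hash_file(path: str) -> str:
    """Content hash of a file (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def diff_sources(disk_paths, indexed, force=False):
    """Split paths into added, modified, and deleted lists.

    mtime is only a hint for skipping the hash; the hash decides.
    With force=True the mtime shortcut is skipped and every file is hashed.
    """
    added, modified = [], []
    for path in disk_paths:
        known = indexed.get(path)
        if known is None:
            added.append(path)
            continue
        if not force and os.path.getmtime(path) == known.mtime:
            continue  # fast path: same mtime, treat content as unchanged
        if hash_file(path) != known.file_hash:
            modified.append(path)
    on_disk = set(disk_paths)
    deleted = [p for p in indexed if p not in on_disk]
    return added, modified, deleted
```

In this sketch a touched file with identical content is hashed once and then skipped, while a file whose mtime was preserved despite a content change is only caught with `force=True`.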

Queue and Worker

The async path is intentionally simple.

mfs add .
  ├─ scan, diff, chunk
  ├─ write lightweight QueueTask records to ~/.mfs/queue.json
  ├─ start worker if one is not running
  └─ return to caller

worker
  ├─ dequeue batch
  ├─ restore chunk text from source file when task_type=embed_ref
  ├─ call LLM/VLM only for opt-in summary/description tasks
  ├─ embed texts in batches
  ├─ upsert rows into Milvus
  ├─ update ~/.mfs/status.json
  ├─ rebuild touched directory summaries
  └─ exit when queue is empty

The queue is not a durable broker. It stores references and metadata instead of large raw chunk bodies. If the machine stops mid-index, run mfs add . again or use mfs add . --force.
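A minimal sketch of what "references instead of chunk bodies" can look like. `QueueTask` and `task_type=embed_ref` come from the description above; the field set shown here, the 1-based inclusive line range, and the helper names `restore_chunk_text` and `serialize` are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QueueTask:
    """Illustrative task record; the real field set is an assumption."""
    task_type: str   # e.g. "embed_ref"
    chunk_id: str
    source: str      # path of the file the chunk came from
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def restore_chunk_text(task: QueueTask) -> str:
    """Re-read the chunk body from its source file using the stored line range."""
    with open(task.source, encoding="utf-8") as fh:
        lines = fh.readlines()
    return "".join(lines[task.start_line - 1 : task.end_line])


def serialize(tasks) -> str:
    """What would be written to the queue file: references only, no chunk text."""
    return json.dumps([asdict(task) for task in tasks])
```

Because the task carries no text, the queue file stays small. The trade-off is that the source file must still be readable, and unchanged, when the worker dequeues the task.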

Priority ordering is applied before async enqueue:

flowchart TB
  root[Entry files<br/>README, SKILL, CLAUDE, INDEX] --> pkg[Package and build metadata]
  pkg --> src[Source roots<br/>src, lib, app, services]
  src --> docs[Docs and references<br/>docs, guides, manuals]
  docs --> examples[Examples and notebooks]
  examples --> tests[Tests, fixtures, generated, vendor]

The ordering does not change the final index. It only makes large corpora useful earlier, while embedding is still running.
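One way to express the tiers in the chart is a sort key applied before enqueue. This is a sketch under assumptions: the function names, the list of package files, the tier for unmatched files, and the choice to check low-priority directories first are all hypothetical.

```python
from pathlib import PurePosixPath

ENTRY_STEMS = {"readme", "skill", "claude", "index"}
# Assumption: a few common package/build files; the real list is not documented here.
PACKAGE_FILES = {"pyproject.toml", "package.json", "cargo.toml", "go.mod", "makefile"}
SOURCE_ROOTS = {"src", "lib", "app", "services"}
DOC_ROOTS = {"docs", "guides", "manuals"}
EXAMPLE_ROOTS = {"examples", "notebooks"}
LOW_ROOTS = {"tests", "fixtures", "generated", "vendor"}


def priority_tier(path: str) -> int:
    """Return 0 (index first) through 5 (index last) for a relative path."""
    p = PurePosixPath(path)
    dirs = {part.lower() for part in p.parts[:-1]}
    if dirs & LOW_ROOTS:
        return 5  # assumption: a README under tests/ still counts as low priority
    if p.stem.lower() in ENTRY_STEMS:
        return 0
    if p.name.lower() in PACKAGE_FILES:
        return 1
    if dirs & SOURCE_ROOTS:
        return 2
    if dirs & DOC_ROOTS:
        return 3
    if dirs & EXAMPLE_ROOTS:
        return 4
    return 3  # assumption: unmatched files land in the middle


def order_for_enqueue(paths):
    """Reorder paths by tier; the set of paths is unchanged."""
    return sorted(paths, key=lambda path: (priority_tier(path), path))
```

The function only reorders its input, which matches the point above: the same files end up indexed, just in a more useful sequence.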

Milvus Collection Model

All searchable records share one collection.

mfs_chunks
  id             primary key, deterministic chunk id
  source         original file or directory path
  parent_dir     parent directory path
  chunk_index    body chunk index, -1 for generated enrichment, 0 for dirs
  start_line     source start line
  end_line       source end line
  chunk_text     searchable text, analyzer-enabled for BM25
  dense_vector   embedding vector
  sparse_vector  BM25 sparse vector generated by Milvus
  content_type   markdown, code, text, llm_summary, vlm_description, directory
  file_hash      source file hash
  is_dir         directory summary marker
  embed_status   complete / pending
  metadata       JSON details such as headings, language, stale state
  account_id     tenant or namespace label

flowchart LR
  body[Body chunks<br/>chunk_index >= 0] --> collection[(Milvus collection)]
  summary[LLM summaries<br/>chunk_index = -1] --> collection
  image[VLM image descriptions<br/>chunk_index = -1] --> collection
  dir[Directory summaries<br/>is_dir = true] --> collection
  collection --> dense[dense vector index]
  collection --> sparse[BM25 sparse index]
  collection --> scalar[source/path/content filters]

Directory summaries are records too. They let broad queries return a directory when the directory is the right navigation target, and they feed mfs ls / mfs tree previews.
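The schema describes `id` as a deterministic chunk id, which is what lets re-indexing upsert over existing rows instead of duplicating them. The exact derivation is not stated here; the sketch below assumes one plausible scheme, a hash over the source path, content type, and chunk text. Both function names are hypothetical.

```python
import hashlib


def chunk_id(source: str, content_type: str, chunk_text: str) -> str:
    """Derive a stable ID from where the chunk lives and what it says."""
    digest = hashlib.sha256()
    for part in (source, content_type, chunk_text):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from hashing the same.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()[:32]


def changed_chunk_ids(old_ids: set, new_ids: set):
    """Return (IDs to upsert, IDs to delete) after a file is re-chunked."""
    return new_ids - old_ids, old_ids - new_ids
```

With content-derived IDs, editing one paragraph changes one ID and leaves the rest untouched. A limitation of this particular scheme is that two identical chunks in the same file would collide, so a real implementation would need a tiebreaker.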

Retrieval Path

mfs search and mfs grep use different routes.

flowchart TB
  query[query] --> mode{search mode}
  mode -- semantic --> emb[embed query]
  emb --> dense[dense vector search]
  mode -- keyword --> bm25[BM25 keyword search]
  mode -- hybrid --> both[dense + BM25]
  both --> rrf[RRF fusion]
  dense --> post[post-process stale paths]
  bm25 --> post
  rrf --> post
  post --> hits[ranked hits<br/>source, lines, content, score, metadata]

Search modes:

Mode      Route                                   Best for
hybrid    dense + BM25 + reciprocal rank fusion   default agent search
semantic  dense vector only                       paraphrased or conceptual queries
keyword   BM25 only                               identifiers, exact terms, error codes
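Reciprocal rank fusion, used by the hybrid mode, merges ranked lists by rank position rather than raw score, so dense similarity and BM25 scores never need to share a scale. The sketch below shows the standard formula. The function name is hypothetical and `k=60` is the conventional constant, not necessarily the value MFS or Milvus uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with reciprocal rank fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties break on the ID for stable output.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]  # IDs ordered by vector similarity
bm25_hits = ["b", "d", "a"]   # IDs ordered by BM25 score
fused = rrf_fuse([dense_hits, bm25_hits])
```

An ID that ranks well in both lists beats one that tops only a single list, which is why hybrid is a reasonable default when the query style is not known in advance.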

mfs grep is exact search. It first runs a BM25 query against the index to find candidate files, then verifies matches by reading those files line by line. For unindexed files under a scoped directory, it can fall back to system grep.

flowchart LR
  pattern[pattern] --> prefilter[Milvus BM25 prefilter]
  prefilter --> indexed[read indexed files and regex match lines]
  pattern --> unindexed[system grep over unindexed scoped files]
  indexed --> matches[grep matches with context lines]
  unindexed --> matches
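The verification half of the indexed route can be sketched as below. The BM25 prefilter is assumed to have already produced `candidate_files`; the function name, the result shape, and the `context` parameter are hypothetical.

```python
import re


def verify_matches(pattern, candidate_files, context=0):
    """Confirm prefilter candidates by reading each file and matching every line."""
    regex = re.compile(pattern)
    matches = []
    for path in candidate_files:
        try:
            with open(path, encoding="utf-8", errors="replace") as fh:
                lines = fh.read().splitlines()
        except OSError:
            continue  # file vanished or became unreadable since indexing
        for i, line in enumerate(lines):
            if regex.search(line):
                lo = max(0, i - context)
                hi = min(len(lines), i + context + 1)
                matches.append({
                    "source": path,
                    "line": i + 1,  # 1-based line number
                    "text": line,
                    "context": lines[lo:hi],
                })
    return matches
```

The prefilter narrows the set of files to open, and the regex pass over live file contents decides what counts as a match. A stale index entry can therefore cost a wasted read but cannot produce a false hit.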

Browse Path

Browse commands read files directly. They are not just wrappers around search.

flowchart TB
  path[path] --> command{command}
  command -- ls --> children[directory children]
  command -- tree --> recursive[recursive directory tree]
  command -- cat --> file[file content]
  children --> density[W/H/D density renderer]
  recursive --> density
  file --> parser{file shape}
  parser -- Markdown --> headings[heading tree]
  parser -- Code --> symbols[symbols and excerpts]
  parser -- JSON/JSONL --> keys[keys, rows, nested values]
  parser -- CSV --> table[headers and sample rows]
  parser -- PDF/DOCX --> converted[converted Markdown]
  headings --> density
  symbols --> density
  keys --> density
  table --> density
  converted --> density
  density --> output[text or JSON]

The density controls are shared:

Control  Meaning
-W       width: characters per node, value, paragraph, or summary
-H       height: number of headings, rows, entries, or children
-D       depth: nested levels to expand

This is the "look once" layer between ls and full-file cat.
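To make the three controls concrete, here is a small sketch of a density-style trimmer over nested data such as parsed JSON. It is not the MFS renderer: the `render` function, its defaults, and the placeholder strings are hypothetical, with `width`, `height`, and `depth` standing in for -W, -H, and -D.

```python
def render(node, width=40, height=5, depth=2):
    """Trim a nested structure: width caps text, height caps siblings, depth caps nesting."""
    if isinstance(node, dict):
        if depth <= 0:
            return f"{{... {len(node)} keys}}"
        items = list(node.items())
        out = {key: render(value, width, height, depth - 1) for key, value in items[:height]}
        if len(items) > height:
            out["..."] = f"{len(items) - height} more"
        return out
    if isinstance(node, list):
        if depth <= 0:
            return f"[... {len(node)} items]"
        out = [render(value, width, height, depth - 1) for value in node[:height]]
        if len(node) > height:
            out.append(f"... {len(node) - height} more")
        return out
    # Leaf values are shown as text and cut to the width budget.
    text = str(node)
    return text if len(text) <= width else text[: max(width - 3, 0)] + "..."


data = {"a": "x" * 50, "b": {"c": {"d": 1}}, "e": [1, 2, 3, 4]}
preview = render(data, width=10, height=2, depth=2)
```

Each control bounds one dimension of the output, so the cost of a preview stays predictable regardless of file size.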

Sync and State

sequenceDiagram
  participant U as user or agent
  participant CLI as mfs add
  participant FS as filesystem
  participant M as Milvus
  participant Q as queue.json
  participant W as worker

  U->>CLI: mfs add .
  CLI->>FS: scan current files
  CLI->>M: read indexed {source, file_hash}
  CLI->>CLI: compute added / modified / deleted
  CLI->>M: delete stale rows
  CLI->>Q: enqueue changed chunk refs
  CLI->>W: start if needed
  W->>Q: dequeue batch
  W->>FS: restore chunk text
  W->>M: embed and upsert
  W->>M: rebuild directory summaries
  W->>Q: exit when empty

Default state layout:

~/.mfs/
  config.toml
  milvus.db
  queue.json
  queue.json.lock
  status.json
  worker.log
  converted/
    <hash>.md

The collection can be rebuilt from files. The files remain the durable source.