May 19, 2026

SemanticFS — Local-First Semantic Search for Personal Documents

SemanticFS is a local-first CLI tool that indexes Markdown, text, and text-based PDF files, then lets users search them by meaning instead of exact keywords. It keeps embeddings, metadata, and vector search fully on the user’s machine.

Role
Solo developer — designed and implemented the indexing pipeline, storage model, CLI commands, retrieval flow, and validation suite.
Stack
Python · Typer · Rich · SQLite · ChromaDB · sentence-transformers · PyMuPDF · pytest

Problem

Keyword search fails when the user remembers what a note means but not its exact wording, filename, or folder. Most semantic search tools also depend on cloud APIs, which makes them a poor fit for private notes, runbooks, and other local documents.

Solution

I built a Python CLI that scans local folders, extracts text, chunks documents, generates local embeddings, stores metadata in SQLite, stores vectors in ChromaDB, and returns ranked semantic search results with similarity scores and rule-based “why matched” explanations.
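
To make the chunking step concrete, here is a minimal sketch of an overlapping word-window chunker. It is an illustration rather than the project's actual code: the `Chunk` type, the `chunk_words` name, and the default `size` and `overlap` values are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int       # position of the chunk within its document
    start_word: int  # offset of the chunk's first word
    text: str


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split text into windows of `size` words that share `overlap` words.

    Hypothetical helper: window size and overlap are illustrative defaults.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(Chunk(len(chunks), start, " ".join(window)))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of embedding some words twice.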

Decisions

  • Used local sentence-transformers instead of cloud embeddings to keep private documents offline
  • Used SQLite for inspectable metadata and ChromaDB for zero-config local vector search
  • Chose a CLI-first implementation over a desktop UI to validate the retrieval engine first
  • Added benchmark and chunk-quality checks before expanding into a file watcher, OCR, or hybrid search
  • Kept “why matched” rule-based instead of LLM-generated to avoid adding an external API dependency
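
A rule-based explanation of the kind described in the last decision could look roughly like the sketch below. The specific rules, the stopword list, the `explain_match` name, and the 0.6 score threshold are assumptions for illustration, not the project's actual rule set.

```python
import re

# Hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on", "for", "and", "or", "is", "how"}


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def explain_match(query: str, chunk_text: str, file_name: str, score: float) -> str:
    """Build a short explanation from simple, inspectable rules."""
    reasons = []
    shared = sorted(tokens(query) & tokens(chunk_text))
    if shared:
        reasons.append("shares terms: " + ", ".join(shared))
    name_hits = sorted(tokens(query) & tokens(file_name))
    if name_hits:
        reasons.append("file name mentions: " + ", ".join(name_hits))
    if not reasons:
        reasons.append("no shared keywords; matched on embedding similarity alone")
    if score >= 0.6:  # assumed threshold, not taken from the project
        reasons.append("high similarity score")
    return "; ".join(reasons)
```

Because every rule is a plain string check, the explanation is deterministic and works offline, which is the trade-off this decision makes against richer LLM-written explanations.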

Architecture highlights

  • Local-first retrieval pipeline: scanner → parser → chunker → embedder → SQLite + ChromaDB
  • SQLite stores folder, file, chunk, and index run metadata while ChromaDB stores embeddings
  • Incremental indexing uses SHA-256 content hashes to skip unchanged files
  • Text-based PDF support is isolated in the parser layer; scanned (image-only) PDFs fail with an explicit error
  • Search results include score, preview, and rule-based explanation instead of opaque output
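
The hash-based skip in incremental indexing can be sketched with the standard library alone. The `files` table layout and the function names below are assumptions; the project's real schema also tracks folders, chunks, and index runs.

```python
import hashlib
import sqlite3
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's bytes in blocks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def open_db(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the metadata store with a hypothetical one-table schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
    )
    return conn


def needs_reindex(conn: sqlite3.Connection, path: Path) -> bool:
    """Return False if the stored hash matches; otherwise record the new hash."""
    new_hash = sha256_of(path)
    row = conn.execute(
        "SELECT content_hash FROM files WHERE path = ?", (str(path),)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged content: skip parsing and embedding
    # Simplification: a real pipeline would likely store the hash only
    # after the file has been parsed and embedded successfully.
    conn.execute(
        "INSERT OR REPLACE INTO files (path, content_hash) VALUES (?, ?)",
        (str(path), new_hash),
    )
    conn.commit()
    return True
```

Hashing content rather than comparing modification times means a file that was touched or copied but not edited is still skipped.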

Outcomes

  • Indexes .md, .txt, and text-based .pdf files locally
  • Supports semantic search without exact keyword overlap
  • Skips unchanged files during re-indexing
  • Includes benchmark corpus, chunk-quality inspection, and test coverage
  • Verified the design target: fully offline use after the initial embedding model download
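
To show why search can succeed without keyword overlap, the sketch below ranks chunks by cosine similarity between vectors. The tiny hand-written vectors stand in for sentence-transformers embeddings, and the `rank` helper is hypothetical; in the tool itself ChromaDB performs the vector search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], chunk_vecs: dict[str, list[float]],
         top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, best match first."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D vectors (assumed values, not real embeddings).
toy_chunks = {"a": [2.0, 0.0], "b": [0.0, 3.0], "c": [1.0, 1.0]}
top = rank([1.0, 0.0], toy_chunks, top_k=2)
```

Ranking depends only on the direction of the vectors, so two texts with no words in common can still score highly when the embedding model places them close together.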