May 19, 2026
SemanticFS — Local-First Semantic Search for Personal Documents
SemanticFS is a local-first CLI tool that indexes Markdown, text, and text-based PDF files, then lets users search them by meaning instead of exact keywords. It keeps embeddings, metadata, and vector search fully on the user’s machine.
- Role: Solo developer. Designed and implemented the indexing pipeline, storage model, CLI commands, retrieval flow, and validation suite.
- Stack: Python · Typer · Rich · SQLite · ChromaDB · sentence-transformers · PyMuPDF · pytest
- Links: Repository
Problem
Keyword search fails when the user remembers the meaning of a note but not the exact wording, filename, or folder. Most semantic search tools also require cloud APIs, which makes them a poor fit for private notes, runbooks, and other local documents.
Solution
I built a Python CLI that scans local folders, extracts text, chunks documents, generates local embeddings, stores metadata in SQLite, stores vectors in ChromaDB, and returns ranked semantic search results with similarity scores and rule-based “why matched” explanations.
Decisions
- Used local sentence-transformers instead of cloud embeddings to keep private documents offline
- Used SQLite for inspectable metadata and ChromaDB for zero-config local vector search
- Chose a CLI-first implementation instead of a desktop UI to validate the retrieval engine first
- Added benchmark and chunk-quality checks before expanding into a filesystem watcher, OCR, or hybrid search
- Kept “why matched” rule-based instead of LLM-generated to avoid adding an external API dependency
Architecture highlights
- Local-first retrieval pipeline: scanner → parser → chunker → embedder → SQLite + ChromaDB
- SQLite stores folder, file, chunk, and index run metadata while ChromaDB stores embeddings
- Incremental indexing uses SHA-256 content hashes to skip unchanged files
- Text-based PDF support is separated in the parser layer; scanned PDFs fail explicitly
- Search results include score, preview, and rule-based explanation instead of opaque output
Outcomes
- Indexes .md, .txt, and text-based .pdf files locally
- Supports semantic search without exact keyword overlap
- Skips unchanged files during re-indexing
- Includes benchmark corpus, chunk-quality inspection, and test coverage
- Verified design target: offline usage after initial embedding model download
SemanticFS — Local-First Semantic Search for Personal Documents
Overview
SemanticFS is a local-first semantic search system designed for personal documents and knowledge archives. Instead of relying on exact keyword matching, the system allows users to search files based on meaning and contextual similarity.
The project focuses on offline retrieval infrastructure rather than cloud-dependent AI tooling. Documents are indexed locally, embeddings are generated on-device, and semantic retrieval is performed entirely on the user’s machine.
The system was built as a CLI-first retrieval engine with a modular indexing pipeline and inspectable storage architecture.
Problem
Traditional file search breaks down when users remember concepts instead of exact filenames or keywords.
For example:
- remembering “that note about incremental indexing”
- recalling a document by meaning instead of wording
- searching across fragmented folders and inconsistent naming structures
Most semantic retrieval tools also introduce additional problems:
- dependency on cloud APIs
- privacy concerns for personal documents
- opaque storage systems
- limited control over indexing behavior
I wanted a retrieval system that could:
- work fully offline
- remain inspectable and debuggable
- support incremental indexing
- separate metadata storage from vector retrieval
- provide semantic search without external services
Solution
I built a local semantic indexing pipeline in Python.
The system scans folders recursively, extracts text from supported files, chunks content into searchable segments, generates embeddings locally, and stores both metadata and vector representations for retrieval.
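The chunking step can be illustrated with a minimal sketch. The write-up does not state which chunking strategy SemanticFS uses, so the overlapping word window below, the function name `chunk_words`, and the default sizes are all assumptions chosen for illustration.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows.

    Hypothetical strategy: the actual chunker may split on headings,
    sentences, or tokens instead of whitespace-separated words.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


print(chunk_words("a b c d e", size=3, overlap=1))  # ['a b c', 'c d e']
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of embedding some words twice.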
The architecture intentionally separates concerns:
- SQLite handles structured metadata
- ChromaDB handles vector similarity search
- sentence-transformers generates embeddings locally
- the CLI orchestrates indexing and querying workflows
Search results return:
- similarity score
- chunk preview
- source file path
- rule-based “why matched” explanations
The project prioritizes transparency and controllability over abstraction-heavy AI workflows.
Architecture
Indexing Pipeline
Scanner → Parser → Chunker → Embedder → SQLite Metadata Store + ChromaDB Vector Store
Each stage is isolated to keep the system extensible and debuggable.
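One way to keep stages isolated is to treat each one as a plain callable and let a small coordinator wire them together. The sketch below is an assumption about how such a pipeline could be composed, not the project's actual code; the `Pipeline` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable


@dataclass
class Pipeline:
    """Hypothetical coordinator: every stage is an injectable function."""
    scan: Callable[[Path], Iterable[Path]]
    parse: Callable[[Path], str]
    chunk: Callable[[str], list[str]]
    embed: Callable[[list[str]], list[list[float]]]
    store: Callable[[Path, list[str], list[list[float]]], None]

    def run(self, root: Path) -> int:
        """Index every file under root; return how many files were stored."""
        indexed = 0
        for path in self.scan(root):
            chunks = self.chunk(self.parse(path))
            if not chunks:
                continue  # nothing searchable in this file
            self.store(path, chunks, self.embed(chunks))
            indexed += 1
        return indexed
```

Because each stage is injected, a test can swap the embedder for a stub that returns fixed vectors and exercise the scanner, parser, and chunker without loading a model.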
Metadata Layer
SQLite is used for:
- indexed file tracking
- chunk metadata
- indexing history
- file hashes
- incremental indexing logic
This keeps metadata queryable and easy to inspect manually.
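A minimal sketch of what such a metadata store could look like with Python's built-in `sqlite3` module follows. The table and column names are assumptions for illustration, and the folder and index-run tables are omitted for brevity; the real schema is not shown in this write-up.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id         INTEGER PRIMARY KEY,
    path       TEXT NOT NULL UNIQUE,
    sha256     TEXT NOT NULL,
    indexed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS chunks (
    id       INTEGER PRIMARY KEY,
    file_id  INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
    position INTEGER NOT NULL,
    text     TEXT NOT NULL
);
"""


def open_metadata(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata database and ensure the tables exist."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the chunk -> file link
    conn.executescript(SCHEMA)
    return conn


conn = open_metadata()
cur = conn.execute(
    "INSERT INTO files (path, sha256) VALUES (?, ?)",
    ("notes/indexing.md", "abc123"),
)
conn.execute(
    "INSERT INTO chunks (file_id, position, text) VALUES (?, ?, ?)",
    (cur.lastrowid, 0, "first chunk"),
)
# Manual inspection is an ordinary SQL query: chunk count per file.
rows = conn.execute(
    "SELECT f.path, COUNT(c.id) FROM files f "
    "LEFT JOIN chunks c ON c.file_id = f.id GROUP BY f.id"
).fetchall()
print(rows)  # [('notes/indexing.md', 1)]
```

The same query works from the `sqlite3` command-line shell, which is what makes the metadata easy to inspect without going through the tool itself.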
Vector Layer
ChromaDB stores embeddings separately from metadata.
This decision keeps retrieval concerns isolated from relational storage and allows future replacement of the vector backend without rewriting indexing logic.
Incremental Indexing
The system computes SHA-256 hashes for indexed files.
If a file has not changed since the previous indexing run, the system skips reprocessing entirely.
This avoids regenerating embeddings for unchanged files, which is typically the most expensive step when re-indexing a large document collection.
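The hash check can be sketched with the standard library alone. The function names and the `known_hashes` dictionary are assumptions; in SemanticFS the stored hashes live in SQLite, and a plain dictionary stands in for that lookup here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 16) -> str:
    """Hash a file's contents in blocks so large files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: Path, known_hashes: dict[str, str]) -> bool:
    """True if the file is new or its content hash differs from the stored one.

    known_hashes is a hypothetical stand-in for the hashes kept in SQLite.
    """
    return known_hashes.get(str(path)) != sha256_of(path)
```

Hashing content rather than comparing modification times means a file that was touched or copied without changes is still skipped.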
PDF Handling
The parser currently supports text-based PDFs only.
Scanned or image-only PDFs fail with an explicit error message instead of silently producing empty results.
This was a deliberate design decision to avoid unreliable OCR assumptions inside the indexing pipeline.
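The explicit-failure rule can be sketched independently of the PDF library. In the sketch below, `page_texts` is assumed to be the per-page output of a text extractor such as PyMuPDF's `page.get_text()`; the exception class, the function name, and the message wording are hypothetical.

```python
class ScannedPdfError(ValueError):
    """Raised when a PDF yields no extractable text (likely scanned)."""


def join_pdf_text(path: str, page_texts: list[str]) -> str:
    """Combine per-page text, failing loudly if every page is empty.

    page_texts is assumed to come from a text extractor; this function
    does not open the PDF itself.
    """
    pages = [text.strip() for text in page_texts if text.strip()]
    if not pages:
        raise ScannedPdfError(
            f"{path}: no extractable text; scanned or image-only PDFs "
            "are not supported (no OCR)"
        )
    return "\n\n".join(pages)
```

Raising a dedicated exception lets the indexing summary report the file as a parser failure instead of recording it as an empty document.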
Key Engineering Decisions
Local Embeddings Instead of Cloud APIs
I intentionally avoided hosted embedding APIs such as OpenAI's.
Reasons:
- offline usage
- privacy
- deterministic local behavior
- no per-query cost
- reduced infrastructure dependency
SQLite + Chroma Separation
Instead of storing everything inside a single vector database, I separated:
- structured metadata
- semantic vectors
This improves inspectability and keeps retrieval architecture modular.
CLI-First Product Direction
I chose to validate the retrieval engine before building a desktop or web interface.
The CLI-first approach made it easier to:
- debug indexing
- benchmark retrieval quality
- inspect chunk outputs
- iterate on architecture quickly
Rule-Based Match Explanations
Search results include lightweight “why matched” explanations.
I intentionally kept this deterministic instead of using LLM-generated explanations to avoid:
- extra latency
- hallucinated reasoning
- additional API dependencies
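A deterministic explanation can be as simple as a few string rules applied to the query and the matched chunk. The rules below (shared terms, filename hits, and a fallback) are assumptions for illustration; the write-up does not list the actual rules SemanticFS applies.

```python
import re


def why_matched(query: str, chunk: str, filename: str) -> str:
    """Build a rule-based explanation for one search hit (hypothetical rules)."""
    # Ignore very short tokens so words like "a" or "of" do not count as matches.
    terms = {t for t in re.findall(r"\w+", query.lower()) if len(t) > 2}
    chunk_terms = set(re.findall(r"\w+", chunk.lower()))

    reasons = []
    shared = sorted(terms & chunk_terms)
    if shared:
        reasons.append("shares terms: " + ", ".join(shared))
    name_hits = sorted(t for t in terms if t in filename.lower())
    if name_hits:
        reasons.append("filename contains: " + ", ".join(name_hits))
    return "; ".join(reasons) or "no keyword overlap; matched on embedding similarity"


print(why_matched("car repair", "Fixing an automobile engine", "garage.txt"))
# no keyword overlap; matched on embedding similarity
```

The fallback message matters for a semantic search tool: it tells the user that a hit with no shared words is expected behavior rather than a ranking error.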
Validation & Benchmarking
The project includes:
- retrieval benchmarks
- chunk inspection tooling
- indexing summaries
- parser failure reporting
- an automated pytest test suite
This was important because semantic retrieval systems can appear functional while producing poor chunk quality or unstable matches internally.
The project focuses on validating retrieval behavior rather than only producing demo outputs.
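One common way to validate retrieval behavior is recall@k over a small labelled corpus: for each benchmark query, check whether the expected file appears among the top k results. The sketch below assumes that metric and a hypothetical data shape; the write-up does not say which metric the project's benchmark reports.

```python
def recall_at_k(
    results: dict[str, list[str]],
    expected: dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected file appears in the top k results.

    results maps each query to its ranked file paths; expected maps each
    query to the single file it should retrieve (hypothetical data shape).
    """
    if not expected:
        return 0.0
    hits = sum(
        1 for query, path in expected.items()
        if path in results.get(query, [])[:k]
    )
    return hits / len(expected)


ranked = {"skip unchanged files": ["setup.md", "indexing.md"], "pdf errors": ["cli.md"]}
truth = {"skip unchanged files": "indexing.md", "pdf errors": "parser.md"}
print(recall_at_k(ranked, truth, k=2))  # 0.5
```

Tracking a number like this across changes to chunk size or embedding model makes regressions visible that a handful of demo queries would hide.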
Current Limitations
Current limitations include:
- no OCR support for scanned PDFs
- no real-time filesystem watcher
- no hybrid keyword + semantic ranking
- no distributed indexing support
- CLI-only interface
These are deliberate tradeoffs at the current stage: the retrieval engine is being validated before the scope expands.
Future Directions
Potential future expansions include:
- OCR pipeline support
- hybrid retrieval ranking
- local desktop interface
- filesystem watcher daemon
- embedding model swapping
- semantic tagging pipelines
- retrieval feedback loops
The long-term direction is to evolve the project into controllable local retrieval infrastructure rather than a thin AI wrapper.
Reflection
This project was less about “adding AI to search” and more about designing a retrieval system with clear boundaries and infrastructure decisions.
The most important part of the project was not embedding generation itself, but:
- storage separation
- indexing strategy
- retrieval transparency
- incremental processing
- local-first architecture decisions
It became an exercise in building inspectable AI infrastructure instead of opaque automation.