May 19, 2026

SemanticFS — Local-First Semantic Search for Personal Documents

SemanticFS is a local-first CLI tool that indexes Markdown, text, and text-based PDF files, then lets users search them by meaning instead of exact keywords. It keeps embeddings, metadata, and vector search fully on the user’s machine.

Role
Solo developer — designed and implemented the indexing pipeline, storage model, CLI commands, retrieval flow, and validation suite.
Stack
Python · Typer · Rich · SQLite · ChromaDB · sentence-transformers · PyMuPDF · pytest

Problem

Keyword search fails when the user remembers the meaning of a note but not the exact wording, filename, or folder. Most semantic search tools also require cloud APIs, which is not ideal for private notes, runbooks, and local documents.

Solution

I built a Python CLI that scans local folders, extracts text, chunks documents, generates local embeddings, stores metadata in SQLite, stores vectors in ChromaDB, and returns ranked semantic search results with similarity scores and rule-based “why matched” explanations.

Decisions

  • Used local sentence-transformers instead of cloud embeddings to keep private documents offline
  • Used SQLite for inspectable metadata and ChromaDB for zero-config local vector search
  • Chose CLI-first implementation instead of desktop UI to validate the retrieval engine first
  • Added benchmark and chunk-quality checks before expanding into watcher, OCR, or hybrid search
  • Kept “why matched” rule-based instead of LLM-generated to avoid adding external API dependency

Architecture highlights

  • Local-first retrieval pipeline: scanner → parser → chunker → embedder → SQLite + ChromaDB
  • SQLite stores folder, file, chunk, and index run metadata while ChromaDB stores embeddings
  • Incremental indexing uses SHA-256 content hashes to skip unchanged files
  • Text-based PDF support is separated in the parser layer; scanned PDFs fail explicitly
  • Search results include score, preview, and rule-based explanation instead of opaque output

Outcomes

  • Indexes .md, .txt, and text-based .pdf files locally
  • Supports semantic search without exact keyword overlap
  • Skips unchanged files during re-indexing
  • Includes benchmark corpus, chunk-quality inspection, and test coverage
  • Verified design target: offline usage after initial embedding model download

Overview

SemanticFS is a local-first semantic search system designed for personal documents and knowledge archives. Instead of relying on exact keyword matching, the system allows users to search files based on meaning and contextual similarity.

The project focuses on offline retrieval infrastructure rather than cloud-dependent AI tooling. Documents are indexed locally, embeddings are generated on-device, and semantic retrieval is performed entirely on the user’s machine.

The system was built as a CLI-first retrieval engine with a modular indexing pipeline and inspectable storage architecture.


Problem

Traditional file search breaks down when users remember concepts instead of exact filenames or keywords.

For example:

  • remembering “that note about incremental indexing”
  • recalling a document by meaning instead of wording
  • searching across fragmented folders and inconsistent naming structures

Most semantic retrieval tools also introduce additional problems:

  • dependency on cloud APIs
  • privacy concerns for personal documents
  • opaque storage systems
  • limited control over indexing behavior

I wanted a retrieval system that could:

  • work fully offline
  • remain inspectable and debuggable
  • support incremental indexing
  • separate metadata storage from vector retrieval
  • provide semantic search without external services

Solution

I built a local semantic indexing pipeline in Python.

The system scans folders recursively, extracts text from supported files, chunks content into searchable segments, generates embeddings locally, and stores both metadata and vector representations for retrieval.

The architecture intentionally separates concerns:

  • SQLite handles structured metadata
  • ChromaDB handles vector similarity search
  • sentence-transformers generates embeddings locally
  • the CLI orchestrates indexing and querying workflows

Search results return:

  • similarity score
  • chunk preview
  • source file path
  • rule-based “why matched” explanations

The project prioritizes transparency and controllability over abstraction-heavy AI workflows.


Architecture

Indexing Pipeline

Scanner
  → Parser
  → Chunker
  → Embedder
  → SQLite Metadata Store
  → ChromaDB Vector Store

Each stage is isolated to keep the system extensible and debuggable.
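
To illustrate what one isolated stage can look like, here is a minimal sketch of a chunker. It is not the project's actual chunker: the function name `chunk_text`, the word-based windows, and the default sizes are assumptions chosen only for this example.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows (hypothetical strategy)."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

Because the function takes a string and returns a list of strings, it can be tested and tuned without touching the scanner, the embedder, or either store.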


Metadata Layer

SQLite is used for:

  • indexed file tracking
  • chunk metadata
  • indexing history
  • file hashes
  • incremental indexing logic

This keeps metadata queryable and easy to inspect manually.
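
A minimal sketch of what such a metadata store could look like, using only Python's built-in sqlite3 module. The table names, column names, and helper functions below are assumptions for illustration, not the project's real schema.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: one row per indexed file, one row per chunk.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id         INTEGER PRIMARY KEY,
    path       TEXT NOT NULL UNIQUE,
    sha256     TEXT NOT NULL,
    indexed_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS chunks (
    id       INTEGER PRIMARY KEY,
    file_id  INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
    position INTEGER NOT NULL,
    preview  TEXT NOT NULL
);
"""


def open_metadata_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata database and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def chunk_counts(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (path, number of chunks) per file, ordered by path."""
    return conn.execute(
        "SELECT f.path, COUNT(c.id) FROM files f "
        "LEFT JOIN chunks c ON c.file_id = f.id "
        "GROUP BY f.id ORDER BY f.path"
    ).fetchall()
```

With a layout like this, a question such as "which files produced no chunks?" can be answered with one SQL query in any SQLite client, which is the kind of manual inspection the metadata layer is meant to allow.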


Vector Layer

ChromaDB stores embeddings separately from metadata.

This decision keeps retrieval concerns isolated from relational storage and allows future replacement of the vector backend without rewriting indexing logic.
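
One way to keep the backend replaceable is to have the indexing code depend on a small interface rather than on ChromaDB directly. The sketch below is an assumption about how that boundary could be drawn: `VectorStore` and `InMemoryVectorStore` are hypothetical names, and the in-memory class is a stand-in for tests, not a ChromaDB wrapper.

```python
import math
from typing import Dict, List, Protocol, Sequence, Tuple


class VectorStore(Protocol):
    """Hypothetical interface the indexer and search code would depend on."""

    def add(self, chunk_id: str, vector: Sequence[float]) -> None: ...

    def query(self, vector: Sequence[float], k: int) -> List[Tuple[str, float]]: ...


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force cosine search; useful as a test double for the real backend."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, chunk_id: str, vector: Sequence[float]) -> None:
        self._vectors[chunk_id] = list(vector)

    def query(self, vector: Sequence[float], k: int) -> List[Tuple[str, float]]:
        scored = [(cid, _cosine(vector, v)) for cid, v in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

A ChromaDB-backed class exposing the same two methods could then be swapped in without changes to the code that calls `add` and `query`.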


Incremental Indexing

The system computes SHA-256 hashes for indexed files.

If a file has not changed since the previous indexing run, the system skips reprocessing entirely.

This avoids unnecessary embedding generation, so the embedding cost of a re-index depends on how many files changed, not on the size of the collection.
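
The skip check can be sketched with the standard library alone. The function names and the `known_hashes` dictionary are assumptions for illustration; in the project, the stored hashes would come from the SQLite metadata layer.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large files are never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: Path, known_hashes: Dict[str, str]) -> bool:
    """True if the file is new or its content hash differs from the stored one."""
    return known_hashes.get(str(path)) != sha256_of_file(path)
```

Hashing content rather than comparing modification times means a file that was touched but not edited is still skipped.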


PDF Handling

The parser currently supports text-based PDFs only.

Scanned/image-only PDFs intentionally fail with explicit messaging instead of silently producing empty results.

This was a deliberate design decision to avoid unreliable OCR assumptions inside the indexing pipeline.
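
The explicit-failure behavior can be sketched as a guard that runs after text extraction. To keep the example runnable without a PDF library, it takes the already extracted per-page strings as input (in the real parser these would come from PyMuPDF). The exception name, function name, and `min_chars` threshold are hypothetical.

```python
from typing import Iterable, List


class ScannedPdfError(ValueError):
    """Raised when a PDF yields no extractable text (hypothetical name)."""


def require_text_layer(
    page_texts: Iterable[str], source: str, min_chars: int = 1
) -> List[str]:
    """Return stripped page texts, or fail loudly if the PDF has no text layer."""
    pages = [text.strip() for text in page_texts]
    total_chars = sum(len(text) for text in pages)
    if total_chars < min_chars:
        raise ScannedPdfError(
            f"{source}: no extractable text found; "
            "scanned or image-only PDFs are not supported (no OCR)"
        )
    return pages
```

The indexer can catch this one exception type, record the file as failed in its run summary, and continue with the remaining files.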


Key Engineering Decisions

Local Embeddings Instead of Cloud APIs

I intentionally avoided OpenAI-style embedding APIs.

Reasons:

  • offline usage
  • privacy
  • deterministic local behavior
  • no per-query cost
  • reduced infrastructure dependency

SQLite + Chroma Separation

Instead of storing everything inside a single vector database, I separated:

  • structured metadata
  • semantic vectors

This improves inspectability and keeps retrieval architecture modular.


CLI-First Product Direction

I chose to validate the retrieval engine before building a desktop or web interface.

The CLI-first approach made it easier to:

  • debug indexing
  • benchmark retrieval quality
  • inspect chunk outputs
  • iterate on architecture quickly

Rule-Based Match Explanations

Search results include lightweight “why matched” explanations.

I intentionally kept this deterministic instead of using LLM-generated explanations to avoid:

  • extra latency
  • hallucinated reasoning
  • additional API dependencies
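
A minimal sketch of what deterministic explanations could look like. The specific rules (shared terms, terms found in the file path, a fallback for purely semantic matches), the stopword list, and the function name `explain_match` are assumptions for illustration, not the project's actual rule set.

```python
import re
from typing import List, Set

# Small illustrative stopword list; a real one would be longer.
_STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "about", "that", "on", "is"}


def _terms(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in _STOPWORDS}


def explain_match(query: str, chunk: str, path: str, score: float) -> List[str]:
    """Return human-readable reasons for a match using fixed rules only."""
    reasons: List[str] = []
    query_terms = _terms(query)

    shared = sorted(query_terms & _terms(chunk))
    if shared:
        reasons.append("shares terms: " + ", ".join(shared))

    in_path = sorted(query_terms & _terms(path))
    if in_path:
        reasons.append("file path mentions: " + ", ".join(in_path))

    if not reasons:
        reasons.append(f"semantic similarity only (score {score:.2f}), no keyword overlap")
    return reasons
```

The same inputs always produce the same explanation, and the fallback rule states plainly when a result was returned on embedding similarity alone.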

Validation & Benchmarking

The project includes:

  • retrieval benchmarks
  • chunk inspection tooling
  • indexing summaries
  • parser failure reporting
  • automated test suite (pytest)

This was important because semantic retrieval systems can appear functional while producing poor chunk quality or unstable matches internally.

The project focuses on validating retrieval behavior rather than only producing demo outputs.
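
As a sketch of how retrieval behavior can be checked rather than eyeballed, the function below computes a hit rate at k over a small set of labeled queries. The `search` callable signature and the query-to-path mapping are assumptions for the example; the project's actual benchmark format is not shown here.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(
    search: Callable[[str, int], List[str]],
    expected: Dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected file appears in the top-k result paths."""
    if not expected:
        raise ValueError("expected must contain at least one labeled query")
    hits = sum(1 for query, path in expected.items() if path in search(query, k))
    return hits / len(expected)
```

Running a check like this after every change to chunking or the embedding model makes regressions visible as a number instead of an impression.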


Current Limitations

Current limitations include:

  • no OCR support for scanned PDFs
  • no real-time filesystem watcher
  • no hybrid keyword + semantic ranking
  • no distributed indexing support
  • CLI-only interface

These limitations are intentional tradeoffs at the current stage of the architecture.


Future Directions

Potential future expansions include:

  • OCR pipeline support
  • hybrid retrieval ranking
  • local desktop interface
  • filesystem watcher daemon
  • embedding model swapping
  • semantic tagging pipelines
  • retrieval feedback loops

The long-term direction is to evolve the project into controllable local retrieval infrastructure rather than a thin AI wrapper.


Reflection

This project was less about “adding AI to search” and more about designing a retrieval system with clear boundaries and infrastructure decisions.

The most important part of the project was not embedding generation itself, but:

  • storage separation
  • indexing strategy
  • retrieval transparency
  • incremental processing
  • local-first architecture decisions

It became an exercise in building inspectable AI infrastructure instead of opaque automation.
