May 19, 2026

SemanticFS — Local-First Semantic Search for Personal Documents

SemanticFS is a local-first CLI tool that indexes Markdown, text, and text-based PDF files, then lets users search them by meaning instead of exact keywords. It keeps embeddings, metadata, and vector search fully on the user’s machine.

#AI #Search #Local-First

Role: Solo developer — designed and implemented the indexing pipeline, storage model, CLI commands, retrieval flow, and validation suite.
Stack: Python · Typer · Rich · SQLite · ChromaDB · sentence-transformers · PyMuPDF · pytest
Links: Repository

Problem

Keyword search fails when the user remembers the meaning of a note but not the exact wording, filename, or folder. Most semantic search tools also require cloud APIs, which is not ideal for private notes, runbooks, and local documents.

Solution

I built a Python CLI that scans local folders, extracts text, chunks documents, generates local embeddings, stores metadata in SQLite, stores vectors in ChromaDB, and returns ranked semantic search results with similarity scores and rule-based “why matched” explanations.

Decisions

Used local sentence-transformers instead of cloud embeddings to keep private documents offline
Used SQLite for inspectable metadata and ChromaDB for zero-config local vector search
Chose CLI-first implementation instead of desktop UI to validate the retrieval engine first
Added benchmark and chunk-quality checks before expanding into a filesystem watcher, OCR, or hybrid search
Kept “why matched” rule-based instead of LLM-generated to avoid adding an external API dependency

Architecture highlights

Local-first retrieval pipeline: scanner → parser → chunker → embedder → SQLite + ChromaDB
SQLite stores folder, file, chunk, and index run metadata while ChromaDB stores embeddings
Incremental indexing uses SHA-256 content hashes to skip unchanged files
Text-based PDF support is separated in the parser layer; scanned PDFs fail explicitly
Search results include score, preview, and rule-based explanation instead of opaque output

Outcomes

Indexes .md, .txt, and text-based .pdf files locally
Supports semantic search without exact keyword overlap
Skips unchanged files during re-indexing
Includes benchmark corpus, chunk-quality inspection, and test coverage
Design target, verified: fully offline usage after the initial embedding model download

SemanticFS — Local-First Semantic Search for Personal Documents

Overview

SemanticFS is a local-first semantic search system designed for personal documents and knowledge archives. Instead of relying on exact keyword matching, the system allows users to search files based on meaning and contextual similarity.

The project focuses on offline retrieval infrastructure rather than cloud-dependent AI tooling. Documents are indexed locally, embeddings are generated on-device, and semantic retrieval is performed entirely on the user’s machine.

The system was built as a CLI-first retrieval engine with a modular indexing pipeline and inspectable storage architecture.

Preview

[Screenshot: SemanticFS terminal search result showing ranked semantic matches, similarity scores, file paths, and why-matched explanations]

Problem

Traditional file search breaks down when users remember concepts instead of exact filenames or keywords.

For example:

remembering “that note about incremental indexing”
recalling a document by meaning instead of wording
searching across fragmented folders and inconsistent naming structures

Most semantic retrieval tools also introduce additional problems:

dependency on cloud APIs
privacy concerns for personal documents
opaque storage systems
limited control over indexing behavior

I wanted a retrieval system that could:

work fully offline
remain inspectable and debuggable
support incremental indexing
separate metadata storage from vector retrieval
provide semantic search without external services

Solution

I built a local semantic indexing pipeline in Python.

The system scans folders recursively, extracts text from supported files, chunks content into searchable segments, generates embeddings locally, and stores both metadata and vector representations for retrieval.
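The chunking step could look roughly like the sketch below. It is an illustration only, not the project's actual chunker: the `Chunk` type, the `chunk_text` function, the word-based windowing, and the `max_words` and `overlap` parameters are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int  # position of the chunk within its document
    text: str   # content that would be passed to the embedder


def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list[Chunk]:
    """Split text into overlapping word windows (hypothetical strategy)."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    chunks: list[Chunk] = []
    step = max_words - overlap  # how far each window advances
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append(Chunk(index=len(chunks), text=" ".join(window)))
        if start + max_words >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding some words twice.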

The architecture intentionally separates concerns:

SQLite handles structured metadata
ChromaDB handles vector similarity search
sentence-transformers generates embeddings locally
the CLI orchestrates indexing and querying workflows

Search results return:

similarity score
chunk preview
source file path
rule-based “why matched” explanations

The project prioritizes transparency and controllability over abstraction-heavy AI workflows.
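As a rough illustration of the result shape listed above, a search hit can be modelled as a small record and rendered best-first. The `SearchResult` fields and the `format_results` helper are hypothetical names, and plain strings stand in here for whatever terminal rendering the CLI actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    score: float      # similarity score, higher means closer
    preview: str      # excerpt of the matched chunk
    path: str         # source file path
    why_matched: str  # rule-based explanation


def format_results(results: list[SearchResult], preview_width: int = 60) -> str:
    """Render results as plain text, highest score first."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    lines: list[str] = []
    for rank, r in enumerate(ranked, start=1):
        preview = r.preview
        if len(preview) > preview_width:
            # Truncate long previews so each result stays on three lines.
            preview = preview[:preview_width - 3] + "..."
        lines.append(f"{rank}. [{r.score:.3f}] {r.path}")
        lines.append(f"   {preview}")
        lines.append(f"   why: {r.why_matched}")
    return "\n".join(lines)
```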

Architecture

Indexing Pipeline

Scanner
  → Parser
  → Chunker
  → Embedder
  → SQLite Metadata Store + ChromaDB Vector Store

Each stage is isolated to keep the system extensible and debuggable.
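One way to picture stage isolation is a driver that only knows each stage as a callable. The sketch below is an assumption about structure, not the project's code: `run_pipeline`, `IndexedChunk`, and the stage signatures are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Vector = list[float]


@dataclass
class IndexedChunk:
    path: str
    index: int
    text: str
    vector: Vector


def run_pipeline(
    paths: Iterable[str],
    parse: Callable[[str], str],
    chunk: Callable[[str], list[str]],
    embed: Callable[[list[str]], list[Vector]],
    store: Callable[[list[IndexedChunk]], None],
) -> int:
    """Push each scanned path through parse, chunk, embed, and store.

    Returns the number of chunks stored.
    """
    total = 0
    for path in paths:
        pieces = chunk(parse(path))
        if not pieces:
            continue  # nothing to embed for empty documents
        vectors = embed(pieces)
        records = [
            IndexedChunk(path, i, text, vector)
            for i, (text, vector) in enumerate(zip(pieces, vectors))
        ]
        store(records)
        total += len(records)
    return total
```

Because every stage is passed in, a test can swap the embedder for a trivial function and still exercise the scanner, parser, and storage logic.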

Metadata Layer

SQLite is used for:

indexed file tracking
chunk metadata
indexing history
file hashes
incremental indexing logic

This keeps metadata queryable and easy to inspect manually.
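A minimal sketch of what such a metadata store might look like, using only the standard-library `sqlite3` module. The table and column names are assumptions chosen to mirror the list above, not the project's real schema.

```python
import sqlite3

# Hypothetical schema: one table each for files, chunks, and indexing runs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE,
    sha256 TEXT NOT NULL,
    indexed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    UNIQUE (file_id, chunk_index)
);
CREATE TABLE IF NOT EXISTS index_runs (
    id INTEGER PRIMARY KEY,
    started_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    files_indexed INTEGER NOT NULL DEFAULT 0,
    files_skipped INTEGER NOT NULL DEFAULT 0
);
"""


def open_metadata_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metadata database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # required for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    return conn
```

With a layout like this, manual inspection is a plain query, for example `SELECT path, sha256 FROM files ORDER BY indexed_at DESC;` in the `sqlite3` shell.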

Vector Layer

ChromaDB stores embeddings separately from metadata.

This decision keeps retrieval concerns isolated from relational storage and allows future replacement of the vector backend without rewriting indexing logic.
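The replaceable-backend idea can be sketched with a small interface. Everything here is hypothetical: the `VectorStore` protocol, its method names, and the brute-force `InMemoryVectorStore`, which stands in for ChromaDB so the example runs without extra dependencies.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """The only surface the rest of the system would depend on."""

    def add(self, chunk_id: str, vector: list[float]) -> None: ...
    def query(self, vector: list[float], k: int) -> list[tuple[str, float]]: ...


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Brute-force stand-in for a real backend such as ChromaDB."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._vectors[chunk_id] = vector

    def query(self, vector: list[float], k: int) -> list[tuple[str, float]]:
        scored = [(cid, cosine(vector, v)) for cid, v in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def search(store: VectorStore, query_vector: list[float], k: int = 5) -> list[str]:
    """Depends only on the protocol, so the backend can be swapped."""
    return [chunk_id for chunk_id, _ in store.query(query_vector, k)]
```

A ChromaDB-backed class exposing the same two methods could replace the in-memory one without touching `search` or the indexing code.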

Incremental Indexing

The system computes SHA-256 hashes for indexed files.

If a file has not changed since the previous indexing run, the system skips reprocessing entirely.

This avoids regenerating embeddings for unchanged content, so the embedding cost of a re-index scales with the number of changed files rather than the size of the collection.
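A minimal sketch of the hash check, assuming previously stored hashes are available as a path-to-digest mapping. The function names and the `known_hashes` dictionary are illustrative; in the real system the stored hashes would come from the metadata store.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, block_size: int = 1 << 20) -> str:
    """Hash file contents in blocks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: Path, known_hashes: dict[str, str]) -> bool:
    """True if the file is new or its content hash differs from the stored one."""
    return known_hashes.get(str(path)) != file_sha256(path)


def split_by_change(
    paths: list[Path], known_hashes: dict[str, str]
) -> tuple[list[Path], list[Path]]:
    """Partition paths into (to_index, skipped)."""
    to_index = [p for p in paths if needs_reindex(p, known_hashes)]
    skipped = [p for p in paths if p not in to_index]
    return to_index, skipped
```

Hashing content rather than comparing modification times means a file that was touched or copied without changes is still skipped.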

PDF Handling

The parser currently supports text-based PDFs only.

Scanned or image-only PDFs fail with an explicit error message instead of silently producing empty results.

This was a deliberate design decision to avoid unreliable OCR assumptions inside the indexing pipeline.
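The explicit-failure behaviour can be sketched as a check on the extracted text. This is an assumption about the approach: `ScannedPdfError`, `join_pdf_text`, and the `min_chars` threshold are hypothetical, and the per-page strings are assumed to come from a PDF library (for example PyMuPDF's `page.get_text()`), which is left out so the example stays self-contained.

```python
class ScannedPdfError(ValueError):
    """Raised when a PDF has no extractable text layer."""


def join_pdf_text(path: str, page_texts: list[str], min_chars: int = 1) -> str:
    """Combine per-page text, failing loudly if nothing was extracted."""
    cleaned = [text.strip() for text in page_texts]
    if sum(len(text) for text in cleaned) < min_chars:
        # An empty text layer is reported instead of indexed as an empty document.
        raise ScannedPdfError(
            f"{path}: no extractable text; the file looks scanned or image-only "
            "and OCR is not supported"
        )
    return "\n\n".join(text for text in cleaned if text)
```

The caller can catch `ScannedPdfError`, record the file as a parser failure, and continue with the rest of the folder.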

Key Engineering Decisions

Local Embeddings Instead of Cloud APIs

I intentionally avoided hosted embedding APIs such as OpenAI's.

Reasons:

offline usage
privacy
deterministic local behavior
no per-query cost
reduced infrastructure dependency

SQLite + Chroma Separation

Instead of storing everything inside a single vector database, I separated:

structured metadata
semantic vectors

This improves inspectability and keeps retrieval architecture modular.
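With the two stores separated, a query becomes a two-step lookup: the vector layer returns chunk ids with scores, and SQLite supplies the human-readable details. The sketch below assumes hypothetical `files` and `chunks` tables and an invented `hydrate_hits` helper.

```python
import sqlite3


def hydrate_hits(
    conn: sqlite3.Connection, hits: list[tuple[int, float]]
) -> list[dict]:
    """Attach file path and chunk text to (chunk_id, score) pairs."""
    if not hits:
        return []
    placeholders = ",".join("?" for _ in hits)
    rows = conn.execute(
        f"""
        SELECT c.id, f.path, c.text
        FROM chunks AS c JOIN files AS f ON f.id = c.file_id
        WHERE c.id IN ({placeholders})
        """,
        [chunk_id for chunk_id, _ in hits],
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Keep the vector layer's ranking; drop ids that SQLite no longer knows.
    return [
        {"path": by_id[cid][1], "text": by_id[cid][2], "score": score}
        for cid, score in hits
        if cid in by_id
    ]
```

Dropping unknown ids also makes stale vectors harmless if the two stores ever drift apart.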

CLI-First Product Direction

I chose to validate the retrieval engine before building a desktop or web interface.

The CLI-first approach made it easier to:

debug indexing
benchmark retrieval quality
inspect chunk outputs
iterate on architecture quickly

Rule-Based Match Explanations

Search results include lightweight “why matched” explanations.

I intentionally kept this deterministic instead of using LLM-generated explanations to avoid:

extra latency
hallucinated reasoning
additional API dependencies
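A deterministic explanation can be built from a handful of inspectable rules. The rules, the stopword list, and the 0.6 score threshold below are assumptions for illustration; the project's actual rules may differ.

```python
import re

# Small hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "about", "for", "in", "of", "on", "that", "the", "to"}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def why_matched(query: str, chunk_text: str, path: str, score: float) -> str:
    """Build an explanation from term overlap, path overlap, and score."""
    reasons: list[str] = []
    query_terms = tokens(query)
    shared = sorted(query_terms & tokens(chunk_text))
    if shared:
        reasons.append("shares terms: " + ", ".join(shared))
    in_path = sorted(query_terms & tokens(path))
    if in_path:
        reasons.append("path mentions: " + ", ".join(in_path))
    if not reasons:
        # No lexical overlap: fall back to describing the similarity score.
        label = "strong" if score >= 0.6 else "weak"
        reasons.append(f"{label} semantic similarity without shared terms")
    return "; ".join(reasons)
```

The same inputs always produce the same explanation, which makes the output easy to assert on in tests.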

Validation & Benchmarking

The project includes:

retrieval benchmarks
chunk inspection tooling
indexing summaries
parser failure reporting
automated tests (pytest)

This was important because semantic retrieval systems can appear functional while producing poor chunk quality or unstable matches internally.

The project focuses on validating retrieval behavior rather than only producing demo outputs.
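One simple way to measure retrieval behaviour is a hit rate over labelled queries. The sketch assumes a benchmark expressed as (query, expected file path) pairs and a `search` callable that returns ranked paths; both shapes and the `hit_rate_at_k` name are hypothetical.

```python
from typing import Callable


def hit_rate_at_k(
    cases: list[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected file appears in the top-k results."""
    if not cases:
        raise ValueError("benchmark needs at least one case")
    hits = sum(
        1 for query, expected in cases if expected in search(query, k)[:k]
    )
    return hits / len(cases)
```

Tracking this number across changes to chunking or the embedding model shows whether a change that looks fine in a demo actually degrades retrieval.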

Current Limitations

Current limitations include:

no OCR support for scanned PDFs
no real-time filesystem watcher
no hybrid keyword + semantic ranking
no distributed indexing support
CLI-only interface

These limitations are intentional tradeoffs at the current stage of the architecture.

Future Directions

Potential future expansions include:

OCR pipeline support
hybrid retrieval ranking
local desktop interface
filesystem watcher daemon
embedding model swapping
semantic tagging pipelines
retrieval feedback loops

The long-term direction is to evolve the project into controllable local retrieval infrastructure rather than a thin AI wrapper.

Reflection

This project was less about “adding AI to search” and more about designing a retrieval system with clear boundaries and infrastructure decisions.

The most important part of the project was not embedding generation itself, but:

storage separation
indexing strategy
retrieval transparency
incremental processing
local-first architecture decisions

It became an exercise in building inspectable AI infrastructure instead of opaque automation.
