May 19, 2026 · Status: Exploring
Agent Memory Lab
Agent Memory Lab is a research-style exploration of how AI systems retrieve, rank, and lose memory. The lab investigates semantic retrieval drift, recent context stabilization, summary compression loss, hybrid retrieval, and query-aware ranking policies through small evaluation-driven experiments.
Exploration note: this page captures an active experiment, so outcomes may be partial while the direction evolves.
Overview
The project started as a small semantic retrieval experiment, but gradually evolved into a broader investigation of retrieval policies, ranking strategies, memory compression tradeoffs, and context selection problems in AI agent systems.
Instead of building a production chatbot or AI wrapper, the lab focuses on understanding why memory retrieval succeeds or fails under different conditions.
The repository contains:
- semantic retrieval experiments
- hybrid retrieval experiments
- ranking policy iterations
- query-aware retrieval strategies
- evaluation scripts
- observations and failure logs
Why This Lab Exists
Modern AI systems often appear to “remember” information, but memory retrieval is much more complex than simply storing embeddings in a vector database.
During experimentation, several questions became important:
- Why does semantic retrieval drift into unrelated project contexts?
- Why do exact factual memories disappear while broader conceptual memories dominate?
- Why does recent conversational state stabilize retrieval quality?
- Why can high similarity scores still fail to provide the required information?
- Should all queries use the same retrieval strategy?
The goal of the lab became understanding these tradeoffs through small, evaluation-driven experiments.
Core Experiments
Semantic Retrieval
The first experiments tested pure semantic similarity retrieval using embeddings and cosine similarity.
This worked reasonably well for direct conceptual queries, but retrieval quality degraded when:
- the query became ambiguous
- multiple projects shared overlapping concepts
- exact factual details were required
One early observation was retrieval drift: queries containing terms such as “memory”, “retrieval”, or “semantic” sometimes returned adjacent project contexts instead of the intended memory.
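The pure semantic retrieval described in this section can be sketched roughly as follows. This is a minimal illustration, not the lab's actual code: the memory IDs, the three-dimensional vectors, and the `retrieve` helper are all hypothetical, and real embeddings would come from an embedding model with far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)

def retrieve(query_vec, memories, k=5):
    """Return the k memory IDs whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), mem_id)
              for mem_id, vec in memories.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem_id for _, mem_id in scored[:k]]

# Hypothetical toy embeddings, for illustration only.
memories = {
    "meos_memory_architecture": [0.9, 0.1, 0.0],
    "writing_ideas": [0.1, 0.9, 0.1],
    "career_positioning": [0.0, 0.2, 0.9],
}
top = retrieve([1.0, 0.2, 0.0], memories, k=2)
```

Because the ranking depends only on vector direction, any memory that shares vocabulary-level concepts with the query can score well, which is one plausible way the drift described above arises.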
Recent Context Stabilization
The lab then explored whether short-term conversational state could stabilize retrieval.
Adding recent context dramatically improved retrieval for ambiguous queries such as:
“How should I continue the system?”
Without recent context, the retrieval system mixed:
- MeOS planning
- career positioning
- writing ideas
- generic systems thinking
With recent context mentioning MeOS, retrieval became highly focused on:
- memory architecture
- planning systems
- orchestration
- scope control
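One way such stabilization could work is to blend the query embedding with the embeddings of recent conversation turns before ranking. The sketch below assumes that approach; the source does not say how the lab actually injects recent context, and `blend_with_recent`, `context_weight`, and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def blend_with_recent(query_vec, recent_vecs, context_weight=0.5):
    """Mix the query embedding with the mean embedding of recent turns."""
    if not recent_vecs:
        return list(query_vec)
    dim = len(query_vec)
    mean = [sum(v[i] for v in recent_vecs) / len(recent_vecs) for i in range(dim)]
    return [(1 - context_weight) * q + context_weight * m
            for q, m in zip(query_vec, mean)]

# Hypothetical toy axes: [MeOS, career, writing].
memories = {
    "meos_planning": [1.0, 0.0, 0.0],
    "career_positioning": [0.0, 1.0, 0.0],
    "writing_ideas": [0.0, 0.0, 1.0],
}
ambiguous_query = [0.5, 0.5, 0.5]   # equally close to every project
recent_turns = [[0.9, 0.1, 0.0]]    # the last turn mentioned MeOS

def best_match(vec):
    return max(memories, key=lambda m: cosine(vec, memories[m]))

blended = blend_with_recent(ambiguous_query, recent_turns)
focused = best_match(blended)
```

In this toy setup the unblended query scores identically against all three memories, while the blended query clearly prefers the MeOS memory.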
Summary Memory Compression
Another experiment explored summary-based memory compression.
A compressed summary memory successfully preserved high-level project meaning, but lost important factual details.
One important failure case was the query:
“What was the original name of MeOS?”
The summary received a high semantic similarity score even though it did not contain the actual answer.
This revealed an important insight:
Semantic similarity does not guarantee knowledge availability.
Importance-Based Ranking
The lab introduced ranking systems that combined:
- semantic similarity
- memory importance
- project relevance
- recency
This improved retrieval quality for active project workflows, especially in multi-project memory spaces.
However, new problems appeared: high-importance conceptual memories often suppressed smaller factual memories.
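A ranking of this kind can be sketched as a weighted sum of the four signals. The weights, field names, and candidate values below are assumptions chosen for illustration; the lab's actual formula is not given in the source.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    memory_id: str
    similarity: float     # semantic similarity to the query, 0..1
    importance: float     # assigned importance, 0..1
    project_match: float  # 1.0 if the memory belongs to the active project
    recency: float        # 1.0 = most recent, 0.0 = oldest

# Hypothetical weights; they sum to 1 so scores stay in 0..1.
WEIGHTS = {"similarity": 0.5, "importance": 0.2, "project_match": 0.2, "recency": 0.1}

def score(c, weights=WEIGHTS):
    return (weights["similarity"] * c.similarity
            + weights["importance"] * c.importance
            + weights["project_match"] * c.project_match
            + weights["recency"] * c.recency)

def rank(candidates):
    return sorted(candidates, key=score, reverse=True)

conceptual = Candidate("meos_architecture_overview", 0.70, 0.9, 1.0, 0.5)
factual = Candidate("meos_original_name", 0.75, 0.2, 1.0, 0.3)
ranked = rank([factual, conceptual])
```

Here the factual memory has the higher semantic similarity, yet the conceptual memory ranks first because of its importance score, which mirrors the suppression effect noted above.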
Hybrid Retrieval
To address factual retrieval failures, the lab introduced a simple hybrid retrieval strategy:
- semantic similarity
- keyword overlap
This improved exact-detail retrieval and reduced some semantic drift problems.
However, conceptual memories still competed with factual memories inside the same ranking space.
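As a rough sketch of the hybrid idea, the score below mixes a precomputed semantic similarity with the fraction of query tokens that appear in the memory text. The `alpha` mixing weight, the similarity values, and both memory texts are hypothetical, and the factual memory deliberately leaves the name elided.

```python
import re

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query, text):
    """Fraction of distinct query tokens that also occur in the memory text."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(text)) / len(query_tokens)

def hybrid_score(semantic, query, text, alpha=0.6):
    """alpha weights semantic similarity; (1 - alpha) weights keyword overlap."""
    return alpha * semantic + (1 - alpha) * keyword_overlap(query, text)

query = "What was the original name of MeOS?"
# (assumed semantic similarity, memory text); values are illustrative only.
candidates = {
    "summary": (0.82, "MeOS summary: memory architecture, planning systems, "
                      "orchestration, scope control."),
    "fact": (0.74, "The original name of MeOS was [name elided]."),
}

semantic_best = max(candidates, key=lambda k: candidates[k][0])
hybrid_best = max(candidates,
                  key=lambda k: hybrid_score(candidates[k][0], query, candidates[k][1]))
```

With these toy numbers, semantic similarity alone prefers the summary, while the hybrid score prefers the factual memory. Plain token overlap is a crude signal (it counts stop words such as "the" and "of"), so a weighting scheme like TF-IDF or BM25 would be a natural refinement.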
Query-Aware Adaptive Retrieval
One of the most important experiments introduced query-type-aware retrieval policies.
The system classified queries into:
- factual
- conceptual
- planning
Each query type used different ranking priorities.
Examples:
- factual queries increased keyword weight
- planning queries increased recency weight
- conceptual queries increased semantic similarity weight
This significantly improved retrieval behavior compared to using a single universal ranking strategy.
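The policy described in this section might look something like the sketch below: a query classifier selects a weight profile, and the profile determines how the signals are combined. The keyword heuristics, weight values, and signal names are assumptions for illustration; the source does not describe how the lab classifies queries.

```python
import re

# Hypothetical heuristics for routing a query to a type.
FACTUAL_STARTERS = {"what", "who", "when", "where", "which"}
PLANNING_CUES = {"continue", "next", "plan", "roadmap"}

# Hypothetical weight profiles, one per query type.
WEIGHT_PROFILES = {
    "factual": {"semantic": 0.3, "keyword": 0.6, "recency": 0.1},
    "planning": {"semantic": 0.3, "keyword": 0.2, "recency": 0.5},
    "conceptual": {"semantic": 0.7, "keyword": 0.2, "recency": 0.1},
}

def classify_query(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return "conceptual"
    if PLANNING_CUES & set(tokens):
        return "planning"
    if tokens[0] in FACTUAL_STARTERS:
        return "factual"
    return "conceptual"

def score(query, signals):
    """Combine per-memory signals using the profile for this query's type."""
    weights = WEIGHT_PROFILES[classify_query(query)]
    return sum(weights[name] * signals[name] for name in weights)

# A memory with an exact keyword match but only moderate semantic similarity.
signals = {"semantic": 0.5, "keyword": 1.0, "recency": 0.0}
factual_score = score("What was the original name of MeOS?", signals)
conceptual_score = score("Why does semantic retrieval drift?", signals)
```

The same memory scores higher under the factual profile than under the conceptual one, so exact-match memories get a better chance when the query asks for a specific fact.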
Key Findings
Semantic Similarity Is Not Memory
High semantic similarity scores can create the illusion that the required knowledge exists, even when critical details are missing.
Retrieval Is a Ranking Problem
Memory retrieval quality depends not just on embeddings, but heavily on:
- ranking policies
- retrieval weights
- project relevance
- context selection
Different Queries Need Different Retrieval Strategies
A single retrieval policy performed poorly across:
- factual lookup
- conceptual reasoning
- planning continuity
Adaptive retrieval strategies improved overall behavior.
Recent Context Matters
Recent conversational state plays a major role in resolving ambiguous references and maintaining continuity.
Evaluation
The lab introduced small evaluation-driven experiments using:
- expected memory IDs
- Hit@5
- Recall@5
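For reference, the two metrics can be computed as in the sketch below, using their common definitions: Hit@k is 1 if any expected memory appears in the top k results, and Recall@k is the fraction of expected memories that appear there. The lab's exact definitions are assumed to match these, and the memory IDs are made up.

```python
def hit_at_k(retrieved_ids, expected_ids, k=5):
    """1.0 if at least one expected memory is in the top k results, else 0.0."""
    top_k = set(retrieved_ids[:k])
    return 1.0 if top_k & set(expected_ids) else 0.0

def recall_at_k(retrieved_ids, expected_ids, k=5):
    """Fraction of expected memories that appear in the top k results."""
    expected = set(expected_ids)
    if not expected:
        return 0.0
    return len(set(retrieved_ids[:k]) & expected) / len(expected)

def average(values):
    return sum(values) / len(values) if values else 0.0

# Hypothetical evaluation case: ranked results and the expected memory IDs.
retrieved = ["m3", "m7", "m1", "m9", "m4", "m2"]
expected = ["m1", "m2", "m8"]

hit = hit_at_k(retrieved, expected)        # m1 is in the top 5
recall = recall_at_k(retrieved, expected)  # only m1 of the 3 expected is in the top 5
```

Averaging `recall_at_k` over all evaluation queries gives an average Recall@5 figure of the kind reported in this section.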
Early retrieval systems produced:
- semantic drift
- project bias
- factual retrieval failures
Later iterations improved retrieval quality through:
- dynamic project detection
- hybrid retrieval
- adaptive ranking policies
One notable improvement: average Recall@5 improved from approximately 0.51 to approximately 0.66 after introducing project-aware retrieval policies.
Retrieval Failures
Several failure cases became central to the lab:
Semantic Drift
Broad semantic concepts caused unrelated project memories to appear in retrieval results.
Similarity Illusion
High similarity scores often failed to guarantee the presence of the required factual information.
Factual Retrieval Weakness
Exact factual memories were frequently suppressed by broader conceptual memories with:
- higher importance
- stronger semantic density
- higher project relevance
Context Competition
Conceptual memories and factual memories often competed inside the same ranking space.
Current Direction
The lab is still exploratory.
Current focus areas include:
- retrieval policy design
- adaptive ranking systems
- query-aware memory orchestration
- balancing semantic and factual retrieval
- understanding retrieval failures in AI agent systems
The project intentionally avoids becoming a production AI application.
Its purpose is exploring the engineering tradeoffs behind AI memory retrieval systems.