Everyone Builds Storage
In October 2025, Mem0 raised $24 million in Series A funding. Their pitch: a memory layer for AI agents. Vector embeddings in, vector embeddings out. Store conversation history, extract atomic facts, retrieve them later by semantic similarity.
Same month, Cognee raised €7.5 million. Knowledge graphs for AI agents. Ingest data from 30+ source types, build structured relationships, query them at runtime.
Zep published a paper on temporal knowledge graphs — tracking not just what's true but when it was true. Letta (formerly MemGPT) rebuilt its entire agent loop for frontier models. Google shipped an Always-On Memory Agent on top of Gemini. AWS launched AgentCore Memory as a managed service. GitHub made Copilot Memory generally available.
Tens of millions of dollars. Multiple labs. A dozen startups. Hundreds of open-source contributors. All converging on the same problem: AI agents forget between sessions.
I wanted to understand what they built. So I looked at all of them.
Start at the top. The companies building the models are also building the memory.
Anthropic ships two things: CLAUDE.md (a static instruction file loaded at session start) and Auto Memory (the model writes notes to itself — build commands, patterns, architectural decisions, loaded automatically next session). Their engineering guidance frames memory as context engineering: "find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome." The philosophy is curation over accumulation.
OpenAI has saved memories (explicit, user-directed) and chat history (implicit, model-extracted). The consumer product remembers preferences and recent topics. The developer SDK has session history but no cross-session persistence — that's your problem.
Google is the most architecturally ambitious. They have the largest context window in the industry — one million tokens — and they're also building a separate memory agent on top of it. A Google PM open-sourced an Always-On Memory Agent in March 2026 that ingests information continuously, consolidates it during idle periods, and retrieves it without a vector database. Google has the biggest window AND doesn't think the window is enough. That's the clearest signal in the entire landscape.
Meta offers raw scale — Llama 4 Scout supports 10 million tokens — but no dedicated memory architecture. Their consumer product injects remembered facts into the prompt at runtime, consuming part of the window rather than extending it.
Four labs. Four approaches. But zoom out and squint: Anthropic stores facts in files. OpenAI stores facts in a database. Google stores facts in a memory agent. Meta stores facts in the prompt. Different architectures, same output: stored facts, retrieved later.
The startup ecosystem is more diverse architecturally. Same conclusion.
Mem0 ($24M, ~48K GitHub stars): Extracts memories from interactions, stores them in a dual-store (vector + knowledge graph), retrieves by semantic similarity. On LongMemEval — the most rigorous evaluation of long-horizon conversational memory — Mem0 scores 49%. It compresses roughly 101,600 tokens of conversation into about 2,900 tokens of extracted facts. That compression ratio tells you what the system values: atomic facts. Everything else is discarded.
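That extract-then-retrieve loop is easy to see in miniature. Below is a toy sketch, not Mem0's actual code: a keyword heuristic stands in for the LLM extraction step, and Jaccard word overlap stands in for embedding similarity.

```python
def extract_facts(turn):
    # Stand-in for the LLM extraction step: keep only declarative "atoms".
    return [s.strip() for s in turn.split(".") if " is " in s or " uses " in s]

def similarity(a, b):
    # Jaccard word overlap as a cheap stand-in for embedding cosine similarity.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

transcript = ("The project uses Postgres 15. I spent an hour debugging. "
              "The staging DB is read-only. Anyway, lunch was good.")
store = extract_facts(transcript)   # only the two atomic facts survive

query = "which database does the project use"
best = max(store, key=lambda f: similarity(query, f))
print(best)   # The project uses Postgres 15
```

Note what the store keeps and what it drops: the two facts come back; the hour of debugging, and everything around it, is discarded. That is the compression ratio made concrete.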
Zep (built on Graphiti): Temporal knowledge graph. Tracks when facts were true, not just what was true. When a user's address changes, the old version is invalidated but preserved. Up to 18.5% accuracy improvement over alternatives on LongMemEval. The most sophisticated approach to facts that change over time.
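The invalidate-but-preserve behavior is the core trick. A minimal sketch, assuming integer timestamps and a flat fact list in place of Graphiti's actual graph:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None   # None means "still current"

class TemporalFactStore:
    def __init__(self):
        self.facts: list[Fact] = []

    def assert_fact(self, subject, attribute, value, at):
        # Invalidate the currently-valid fact for this slot, but never delete it.
        for f in self.facts:
            if (f.subject, f.attribute, f.valid_to) == (subject, attribute, None):
                f.valid_to = at
        self.facts.append(Fact(subject, attribute, value, valid_from=at))

    def query(self, subject, attribute, at):
        # Return the value that was valid at time `at`, if any.
        for f in self.facts:
            if (f.subject == subject and f.attribute == attribute
                    and f.valid_from <= at and (f.valid_to is None or at < f.valid_to)):
                return f.value
        return None

store = TemporalFactStore()
store.assert_fact("user", "address", "12 Oak St", at=100)
store.assert_fact("user", "address", "9 Elm Ave", at=200)
print(store.query("user", "address", at=150))   # 12 Oak St  (old, preserved)
print(store.query("user", "address", at=250))   # 9 Elm Ave  (current)
```

The query at t=150 returns the old address, which an ordinary fact store would have overwritten. That is the whole value proposition: history as data, not as loss.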
Cognee (€7.5M): Pipeline-based extraction from 30+ data source types, building a queryable knowledge graph. Not semantic search — structured relationships between entities. Best for workflows where connections between facts matter.
Letta (formerly MemGPT): The most architecturally distinct. Treats the context window like RAM in an operating system. Core memory (always in context, agent-editable), recall memory (full interaction history on disk), archival memory (structured knowledge in external storage). Sleep-time compute consolidates memories asynchronously during idle periods. It's the full OS metaphor — main memory, swap, filesystem.
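Stripped of the agent loop, the tiering itself is simple. A sketch under stated assumptions: a fixed slot count stands in for Letta's token-based core-memory limit, and eviction goes straight to archival rather than through sleep-time consolidation.

```python
class TieredMemory:
    """Toy OS-style memory hierarchy for an agent."""
    def __init__(self, core_limit=3):
        self.core = []       # always in-context, agent-editable (like RAM)
        self.recall = []     # full interaction history (like swap/disk)
        self.archival = {}   # structured external knowledge (like a filesystem)
        self.core_limit = core_limit

    def remember(self, note):
        self.recall.append(note)                # everything lands in recall
        self.core.append(note)
        if len(self.core) > self.core_limit:
            evicted = self.core.pop(0)          # push oldest out of the window...
            self.archival[len(self.archival)] = evicted   # ...into external storage

    def context(self):
        # Only core memory is injected into the prompt each turn.
        return "\n".join(self.core)

mem = TieredMemory(core_limit=2)
for note in ["prefers pytest", "repo uses poetry", "CI runs on push"]:
    mem.remember(note)
print(mem.context())    # two newest notes only
print(len(mem.recall))  # 3: the full history is preserved off-window
```

The context window stays bounded while nothing is ever truly lost — the OS metaphor in four short methods.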
Hindsight (~4K stars): Multi-strategy retrieval — semantic search, BM25 keyword, entity graph, and temporal indexing — synthesized by an LLM reflection step. Claims 91.4% retrieval accuracy on LongMemEval. The highest reported.
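Multi-strategy retrieval ultimately comes down to merging ranked lists. A sketch with two toy strategies (word overlap standing in for BM25, list position standing in for temporal indexing) merged with reciprocal rank fusion; the real system layers semantic search, an entity graph, and an LLM reflection step on top.

```python
def keyword_score(query, doc):
    # Shared-word count as a crude stand-in for BM25.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def reciprocal_rank_fusion(rankings, k=60):
    # Each document scores the sum of 1/(k + rank) across every ranked list.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

memories = [
    "switched the build to vite last month",   # oldest
    "user prefers tabs over spaces",
    "the vite config lives in packages/web",   # newest
]
query = "where is the vite config"

by_keyword = sorted(memories, key=lambda m: -keyword_score(query, m))
by_recency = list(reversed(memories))          # newest first
fused = reciprocal_rank_fusion([by_keyword, by_recency])
print(fused[0])   # the vite config lives in packages/web
```

Fusion rewards memories that rank well under several independent signals, which is why the multi-strategy approach can beat any single retriever.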
Five startups. Seven architectural families across the broader ecosystem: vector stores, knowledge graphs, temporal graphs, OS-inspired tiers, multi-strategy retrieval, key-value stores, managed services. Wildly different engineering. And every single one produces the same output: stored facts, retrieved later.
The open-source ecosystem around coding agents is smaller and scrappier. Same pattern.
claude-mem: Captures tool-usage observations during sessions, compresses them, injects relevant context into future sessions. SQLite + Chroma vector database. Passive capture — but the agent has to call the search tool to retrieve. If it doesn't, relevant context is silently lost.
memsearch (Zilliz): Markdown-first memory. Shell hooks plus a background watcher. Memories auto-injected into every prompt. Trade-off: always consumes context space.
ContextForge: Cloud-hosted MCP server. Organizes context into Projects, Spaces, and Documents. Semantic search, git integration, task tracking.
mcp-memory-service: REST API + knowledge graph + autonomous consolidation. Cross-agent knowledge sharing via graph.
Four community tools. Shell hooks, vector databases, knowledge graphs, MCP servers, markdown files. Different mechanisms for the same thing: get facts from last session into this session.
The coding assistants handle it too.
Cursor: .cursorrules (static instructions), @codebase (semantic repo index), Notepad (persistent context surviving across sessions). Community workaround for context drift: start a new chat after 20 messages.
Aider: The repo map — tree-sitter extracts symbol definitions, a graph algorithm weights files by dependency centrality, the map is dynamically sized against a token budget. The most technically rigorous approach to structural facts. Auto-generated from code rather than hand-maintained.
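The shape of that algorithm, minus tree-sitter: a sketch in which inbound-reference counts stand in for Aider's actual graph ranking, and whitespace word counts stand in for its tokenizer.

```python
def repo_map(deps, symbol_lines, token_budget):
    """deps: {file: [files it imports]}. Weight files by how many others depend on them."""
    inbound = {f: 0 for f in deps}
    for f, imports in deps.items():
        for target in imports:
            inbound[target] = inbound.get(target, 0) + 1
    # Most-depended-on files first; greedily fill the token budget.
    ranked = sorted(deps, key=lambda f: -inbound[f])
    picked, used = [], 0
    for f in ranked:
        cost = len(symbol_lines[f].split())   # crude token estimate
        if used + cost <= token_budget:
            picked.append(f)
            used += cost
    return picked

deps = {
    "utils.py": [],
    "db.py": ["utils.py"],
    "api.py": ["db.py", "utils.py"],
}
symbols = {
    "utils.py": "def slugify(s) def retry(fn, n)",
    "db.py": "class Session def connect(url)",
    "api.py": "def create_app() def handle(req)",
}
picked = repo_map(deps, symbols, token_budget=10)
print(picked)   # ['utils.py', 'db.py'] — the most central files that fit
```

Under the budget, the map keeps `utils.py` (everything depends on it) and drops the leaf. Centrality decides what the model sees, and the budget decides how much.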
Copilot: Memory is now GA. Agent-driven fact discovery — Copilot finds and stores coding conventions, architectural patterns, cross-file dependencies. Citation validation: before using a memory, Copilot checks the source against the current codebase. Stale memories aren't applied. 28-day auto-expiry. The most mature native implementation.
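Citation validation plus expiry is cheap to express. A hypothetical sketch, not Copilot's actual implementation: each memory carries the snippet it was derived from, and is applied only if that snippet still exists in the current codebase and the memory is under 28 days old.

```python
import time

def is_applicable(memory, codebase, now, max_age_days=28):
    """Apply a memory only if it hasn't expired and its citation still checks out."""
    if now - memory["created"] > max_age_days * 86400:
        return False                                # auto-expiry
    source = codebase.get(memory["source_file"], "")
    return memory["cited_snippet"] in source        # validate against current code

codebase = {"style.md": "We use snake_case for all function names."}
memory = {
    "fact": "functions use snake_case",
    "source_file": "style.md",
    "cited_snippet": "snake_case for all function names",
    "created": time.time(),
}
ok_before = is_applicable(memory, codebase, now=time.time())   # True: citation holds
codebase["style.md"] = "We switched to camelCase."             # the codebase moves on
ok_after = is_applicable(memory, codebase, now=time.time())    # False: stale, not applied
print(ok_before, ok_after)
```

The stale memory isn't deleted; it simply fails its check and is never injected — the memory system defers to the codebase as the source of truth.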
Windsurf (acquired by Cognition/Devin): The one outlier worth noting. Cascade tracks all IDE actions in real time — file edits, terminal runs, clipboard content, navigation history. Dynamic context, not stored context. It observes what the developer is doing rather than asking them to explain it. The closest any tool comes to capturing live state rather than stored facts.
But even Windsurf's dynamic context lives in the current session's window. It doesn't persist across restarts.
Line them up. Labs, startups, community tools, coding assistants. Different teams, different funding, different architectures, different markets. Apply a simple four-layer framework to each one and the same picture appears.
- Layer 1: Facts — what the project is, naming conventions, architectural decisions. Static, low-churn.
- Layer 2: Reasoning — why decisions were made, trade-offs, constraints. Medium-churn, hard to capture automatically.
- Layer 3: Intent — what the current task is trying to accomplish, the goal behind the goal. Session-scoped, highly perishable.
- Layer 4: Interpretive state — what was tried, what failed, where the agent's thinking currently is. Highest churn. Disappears on compaction.
Every tool covers Layer 1. A few cover Layer 2 — only if the developer manually writes reasoning into instruction files. Almost none touch Layer 3. None solve Layer 4.
The distribution isn't gradual. It's a cliff.
And the benchmarks reinforce the gap. LongMemEval tests factual recall. The long-context vs memory paper (March 2026) measures accuracy — can you retrieve the right fact? Long-context gets 92.85% on LoCoMo, memory systems get 57.68%. Both test the same thing: given a fact that was stored, can you get it back?
Neither asks: once you get it back, does the agent do anything different?
The industry measures what it builds. It builds what it measures. And both are Layer 1.
Here's what this costs. Mem0's $24 million. Cognee's 7.5 million euros. Google's engineering investment in the Memory Agent. Anthropic's context engineering team. AWS's managed memory service. Hundreds of person-years going into vector databases, knowledge graphs, temporal tracking, retrieval benchmarks, consolidation pipelines.
All of it makes Layer 1 better. And Layer 1 is the 16% that already works.
I know it's 16% because I measured it. A model-assisted memory extractor reads my session transcripts and captures everything it can identify as important: facts, decisions, technical findings, project state. The overlap between what it captures and what I was actually carrying: 16%.
The other 84% — which mental models were active, what I'd ruled out and why, where my reasoning was heading, what mattered right now versus what was merely present — isn't a storage problem. You can't store a schema activation. You can't put forward projection in a vector database. You can't index a goal hierarchy in a knowledge graph and have it come back alive.
Everyone builds storage because storage is the only category that exists. If you're raising $24 million, you need a benchmark you can improve on. Layer 1 has those benchmarks. Layers 2 through 4 don't. The tools that would address the other 84% can't compete because they don't have a LongMemEval score to put in a pitch deck.
This isn't a complaint about the tools. Mem0 does personalization well. Zep handles temporal facts elegantly. Aider's repo map is the most technically rigorous approach to structural facts I've found anywhere. Copilot's citation-validated memory with auto-expiry is genuinely innovative. These are good tools solving real problems.
The point is the pattern. Tens of millions of dollars. A dozen architectural approaches. Five major labs. The entire open-source ecosystem. And every single one solves the same layer of the problem.
Not because the builders are uncreative. Not because the engineering is bad. Because the category that would address Layers 2-4 doesn't have a name yet. You can't build what you can't name. You can't fund what you can't benchmark. You can't pitch what doesn't have a category in the investor's mental model.
Everyone builds storage because storage is the only thing you can point at and say: look, it works. The retrieval accuracy went up. The benchmark score improved. The facts come back.
The facts always came back. That was never the problem.