Most teams treat agent knowledge as a prompting problem. They write a longer system prompt, bolt on a vector store, and call it context engineering. But once you run agents in anger — across long-horizon tasks, multiple repositories and a fleet of contributors — a quieter truth surfaces: deciding what an agent knows, where that knowledge lives, and when it expires is not prompt-craft at all. It is systems design. And like every other systems-design problem, it has a hierarchy.
Anthropic now frames this explicitly. In its September 2025 guidance, it positions context engineering as the successor to prompt engineering, defining it as the discipline of curating and maintaining the optimal set of tokens during inference. The guiding principle — find the smallest set of high-signal tokens that maximise the likelihood of your desired outcome — reframes context as a finite resource with diminishing marginal returns, drawn against the model's attention budget. That single move, from 'write a good prompt' to 'manage a scarce budget', is what turns knowledge into an engineering trade-off.
Why the budget is real, not rhetorical
The constraint is not hypothetical. Chroma's 2025 'Context Rot' study (Hong, Troynikov and Huber) evaluated eighteen models — including GPT-4.1, Claude Opus 4, Claude Sonnet 4, Gemini 2.5 Pro and Flash, and Qwen3 — and found that every one degrades non-uniformly as input length grows, even on trivial tasks like replicating a word. A single distractor measurably reduces performance against baseline. This echoes the earlier 'Lost in the Middle' result (Liu et al., TACL 2024), which showed a U-shaped accuracy curve: models retrieve well from the start and end of context and poorly from the middle. The implication is uncomfortable for the 'just stuff it all in the window' camp. More tokens is not more knowledge; past a point it is more noise. If attention is a budget, then memory is the storage hierarchy that decides what gets spent.
Four substrates, four cost profiles
The agentic-team knowledge stack has, in practice, converged on four competing substrates. Each sits at a different point on the speed/freshness/cost curve, and most teams adopt one by default rather than placing each deliberately.
The first is the committed convention file — the registers of the hierarchy. AGENTS.md, a repo-root markdown file read natively by Codex, Cursor, Copilot, Gemini CLI, Aider, Windsurf and Zed, is now used by more than 60,000 open-source projects and, as of December 2025, is stewarded by the Agentic AI Foundation under the Linux Foundation alongside MCP and goose. It is always in context, versioned with the code, and reviewed like code. It is also small and slow to change by design — which is exactly what you want for stable conventions and exactly wrong for volatile facts.
The second is vector retrieval: the disk. Cheap, vast, and refreshable without retraining, but only as good as its chunking and ranking, and subject to the rot and middle-loss effects above.
The third is the structured knowledge graph. Microsoft's GraphRAG extracts a graph from source text, builds a hierarchy of entity communities, pre-generates community summaries, and combines vector similarity with graph traversal at query time. Microsoft Research reported substantial gains in answer comprehensiveness over baseline vector RAG on the 'connect-the-dots' global questions that flat retrieval handles badly. Graphs cost more to build and maintain; they earn their keep when relationships, not just passages, are the answer.
The fourth is weight-level fine-tuning, and here the literature is bracingly contrarian about the default instinct to 'just train it on our docs'. The EMNLP 2024 study by Gekhman, Yona, Aharoni and colleagues found that fine-tuning examples carrying genuinely new knowledge are learned far more slowly than examples consistent with what the model already knows, and that as they are eventually learned they linearly increase the model's tendency to hallucinate. Their conclusion: models mostly acquire facts in pre-training, and fine-tuning teaches them to use existing knowledge more efficiently. Ovadia et al. (arXiv:2312.05934) reach a complementary verdict: unsupervised fine-tuning is not competitive with RAG for knowledge injection. Separate work on combining the two (Balaguer et al., 2024) reports that the accuracy gains are cumulative. Fine-tuning is for skills, formats and behaviours, not facts.
Treat agent knowledge like a storage hierarchy — registers, cache, disk — rather than a prompt, and the question of what belongs where stops being a vibe and becomes an engineering trade-off with eviction, freshness and provenance rules.
The hierarchy is not a metaphor
The storage-hierarchy framing has a real lineage. MemGPT (arXiv:2310.08560) proposed an explicit OS-style memory hierarchy for agents: core memory always in context as a fixed-size scratchpad, recall memory as searchable history paged in on demand, and archival memory as a long-term store paged in when needed, with the model itself managing eviction and paging. The economics are not metaphorical either: KV-cache memory grows linearly with sequence length and batch size and can rival or exceed model-weight memory for long-context workloads, while self-attention compute scales quadratically with sequence length. The cost of keeping something in the fastest tier is therefore concrete and measurable: linear in memory and quadratic in compute. That is precisely why you need eviction policy, not just storage.
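To make the linear-versus-quadratic point concrete, here is a back-of-envelope sketch in Python. The model dimensions (32 layers, 32 attention heads, 8 KV heads, head dimension 128, 2-byte values) are illustrative assumptions, not the configuration of any particular model, and the FLOP count is a rough estimate that ignores causal masking and everything outside the attention score computation.

```python
def kv_cache_bytes(seq_len, batch, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes held by the KV cache.

    One key vector and one value vector (hence the factor 2) are stored
    per token, per layer, per KV head. Linear in seq_len and in batch.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


def attention_flops(seq_len, n_layers, n_heads, head_dim):
    """Rough FLOPs for attention over a full sequence.

    Per head and layer: Q @ K^T costs about 2 * seq_len^2 * head_dim FLOPs,
    and weights @ V costs about the same. Quadratic in seq_len.
    """
    return 4 * n_layers * n_heads * head_dim * seq_len * seq_len


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for seq_len in (8_192, 16_384, 32_768):
        cache = kv_cache_bytes(seq_len, batch=1, **SHAPE)
        flops = attention_flops(seq_len, n_layers=32, n_heads=32, head_dim=128)
        print(f"{seq_len:>6} tokens  cache={cache / 2**30:.1f} GiB  attention~{flops:.2e} FLOPs")
```

Under these assumed dimensions, doubling the context doubles the cache but quadruples the attention compute, which is the asymmetry an eviction policy is there to manage.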
The rules most teams skip: eviction, freshness, provenance
Picking substrates is the easy half. The half teams skip is lifecycle. Anthropic's own long-horizon recommendations are essentially tiering policy: compaction (summarise and reinitialise), structured note-taking held outside the context window, and sub-agent architectures that return condensed summaries rather than full outputs. Production memory practice goes further, treating freshness and eviction as first-class — TTL and tier-based lifetimes, supersession-on-contradiction, scope-bound eviction where a memory declares where it is valid and expires when that scope ends, and decay-based re-ranking that boosts recently-used memories and dampens stale ones (as articulated by mem0 and others). Without these, retrieval stores silently accumulate contradictions and convention files ossify.
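The lifecycle rules listed above can be sketched as a small in-memory store. This is a minimal illustration, not how mem0 or any other product implements them: the class and field names are hypothetical, 'contradiction' is simplified to 'a newer write on the same key', and decay is modelled as a simple half-life on time since last use.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Memory:
    key: str          # what the fact is about, e.g. "ci.runner"
    value: str
    scope: str        # where the fact is valid, e.g. "sprint-42"
    created_at: float
    ttl: float        # lifetime in seconds from creation
    last_used: Optional[float] = None

    def __post_init__(self):
        if self.last_used is None:
            self.last_used = self.created_at


class MemoryStore:
    def __init__(self, half_life: float = 3600.0):
        self.half_life = half_life
        self._items: Dict[str, Memory] = {}

    def write(self, mem: Memory) -> None:
        # Supersession: a newer write on the same key replaces the older one.
        old = self._items.get(mem.key)
        if old is None or mem.created_at >= old.created_at:
            self._items[mem.key] = mem

    def get(self, key: str) -> Optional[Memory]:
        return self._items.get(key)

    def end_scope(self, scope: str) -> None:
        # Scope-bound eviction: everything valid only in this scope goes.
        self._items = {k: m for k, m in self._items.items() if m.scope != scope}

    def evict_expired(self, now: float) -> None:
        # TTL eviction: drop memories older than their declared lifetime.
        self._items = {k: m for k, m in self._items.items()
                       if now - m.created_at < m.ttl}

    def touch(self, key: str, now: float) -> None:
        if key in self._items:
            self._items[key].last_used = now

    def rank(self, candidates: Dict[str, float], now: float) -> List[Tuple[str, float]]:
        """Re-rank retriever scores (key -> relevance) by recency of use."""
        self.evict_expired(now)
        scored = []
        for key, relevance in candidates.items():
            mem = self._items.get(key)
            if mem is None:
                continue  # evicted or never stored
            decay = 0.5 ** ((now - mem.last_used) / self.half_life)
            scored.append((key, relevance * decay))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In use, the agent would call `rank` with whatever relevance scores its retriever produced and `touch` each memory it actually relied on, so that frequently used facts resist decay while untouched ones fade and eventually hit their TTL.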
Then there is provenance — the dimension that turns memory hygiene into delivery assurance. When an agent acts on a remembered fact, you increasingly need to prove where that fact came from and who stood behind the change it produced. SLSA's Source Track requires verifiable proof of who authored a change, built on the in-toto attestation framework (ITE-6), the same signed-envelope model used by Sigstore. That chain can record which human authored an intent, which agent or model generated the implementation, and which human approved it — positioned as the missing control for AI-generated commits. A memory hierarchy without provenance answers 'what does the agent believe'; it cannot answer 'and can we trust why'.
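As a rough illustration of what such a chain might record, the sketch below builds an unsigned in-toto Statement (v1 layout: `_type`, `subject`, `predicateType`, `predicate`) for a commit. The predicate type URI and the three predicate fields are hypothetical, invented here for illustration; they are not a published SLSA or in-toto predicate. A real deployment would also wrap and sign the statement in an envelope, which this sketch omits.

```python
import json

IN_TOTO_STATEMENT_V1 = "https://in-toto.io/Statement/v1"
# Hypothetical predicate type: an assumption for this sketch, not a real spec.
AGENT_AUTHORSHIP_PREDICATE = "https://example.com/agent-authorship/v0"

REQUIRED_ROLES = ("intentAuthor", "generatedBy", "approvedBy")


def agent_change_statement(commit_sha, intent_author, generated_by, approved_by):
    """Build an unsigned statement tying a commit to the three roles in the chain."""
    return {
        "_type": IN_TOTO_STATEMENT_V1,
        "subject": [
            # The subject is the artefact being attested: here, a git commit.
            {"name": "commit", "digest": {"gitCommit": commit_sha}},
        ],
        "predicateType": AGENT_AUTHORSHIP_PREDICATE,
        "predicate": {
            "intentAuthor": intent_author,   # human who authored the intent
            "generatedBy": generated_by,     # agent or model that wrote the change
            "approvedBy": approved_by,       # human who approved it
        },
    }


def missing_roles(statement):
    """Return the roles that are absent or empty, so a gate can reject the change."""
    predicate = statement.get("predicate", {})
    return [role for role in REQUIRED_ROLES if not predicate.get(role)]


if __name__ == "__main__":
    stmt = agent_change_statement(
        commit_sha="0" * 40,                 # placeholder SHA
        intent_author="alice@example.com",
        generated_by="example-agent",
        approved_by="bob@example.com",
    )
    print(json.dumps(stmt, indent=2))
    print("missing:", missing_roles(stmt))
```

The design point is that all three roles travel with the commit itself, so a merge gate can refuse any change whose statement leaves one of them blank.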
Designing the hierarchy deliberately
The practical discipline is to map each class of knowledge to a tier on purpose. Stable conventions and house style belong in committed files, reviewed as code. Volatile, high-volume reference belongs in retrieval, with TTLs and supersession. Relationship-heavy, cross-document reasoning justifies a graph. Durable behaviours and output formats — not facts — are candidates for fine-tuning, and even then chiefly to sharpen existing capability. Each placement carries an eviction rule, a freshness rule and a provenance trail. Get this right and you stop paying quadratic attention costs to re-derive what a markdown file could have asserted in twenty tokens — and you stop trusting a vector hit nobody can trace.
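One way to keep that mapping honest is to write it down as data and check it. The sketch below encodes the four placements from the paragraph above; the specific eviction, freshness and provenance rules attached to each tier are illustrative assumptions, as are all the names.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    COMMITTED_FILE = "committed convention file"
    RETRIEVAL = "vector retrieval"
    GRAPH = "knowledge graph"
    FINE_TUNING = "fine-tuning"


@dataclass(frozen=True)
class Placement:
    tier: Tier
    eviction: str     # how an entry leaves the tier
    freshness: str    # how an entry is kept current
    provenance: str   # how an entry is traced to its source


# Knowledge classes are from the article; the rule texts are assumptions.
PLACEMENTS = {
    "stable conventions and house style": Placement(
        Tier.COMMITTED_FILE,
        eviction="removed by a reviewed commit",
        freshness="updated through code review",
        provenance="version-control history",
    ),
    "volatile, high-volume reference": Placement(
        Tier.RETRIEVAL,
        eviction="TTL expiry",
        freshness="supersession on contradiction",
        provenance="source document identifier stored with each chunk",
    ),
    "relationship-heavy, cross-document reasoning": Placement(
        Tier.GRAPH,
        eviction="dropped when the source text is removed",
        freshness="re-extracted when sources change",
        provenance="edges link back to source passages",
    ),
    "durable behaviours and output formats": Placement(
        Tier.FINE_TUNING,
        eviction="retire the tuned model",
        freshness="retrain on a schedule",
        provenance="versioned training set",
    ),
}


def incomplete_placements(placements):
    """Return the knowledge classes missing any of the three lifecycle rules."""
    return [
        name for name, p in placements.items()
        if not (p.eviction.strip() and p.freshness.strip() and p.provenance.strip())
    ]
```

A check like `incomplete_placements(PLACEMENTS)` could run in CI, so that adding a new class of knowledge without deciding how it expires or where it came from fails loudly instead of silently.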
This is, ultimately, the same constraint that governs everything downstream of cheap generation. Abundant context does not make agents reliable; deliberately governed context does. If you want to understand why this matters more as generation gets cheaper — why the binding constraint moves from producing knowledge to accepting and trusting it — read our companion piece on the acceptance gap.