Most retrieval systems fail long before they break. The demos work, the launch goes well, and the team moves on — and then, quietly, the system stops being the one that was shipped. New documents arrive and are not indexed. Old ones are deleted from the source but keep being retrieved. Permissions change in the system of record but not in the index. Someone upgrades the embedding model and half the corpus is now speaking a different language to the other half. Nothing throws an error. The dashboards stay green. Relevance just drifts, one reasonable-looking change at a time, until a user asks why the assistant keeps citing a policy that was withdrawn months ago. Retrieval doesn’t fail loudly. It rots quietly.
Retrieval is a living system, not a launch
The mistake is in the mental model. Retrieval-augmented generation exists, at its origin, precisely because the index is meant to change: the whole point of bolting a non-parametric memory onto a model was that "the non-parametric memory can be replaced to update the model’s knowledge as the world changes." The original RAG work even demonstrated it — hot-swapping a 2016 Wikipedia index for a 2018 one and watching the answers update without retraining. Freshness is not a maintenance chore you do to a retrieval system. It is the reason the retrieval system exists. Treat it as a launch project and you have misunderstood what you built. Retrieval is a living system, not a launch — and a living system that nobody tends is one that ages.
The vector store is a commodity; the lifecycle is the moat
It is worth saying plainly, because the industry spends most of its energy in the wrong place: the vector store and the embedding model are the most commoditised and most swappable parts of a retrieval system — and, not coincidentally, the least durable. The information-retrieval benchmarks are humbling here. Across BEIR’s heterogeneous tasks, a decades-old lexical baseline (BM25) remains stubbornly robust while dense neural retrievers, strong in the domain they were tuned on, degrade out of distribution. MTEB’s verdict on embeddings is blunter still: no single model dominates across tasks, and the field "has yet to converge on a universal text embedding method." The practical consequence is that an embedding-model upgrade is not a config change; it is a corpus-wide re-embed event, because an index with half its vectors from the old model and half from the new is silently incoherent. So do not build your moat in the part you will replace twice a year. Build it in the lifecycle.
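One way to make the "silently incoherent" state detectable is to record which model produced each vector and refuse to serve a mixed index. The sketch below assumes that convention; `StoredVector`, `model_version` and the version strings are hypothetical names for illustration, not any vector store's API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    chunk_id: str
    model_version: str  # hypothetical tag, written when the chunk is embedded
    values: tuple


def version_counts(index):
    """Count stored vectors per embedding-model version."""
    return Counter(v.model_version for v in index)


def is_coherent(index, serving_model):
    """True only if every vector came from the model that embeds queries."""
    return all(v.model_version == serving_model for v in index)


def chunks_to_reembed(index, serving_model):
    """Chunk ids still carrying vectors from another model version."""
    return [v.chunk_id for v in index if v.model_version != serving_model]


index = [
    StoredVector("a#0", "embed-v1", (0.1, 0.2)),
    StoredVector("a#1", "embed-v2", (0.3, 0.4)),
    StoredVector("b#0", "embed-v2", (0.5, 0.6)),
]
backlog = chunks_to_reembed(index, "embed-v2")  # work left before cut-over
```

A check like this turns a half-migrated index into an explicit backlog rather than a relevance mystery.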
The Living Index
That lifecycle has a shape. We call it the Living Index: the disciplines that keep a retrieval system tracking the reality it is supposed to represent, rather than drifting away from it. There are three, and together they produce the property everyone actually wants — durable retrieval:
- Freshness — the index reflects the world now: replace-on-change indexing (re-embed and re-upsert when content or chunk structure changes, not patch in place), explicit deletion so a removed document stops being retrieved, change feeds from the source rather than a calendar re-index, and recency as a ranking signal where it matters.
- Permissions — retrieval respects who is asking: resolve the authenticated identity to the set of documents it is allowed to see and apply that as a query-time filter, and re-sync those permissions when they change in the system of record. Crucially, this filter is correctness, not security — security trimming is not authorization — so it must be backed by least privilege, not trusted as a wall.
- Evaluation — the system is measured continuously, not at launch: a standing regression gate of retrieval metrics (recall@k, nDCG) and RAG metrics (context precision, context recall, faithfulness) over a golden set, run on every data and model change, so decay shows up in a graph instead of a complaint.
Freshness — keep the index tracking reality
Freshness is an engineering problem with known moves. When a document’s content changes but its chunking is stable, you can update vectors in place; the moment the number or order of chunks changes, the safe pattern is replace-on-change — delete every chunk for that document, then upsert the new ones — because patching a shifted chunk layout leaves orphans that keep surfacing. Deletion is not optional housekeeping: a document removed at the source must be explicitly removed from the index, or it will be retrieved long after it should have vanished. And the trigger should be a change feed from the source — an event when a record is written, updated or deleted — not a nightly batch that re-indexes everything and still misses the thing that changed five minutes after it ran. (The change-capture and recency-ranking layer is where most teams have the least built; it is also where staleness is won or lost.)
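The replace-on-change and explicit-deletion moves can be sketched against a toy in-memory index. Everything here is an assumption made for illustration: the `InMemoryIndex` class, the event shape (`op`, `doc_id`, `text`), the paragraph chunker, and the content hash used to skip no-op updates. A real pipeline would call an embedding model and a vector store where the comments indicate.

```python
import hashlib
from dataclasses import dataclass, field


def chunk(text):
    """Toy chunker: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@dataclass
class InMemoryIndex:
    """Stand-in for a vector store (hypothetical, not a real client API)."""
    chunks: dict = field(default_factory=dict)      # chunk_id -> (doc_id, text)
    doc_hashes: dict = field(default_factory=dict)  # doc_id -> content hash

    def delete_document(self, doc_id):
        """Remove every chunk owned by doc_id; return how many were removed."""
        stale = [cid for cid, (owner, _) in self.chunks.items() if owner == doc_id]
        for cid in stale:
            del self.chunks[cid]
        self.doc_hashes.pop(doc_id, None)
        return len(stale)

    def replace_document(self, doc_id, texts, digest):
        """Delete, then upsert, so a changed chunk layout leaves no orphans."""
        self.delete_document(doc_id)
        for position, text in enumerate(texts):
            # A real pipeline would embed `text` here before upserting.
            self.chunks[f"{doc_id}#{position}"] = (doc_id, text)
        self.doc_hashes[doc_id] = digest


def apply_change(index, event):
    """Apply one change-feed event; return what happened to the document."""
    if event["op"] == "delete":
        index.delete_document(event["doc_id"])
        return "deleted"
    digest = hashlib.sha256(event["text"].encode("utf-8")).hexdigest()
    if index.doc_hashes.get(event["doc_id"]) == digest:
        return "unchanged"  # same content: skip the re-embed
    index.replace_document(event["doc_id"], chunk(event["text"]), digest)
    return "replaced"
```

Feeding `apply_change` an upsert whose text shrinks from three paragraphs to one leaves exactly one chunk behind, which is the orphan problem the delete-then-upsert order is meant to avoid.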
Permissions — retrieval respects who is asking
This is the failure mode with the worst blast radius, because when it fails the system does not get worse — it leaks. The standard mechanism is sound: resolve the authenticated user to the set of documents (or groups, or attributes) they are allowed to see, and apply that as a filter at query time so the retriever only ever returns permitted material. The danger is in mistaking the mechanism for a guarantee. As one major vendor’s own documentation admits, the security principal in that filter is "just a string, used in a filter expression" — there is no authentication or authorization in it. Security trimming is not authorization. A wrong filter, a missing one, or one that has fallen out of date silently returns another user’s data — exactly the cross-user disclosure OWASP names as a top risk. And permissions rot like everything else: revoke someone’s access in the source system and they keep retrieving until that change is synced into the index. Permission drift is stale-index decay with a compliance incident attached — so changes to who-can-see-what must propagate to the index with the same urgency as changes to content. This is the retrieval-time face of the access question: the Agent Gateway governs what an agent may do; permissioned retrieval governs what it may see.
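A minimal sketch of the query-time filter, assuming group-based permissions stored on each chunk and a `directory` mapping that mirrors the system of record. The names (`Chunk`, `allowed_groups`, `resolve_groups`, `permitted_search`) are hypothetical, and the precomputed `score` stands in for a similarity search. As the paragraph above stresses, this is result trimming only: the caller must already have authenticated `user_id`, and real access control has to sit behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to see this chunk
    score: float               # stand-in for a similarity score


def resolve_groups(user_id, directory):
    """Map an already-authenticated user to groups; unknown users get none."""
    return frozenset(directory.get(user_id, ()))


def permitted_search(candidates, user_id, directory, k=5):
    """Return the top-k chunks the user may see; fails closed on no overlap."""
    groups = resolve_groups(user_id, directory)
    visible = [c for c in candidates if c.allowed_groups & groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


directory = {"alice": {"finance"}, "bob": {"eng"}}
corpus = [
    Chunk("c1", "Q3 forecast", frozenset({"finance"}), 0.9),
    Chunk("c2", "Deploy guide", frozenset({"eng"}), 0.8),
    Chunk("c3", "Holiday calendar", frozenset({"finance", "eng"}), 0.5),
]
alice_hits = [c.chunk_id for c in permitted_search(corpus, "alice", directory)]
```

Because the filter reads `directory` at query time, a revocation takes effect only once that mapping is updated, which is why permission changes need the same propagation path as content changes.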
Evaluation — make the rot visible
The reason retrieval rots unnoticed is that almost nobody measures it after launch. Evaluation is the discipline that turns silent decay into a visible signal: a standing regression gate, not a one-time acceptance test. Borrow the vocabulary information retrieval settled long ago — recall@k, nDCG, precision@k, MRR — and add the RAG-specific layer that pure ranking metrics miss: context precision (did the retriever rank the relevant chunks above the noise?), context recall (did it find everything it needed?), and faithfulness (did the answer stay grounded in what was retrieved?). Run them against a golden set on every data refresh and every model change, across more than one slice of your corpus, because a retriever that looks excellent on the domain it was tuned on can quietly fall apart elsewhere. The point is not the dashboard. The point is that the day the embedding upgrade halves your recall, a graph turns red before a customer does.
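The ranking metrics named above are small enough to write out. This sketch assumes binary relevance, unique document ids in each ranked list, and a golden set shaped as `(query, set_of_relevant_ids)` pairs; the `regression_gate` helper, its `tolerance` default and the baseline value are illustrative choices, not a standard.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """nDCG with binary gains: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def regression_gate(golden, retrieve, k, baseline_recall, tolerance=0.02):
    """Pass only if mean recall@k stays within `tolerance` of the baseline."""
    scores = [recall_at_k(retrieve(query), relevant, k) for query, relevant in golden]
    mean_recall = sum(scores) / len(scores)
    return mean_recall >= baseline_recall - tolerance, mean_recall
```

Wired into a pipeline, `regression_gate` would run after every re-index or model change, with `retrieve` bound to the candidate retriever and the baseline taken from the last accepted run.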
Six months on, the system was technically the same one that had shipped and operationally a different one: new documents unindexed, deleted ones still surfacing, permissions drifted. Nothing failed. No alarm fired. Quality declined one small decision at a time, and the first indication was not a monitor but a user, asking why the system kept returning information they knew was out of date.
| Discipline | The rot it prevents | What it does |
|---|---|---|
| Freshness | Stale index: changed or deleted sources still surface | Replace-on-change indexing; explicit deletion; change feeds; recency-aware ranking |
| Permissions | Permission drift: users see what they no longer should | Query-time permission filter, re-synced on change and backed by real access control |
| Evaluation | Silent quality regression with no error in the logs | Continuous evaluation against a golden set on every data and model change |
Patterns that buy durability — on top, not instead
None of this means the retrieval techniques do not matter; it means they sit on top of the lifecycle, not in place of it. The patterns that earn their keep are well established: hybrid search that runs lexical BM25 alongside dense vectors, because embeddings capture meaning but miss the exact string a user typed; a reranking pass over the shortlist; and contextual retrieval, which prepends a short, document-aware description to each chunk before embedding. Anthropic reports these compounding — contextual embeddings cutting top-20 retrieval failures by around a third, adding contextual BM25 taking it to roughly half, and adding a reranker to about two-thirds — though those are its own measurements, and reranking buys accuracy at a real cost in compute and latency. Use them. But notice that all of them improve quality at a point in time, and none of them keeps the index fresh, permissioned or evaluated tomorrow. The techniques are how retrieval gets good. The lifecycle is how it stays good.
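Hybrid search needs some way to merge the lexical and dense result lists. The article does not prescribe one; the sketch below uses reciprocal rank fusion (RRF), a common choice that works on ranks alone and so avoids comparing BM25 scores with cosine similarities. The constant `k=60` is the conventional default, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_hits = ["d1", "d2", "d3"]  # e.g. from a BM25 query
dense_hits = ["d3", "d1", "d4"]    # e.g. from a vector query
shortlist = reciprocal_rank_fusion([lexical_hits, dense_hits])
# shortlist == ["d1", "d3", "d2", "d4"]
```

Documents that both retrievers agree on rise to the top, and the fused shortlist is what a reranking pass would then reorder.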
The index is a mirror
There is a deeper way to read all of this. The index is a mirror: when retrieval degrades, it is usually reflecting a problem upstream — sources that changed without anyone telling the index, permissions that moved, content that was restructured, knowledge that was never very well governed to begin with. Which is why this article has a sibling one level up. Retrieval is one link in the context supply chain, and freshness, permissions and evaluation are how that link is kept trustworthy over time. Seen together, three questions form a single engineering doctrine for agents that act in the real world: what can the agent do (the Agent Gateway), what can it trust (the Context Supply Chain), and how does that trust stay true over time (the Living Index). Build retrieval as a living system, and you are really building the part of that doctrine that does not hold still.
Tend it, or it rots
Your retrieval system is probably rotting right now, and the reason you have not noticed is the reason it is dangerous: nothing has failed. The fix is not a better vector store or a newer embedding model — those are the parts you will swap anyway. The fix is to treat retrieval as the living system it always was: a freshness pipeline that tracks change, permissions that respect who is asking and stay in sync, and an evaluation gate that makes decay visible before a customer does. Freshness, permissions and evals are not operational chores bolted onto a retrieval system. They are the architecture.
Frequently asked
- Why does a retrieval / RAG system get worse over time?
- It rots: the index goes stale as sources change, embeddings drift when a model is upgraded, permissions fall out of sync with the system of record, and quality regresses — none of which throws an error. Retrieval doesn’t fail loudly; it rots quietly, until a user notices.
- What is the least durable part of a retrieval system?
- The embedding model and the vector store — the most commoditised and swappable parts. BM25 remains a robust baseline and no embedding model wins across all tasks, so a model upgrade is a corpus-wide re-embed event. Build durability into the lifecycle (freshness, permissions, evaluation), not the store.
- How do you keep a retrieval index fresh?
- Replace-on-change indexing (re-embed and re-upsert when content or chunk structure changes, not patch in place), explicit deletion so removed documents stop being retrieved, change feeds from the source rather than a calendar re-index, and recency as a ranking signal where it matters.
- How do you stop an agent retrieving documents a user isn’t allowed to see?
- Resolve the authenticated identity to permitted documents/attributes and apply that as a query-time filter, and re-sync permissions when they change in the source. But security trimming is not authorization — the filter is correctness, not a wall — so back it with least privilege and real access control. Revoked access keeps leaking until the index is re-synced (permission drift).
- How do you evaluate retrieval as a regression gate?
- Run IR metrics (recall@k, nDCG, precision@k, MRR) plus RAG metrics (context precision, context recall, faithfulness) against a golden set on every data and model change, across more than one slice of the corpus — so silent decay shows up as a red graph instead of a customer complaint.