Context Architecture · 13 min read · Updated 2026-06-21

Retrieval Architecture That Doesn’t Rot

Most retrieval systems fail long before they break. Retrieval doesn’t fail loudly — it rots quietly: the index goes stale, permissions drift, quality regresses with no error in the logs. The durable architecture is not the vector store; it is the lifecycle. Retrieval is a living system, not a launch.

By Priyanka Pandey · Founder & Editorial Lead

Reviewed and challenged by Sanjeev Purohit · Principal, Decision Architecture

Built from

Field experience
Independent research
Data-backed
Original framework
Reviewed with field experience

Last substantively reviewed · 2026-06-21

In brief

A retrieval/RAG system is a living system that rots silently through a stale index, embedding drift, permission drift and silent eval regression. Durability therefore comes from the lifecycle, not the store: the Living Index (Freshness, Permissions, Evaluation), whose outcome is Durable Retrieval.

Most retrieval systems fail long before they break. Retrieval doesn’t fail loudly — it rots quietly, with no error in the logs, until a user notices.
Retrieval is a living system, not a launch: RAG exists because the index is mutable, so freshness is the point, not a maintenance chore.
The vector store and embedding model are the most commoditised, swappable and least durable parts; a model upgrade is a corpus-wide re-embed event. Build the moat in the lifecycle, not the store.
The Living Index = Freshness + Permissions + Evaluation → the outcome Durable Retrieval.
Permissioned retrieval is the worst-blast-radius failure: security trimming is not authorization (the principal is just a string in a filter); permission drift leaks revoked access until the index is re-synced.
Evaluation makes rot visible: a standing IR + RAG regression gate (recall@k, nDCG, context precision/recall, faithfulness) on every data/model change, not a one-time launch check.
The index is a mirror: when retrieval degrades it usually reflects a problem upstream. This article covers the engineering of the retrieval link in the Context Supply Chain, and the see-side of the access question that the Agent Gateway governs.

Most retrieval systems fail long before they break. The demos work, the launch goes well, and the team moves on — and then, quietly, the system stops being the one that was shipped. New documents arrive and are not indexed. Old ones are deleted from the source but keep being retrieved. Permissions change in the system of record but not in the index. Someone upgrades the embedding model and half the corpus is now speaking a different language to the other half. Nothing throws an error. The dashboards stay green. Relevance just drifts, one reasonable-looking change at a time, until a user asks why the assistant keeps citing a policy that was withdrawn months ago. Retrieval doesn’t fail loudly. It rots quietly.

Retrieval is a living system, not a launch

The mistake is in the mental model. Retrieval-augmented generation exists, at its origin, precisely because the index is meant to change: the whole point of bolting a non-parametric memory onto a model was that "the non-parametric memory can be replaced to update the model’s knowledge as the world changes." The original RAG work even demonstrated it — hot-swapping a 2016 Wikipedia index for a 2018 one and watching the answers update without retraining. Freshness is not a maintenance chore you do to a retrieval system. It is the reason the retrieval system exists. Treat it as a launch project and you have misunderstood what you built. Retrieval is a living system, not a launch — and a living system that nobody tends is one that ages.

The four ways retrieval rots — stale index, embedding drift, permission drift, silent eval regression — none of which throws an error.

The vector store is a commodity; the lifecycle is the moat

It is worth saying plainly, because the industry spends most of its energy in the wrong place: the vector store and the embedding model are the most commoditised and most swappable parts of a retrieval system — and, not coincidentally, the least durable. The information-retrieval benchmarks are humbling here. Across BEIR’s heterogeneous tasks, a decades-old lexical baseline (BM25) remains stubbornly robust while dense neural retrievers, strong in the domain they were tuned on, degrade out of distribution. MTEB’s verdict on embeddings is blunter still: no single model dominates across tasks, and the field "has yet to converge on a universal text embedding method." The practical consequence is that an embedding-model upgrade is not a config change; it is a corpus-wide re-embed event, because an index with half its vectors from the old model and half from the new is silently incoherent. So do not build your moat in the part you will replace twice a year. Build it in the lifecycle.
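A minimal sketch of how a mixed-embedding index could be detected, assuming each stored vector carries a tag naming the model that produced it. The `model_version` field, the `StoredVector` type and the version strings are hypothetical illustrations, not the API of any particular vector store.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    values: tuple[float, ...]
    model_version: str  # hypothetical tag, written when the vector is embedded


def model_versions(index: list[StoredVector]) -> set[str]:
    """Return every embedding-model version present in the index."""
    return {v.model_version for v in index}


def is_coherent(index: list[StoredVector], serving_model: str) -> bool:
    """True only if no vector came from a model other than the serving one."""
    return model_versions(index) <= {serving_model}


def stale_doc_ids(index: list[StoredVector], serving_model: str) -> list[str]:
    """Documents that still need re-embedding before the new model can serve."""
    return sorted({v.doc_id for v in index if v.model_version != serving_model})


# Illustrative data: one document was re-embedded, the other was not.
index = [
    StoredVector("policy-1", (0.1, 0.9), "embed-v1"),
    StoredVector("policy-2", (0.4, 0.2), "embed-v2"),
]
```

A check like `is_coherent(index, "embed-v2")` could run before a new model is allowed to serve queries, so that a half-migrated corpus becomes a blocked cutover rather than a silent quality drop.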

The Living Index

That lifecycle has a shape. We call it the Living Index: the disciplines that keep a retrieval system tracking the reality it is supposed to represent, rather than drifting away from it. There are three, and together they produce the property everyone actually wants — durable retrieval:

Freshness — the index reflects the world now: replace-on-change indexing (re-embed and re-upsert when content or chunk structure changes, not patch in place), explicit deletion so a removed document stops being retrieved, change feeds from the source rather than a calendar re-index, and recency as a ranking signal where it matters.
Permissions — retrieval respects who is asking: resolve the authenticated identity to the set of documents it is allowed to see and apply that as a query-time filter, and re-sync those permissions when they change in the system of record. Crucially, this filter is correctness, not security — security trimming is not authorization — so it must be backed by least privilege, not trusted as a wall.
Evaluation — the system is measured continuously, not at launch: a standing regression gate of retrieval metrics (recall@k, nDCG) and RAG metrics (context precision, context recall, faithfulness) over a golden set, run on every data and model change, so decay shows up in a graph instead of a complaint.

The Living Index: Freshness · Permissions · Evaluation → the outcome everyone wants, Durable Retrieval.

Freshness — keep the index tracking reality

Freshness is an engineering problem with known moves. When a document’s content changes but its chunking is stable, you can update vectors in place; the moment the number or order of chunks changes, the safe pattern is replace-on-change — delete every chunk for that document, then upsert the new ones — because patching a shifted chunk layout leaves orphans that keep surfacing. Deletion is not optional housekeeping: a document removed at the source must be explicitly removed from the index, or it will be retrieved long after it should have vanished. And the trigger should be a change feed from the source — an event when a record is written, updated or deleted — not a nightly batch that re-indexes everything and still misses the thing that changed five minutes after it ran. (The change-capture and recency-ranking layer is where most teams have the least built; it is also where staleness is won or lost.)
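The replace-on-change and explicit-deletion moves can be sketched as follows. This is an in-memory illustration under stated assumptions: the `ChangeEvent` shape, the `"upsert"`/`"delete"` vocabulary and the fixed-width `chunk` function are hypothetical stand-ins, and a real index would store vectors rather than raw text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    op: str        # "upsert" or "delete" (assumed event vocabulary)
    doc_id: str
    text: str = ""


def chunk(text: str, size: int = 40) -> list[str]:
    """Naive fixed-width chunker, standing in for a real one."""
    return [text[i:i + size] for i in range(0, len(text), size)]


@dataclass
class ChunkIndex:
    # Maps (doc_id, chunk_no) -> chunk text; a real store would hold vectors.
    chunks: dict[tuple[str, int], str] = field(default_factory=dict)

    def delete_document(self, doc_id: str) -> None:
        """Remove every chunk that belongs to one document."""
        for key in [k for k in self.chunks if k[0] == doc_id]:
            del self.chunks[key]

    def apply(self, event: ChangeEvent) -> None:
        """Apply one change-feed event using replace-on-change."""
        if event.op not in ("upsert", "delete"):
            raise ValueError(f"unknown op: {event.op}")
        # Clear the old chunk set first, so a shorter new version
        # cannot leave orphaned chunks behind.
        self.delete_document(event.doc_id)
        if event.op == "upsert":
            for n, piece in enumerate(chunk(event.text)):
                self.chunks[(event.doc_id, n)] = piece
```

For example, upserting a 100-character document yields three chunks; upserting a 10-character revision of the same document then leaves exactly one, and a `"delete"` event leaves none. Patching chunk 0 in place would have left chunks 1 and 2 retrievable.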

Permissions — retrieval respects who is asking

This is the failure mode with the worst blast radius, because when it fails the system does not get worse — it leaks. The standard mechanism is sound: resolve the authenticated user to the set of documents (or groups, or attributes) they are allowed to see, and apply that as a filter at query time so the retriever only ever returns permitted material. The danger is in mistaking the mechanism for a guarantee. As one major vendor’s own documentation admits, the security principal in that filter is "just a string, used in a filter expression" — there is no authentication or authorization in it. Security trimming is not authorization. A wrong filter, a missing one, or one that has fallen out of date silently returns another user’s data — exactly the cross-user disclosure OWASP names as a top risk. And permissions rot like everything else: revoke someone’s access in the source system and they keep retrieving until that change is synced into the index. Permission drift is stale-index decay with a compliance incident attached — so changes to who-can-see-what must propagate to the index with the same urgency as changes to content. This is the retrieval-time face of the access question: the Agent Gateway governs what an agent may do; permissioned retrieval governs what it may see.
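A small sketch of query-time security trimming and of permission drift, assuming group-based ACLs copied into the index at indexing time. `GROUPS_BY_USER`, `resolve_groups` and `resync_acls` are hypothetical names for illustration; in practice the identity would come from an authenticated session and the ACLs from the system of record. As the text stresses, this filter is a correctness mechanism and would still need real access control behind it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # ACL snapshot copied in at index time


# Hypothetical stand-in for the identity provider.
GROUPS_BY_USER: dict[str, set[str]] = {
    "asha": {"hr", "all-staff"},
    "ben": {"all-staff"},
}


def resolve_groups(user_id: str) -> frozenset[str]:
    """Resolve an already-authenticated user to group memberships."""
    return frozenset(GROUPS_BY_USER.get(user_id, set()))


def permitted(chunks: list[Chunk], user_id: str) -> list[Chunk]:
    """Keep only chunks sharing a group with the user; unknown users get nothing."""
    groups = resolve_groups(user_id)
    return [c for c in chunks if c.allowed_groups & groups]


def resync_acls(chunks: list[Chunk], source_acls: dict[str, set[str]]) -> list[Chunk]:
    """Rebuild ACL snapshots from the source; documents missing there get an empty ACL."""
    return [
        Chunk(c.doc_id, c.text, frozenset(source_acls.get(c.doc_id, set())))
        for c in chunks
    ]
```

If a document's ACL is tightened at the source from `all-staff` to `hr`, `permitted` keeps returning it to `ben` until `resync_acls` runs, which is the drift window described above. The filter is only as current as the last sync.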

Evaluation — make the rot visible

The reason retrieval rots unnoticed is that almost nobody measures it after launch. Evaluation is the discipline that turns silent decay into a visible signal: a standing regression gate, not a one-time acceptance test. Borrow the vocabulary information retrieval settled long ago — recall@k, nDCG, precision@k, MRR — and add the RAG-specific layer that pure ranking metrics miss: context precision (did the retriever rank the relevant chunks above the noise?), context recall (did it find everything it needed?), and faithfulness (did the answer stay grounded in what was retrieved?). Run them against a golden set on every data refresh and every model change, across more than one slice of your corpus, because a retriever that looks excellent on the domain it was tuned on can quietly fall apart elsewhere. The point is not the dashboard. The point is that the day the embedding upgrade halves your recall, a graph turns red before a customer does.
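The ranking metrics named above are simple enough to sketch directly. The following assumes binary relevance labels and a golden set shaped as a mapping from query to the set of relevant document ids; the `gate` function, its threshold and the `retrieve` callable are hypothetical illustrations rather than the interface of any evaluation library. The RAG-layer metrics (context precision, context recall, faithfulness) need a judge over generated answers and are not covered here.

```python
from __future__ import annotations

import math
from typing import Callable


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for pos, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / pos
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """nDCG with binary relevance: discounted gain over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def gate(
    golden: dict[str, set[str]],
    retrieve: Callable[[str], list[str]],
    k: int,
    min_recall: float,
) -> tuple[bool, float]:
    """Return (passed, mean recall@k) for a retriever over the golden set."""
    if not golden:
        raise ValueError("golden set is empty")
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in golden.items()]
    mean = sum(scores) / len(scores)
    return mean >= min_recall, mean
```

Run on every data refresh and model change, a failing `gate` result is the "red graph": the build or deployment can stop on it. Running the same gate over separate golden sets per corpus slice is one way to catch a retriever that only holds up on the domain it was tuned on.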

Six months on it was technically the same system and operationally a different one — new documents unindexed, deleted ones still surfacing, permissions drifted. Nothing failed. No alarm fired. Quality declined one small decision at a time, and the first indication was not a monitor — it was a user, asking why the system kept returning information they knew was out of date.
— Sanjeev Purohit, from our delivery work

Discipline	The rot it prevents	What it does
Freshness	Stale index: changed or deleted sources still surface	Re-index on change and delete explicitly; recency-aware ranking
Permissions	Permission drift: users see what they no longer should	Query-time filter from the resolved identity, re-synced on every ACL change and backed by least privilege
Evaluation	Silent quality regression with no error in the logs	Continuous evaluation against a known-good set

The Living Index — retrieval is a living system, not a launch. Durability lives in the lifecycle, not the vector store.

Patterns that buy durability — on top, not instead

None of this means the retrieval techniques do not matter; it means they sit on top of the lifecycle, not in place of it. The patterns that earn their keep are well established: hybrid search that runs lexical BM25 alongside dense vectors, because embeddings capture meaning but miss the exact string a user typed; a reranking pass over the shortlist; and contextual retrieval, which prepends a short, document-aware description to each chunk before embedding. Anthropic reports these compounding — contextual embeddings cutting top-20 retrieval failures by around a third, adding contextual BM25 taking it to roughly half, and adding a reranker to about two-thirds — though those are its own measurements, and reranking buys accuracy at a real cost in compute and latency. Use them. But notice that all of them improve quality at a point in time, and none of them keeps the index fresh, permissioned or evaluated tomorrow. The techniques are how retrieval gets good. The lifecycle is how it stays good.
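The article does not say how the lexical and dense result lists are merged in hybrid search. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used in the measurements quoted above. It needs only the two ranked lists of document ids, so it works regardless of how differently BM25 and vector similarity scores are scaled. The constant `k = 60` is the value conventionally used for RRF.

```python
from __future__ import annotations

from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Illustrative inputs: ids as returned by a lexical and a dense retriever.
lexical = ["a", "b", "c"]
dense = ["c", "a", "d"]
fused = reciprocal_rank_fusion([lexical, dense])
```

Documents that both retrievers rank highly rise to the top of `fused`, while a document found by only one retriever still survives into the shortlist that a reranker would then reorder.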

The index is a mirror

There is a deeper way to read all of this. The index is a mirror: when retrieval degrades, it is usually reflecting a problem upstream — sources that changed without anyone telling the index, permissions that moved, content that was restructured, knowledge that was never very well governed to begin with. Which is why this article has a sibling one level up. Retrieval is one link in the context supply chain, and freshness, permissions and evaluation are how that link is kept trustworthy over time. Seen together, three questions form a single engineering doctrine for agents that act in the real world: what can the agent do (the Agent Gateway), what can it trust (the Context Supply Chain), and how does that trust stay true over time (the Living Index). Build retrieval as a living system, and you are really building the part of that doctrine that does not hold still.

One engineering doctrine, three questions: what can the agent DO (Agent Gateway), TRUST (Context Supply Chain), and how does that trust stay TRUE over time (the Living Index).

Tend it, or it rots

Your retrieval system is probably rotting right now, and the reason you have not noticed is the reason it is dangerous: nothing has failed. The fix is not a better vector store or a newer embedding model; those are the parts you will swap anyway. The fix is to treat retrieval as the living system it always was: a freshness pipeline that tracks change, permissions that respect who is asking and stay in sync, and an evaluation gate that makes decay visible before a customer notices it. Freshness, permissions and evals are not operational chores bolted onto a retrieval system. They are the architecture.

Frequently asked

Why does a retrieval / RAG system get worse over time?: It rots: the index goes stale as sources change, embeddings drift when a model is upgraded, permissions fall out of sync with the system of record, and quality regresses — none of which throws an error. Retrieval doesn’t fail loudly; it rots quietly, until a user notices.
What is the least durable part of a retrieval system?: The embedding model and the vector store — the most commoditised and swappable parts. BM25 remains a robust baseline and no embedding model wins across all tasks, so a model upgrade is a corpus-wide re-embed event. Build durability into the lifecycle (freshness, permissions, evaluation), not the store.
How do you keep a retrieval index fresh?: Replace-on-change indexing (re-embed and re-upsert when content or chunk structure changes, not patch in place), explicit deletion so removed documents stop being retrieved, change feeds from the source rather than a calendar re-index, and recency as a ranking signal where it matters.
How do you stop an agent retrieving documents a user isn’t allowed to see?: Resolve the authenticated identity to permitted documents/attributes and apply that as a query-time filter, and re-sync permissions when they change in the source. But security trimming is not authorization — the filter is correctness, not a wall — so back it with least privilege and real access control. Revoked access keeps leaking until the index is re-synced (permission drift).
How do you evaluate retrieval as a regression gate?: Run IR metrics (recall@k, nDCG, precision@k, MRR) plus RAG metrics (context precision, context recall, faithfulness) against a golden set on every data and model change, across more than one slice of the corpus — so silent decay shows up as a red graph instead of a customer complaint.

Our perspective

The common view

RAG is a build: pick the best vector store and embedding model, index your documents, ship it. Quality is a function of the model and the store; once it works at launch, it works.

The Ivaaya view

Retrieval is a living system, not a launch — it rots silently (stale index, embedding drift, permission drift, silent eval regression). The store and embedding model are the commoditised, swappable, least durable parts; the durable architecture is the lifecycle: the Living Index of Freshness, Permissions and Evaluation, producing Durable Retrieval. The index is a mirror of upstream knowledge quality, which is why this is the retrieval link of the Context Supply Chain and the see-side of the Agent Gateway.

“We picked the best embedding model and vector DB, so retrieval is solved.”: Those are the parts you will replace twice a year. BM25 still beats many dense retrievers out of distribution, and no embedding model wins across tasks. A model upgrade is a corpus-wide re-embed event. Durability lives in the lifecycle, not the store.
“Metadata filters give us document-level access control.”: Security trimming is not authorization. The principal is just a string in a filter, so a wrong, missing or stale filter silently leaks another user’s data (OWASP LLM02). Back it with least privilege and real access control, and re-sync permissions on change or they drift.
“It worked at launch and we haven’t had incidents, so it’s fine.”: No incident is exactly the symptom: retrieval rots without failing. Without a standing evaluation gate you learn about decay from a customer, not a graph. Most retrieval systems fail long before they break.

Treat retrieval as a production system with a lifecycle: change feeds + replace-on-change indexing + explicit deletion, not a calendar re-index.
On an embedding-model upgrade, re-embed the whole corpus (blue/green or shadow); never run a mixed-embedding index.
Resolve identity → permitted documents → query-time filter, re-sync permissions on ACL change, and back the filter with least privilege (it is correctness, not security).
Run a standing IR + RAG evaluation regression gate on every data/model change, multi-dataset, so silent decay is caught early.

The evidence & related ideas

What we’ve observed

RAG exists because the index is mutable — non-parametric memory is replaceable to update knowledge; "index hot-swapping" demonstrated (Lewis et al. 2020, NeurIPS).
The embedding/neural layer is the least durable, least generalisable link: BM25 a robust baseline, dense retrieval underperforms out-of-distribution (BEIR 2021 — dated; gap narrowed not closed); no embedding model dominates across tasks (MTEB).
Permissioned retrieval = identity → permitted attributes → query-time filter (Azure AI Search, AWS Bedrock); but the filter principal is "just a string… no authentication or authorization" (Azure) — security trimming is not authZ; OWASP LLM02 names cross-user disclosure a top risk.
Permission drift: ACL/group changes don’t take effect until re-synced into the index — revoked access keeps leaking (Azure "timing lag").
Standing eval uses IR metrics (NDCG@k, MAP@k, Recall@k, Precision@k, MRR) across heterogeneous datasets (BEIR) + RAG metrics context precision/recall/faithfulness (RAGAS).
Hybrid (BM25+dense) + reranking + contextual retrieval compound to large retrieval-failure reductions (~35%/49%/67%, Anthropic’s own measurements; reranking trades compute/latency for accuracy).

A RAG "finished" at launch and a different system six months later — un-re-embedded index after a model bump, deleted docs still surfacing, no eval; the only signal a slow quality drift one customer finally named.
A revoked employee whose documents the agent kept surfacing because the permission change never propagated to the index.

How certain are we?

RAG is a living system; the index is mutable by design and must be kept fresh — established: Observed repeatedly across delivery programmes.
The embedding model / vector store is the least durable, swappable layer (BM25 robust, no universal embedding) — established: Observed repeatedly across delivery programmes.
Security trimming is not authorization; permission drift leaks revoked access until re-sync — established: Observed repeatedly across delivery programmes.
Durability comes from the lifecycle (Freshness/Permissions/Evaluation), not the store (our argued framework) — emerging: Still early, but increasingly visible.
Retrieval decays silently over deployment time (mechanisms established; longitudinal decay not yet measured) — emerging: Still early, but increasingly visible.

About the author

Priyanka Pandey

Founder & Editorial Lead

Priyanka Pandey founded Ivaaya and leads its editorial voice, translating real delivery experience into practical thinking on AI-native engineering, decision-making and technology leadership. Her work focuses on helping senior leaders make sense of the changes reshaping software delivery without adding to the noise.

Reviewed and challenged by

Sanjeev Purohit

Principal, Decision Architecture

Sanjeev works across enterprise architecture, product strategy and AI-native delivery. The ideas in this article have been challenged against real programmes, production systems and organisational decision-making before publication.

Part of a perspective

For the Architect: Architecture as decision-making in the agentic era · Step 7 of 8

Compare notes

If your retrieval system worked at launch and nobody is quite sure it still does, that is the failure mode — it rots quietly, without an error in the logs. Tell us where your index might be drifting from reality; we are comparing notes with teams treating retrieval as a living system — freshness, permissions and evaluation as ongoing disciplines — rather than a launch project they shipped and left.
