Product Labs · Experiment

Context & Knowledge Systems

An open experiment in whether curated, versioned and owned context beats simply giving an agent more documents.

The context window keeps getting bigger, and the instinct is to fill it. If a model can read a million tokens, why not hand it the whole codebase, every ticket, the full wiki and last quarter's meeting notes, and let it sort things out? We have spent enough time building real systems on top of agents to be suspicious of that instinct. This theme is where we test it. The question we keep returning to is simple to state and hard to answer well: does an agent produce better work when you give it more, or when you give it the right thing, in the right shape, at the right moment? We are not AI researchers, and we make no claim to have settled this. We are delivery, architecture and product engineers applying emerging tools, and what we are learning so far points one way: bigger context windows do not create better results. Curated, versioned and owned context does.

What we explore and build

Most of our work in this theme is about treating context as something you design, not something you dump. A few patterns recur often enough that we now reach for them by default. The first is the structured, versioned context pack: a small, deliberate set of conventions, interfaces and constraints written in a form an agent can actually use, kept under version control so that what the agent was told is auditable and revertible rather than lost in a chat history. The second is project memory split into knowledge shards, so that a long-running task that spans many sessions begins each session with explicit, persisted state instead of starting cold. The third is structured retrieval over an indexed corpus, where the agent surfaces the specific shard a task needs rather than reading everything adjacent to it. The fourth, and the one we think matters most over time, is tiering knowledge by purpose, with explicit rules for freshness, eviction and provenance, so that hot, frequently-needed facts stay close, cold reference material is reachable but out of the way, and stale material is retired before it can mislead.

These are convictions tested against building, not finished methodology. Each one began as a workaround for something that went wrong, and each is still being revised as we hit its limits. The connective tissue between them is the discipline we describe in Context Engineering: Context Is the New Architecture: deciding what enters the window each turn is itself the work.

What the field already knows

We are not building these patterns in the dark; there is a growing body of evidence that the more-is-better assumption is wrong, and it converges from several directions. The foundational result is positional: models use information best when it sits at the start or end of the input and measurably worse when the relevant fact is buried in the middle of a long context, a pattern that holds even for models explicitly marketed as long-context. More recent work generalises this into what its authors call context rot. Tested across eighteen frontier models on tasks as trivial as retrieval and replication, performance does not hold steady as input grows; it degrades non-uniformly, and topically-related distractors make it worse the longer the context gets. Counter-intuitively, models sometimes scored higher on a shuffled haystack than on a logically-ordered one. The lesson is not that long context is useless, but that filling the window is an active liability you are choosing to take on.

On the retrieval side, structured approaches keep beating naive ones. Graph-structured retrieval that extracts an entity-relationship graph from a corpus, builds a hierarchy of communities and pre-summarises them outperformed conventional vector search on broad sensemaking questions over million-token corpora, while answering some queries with over ninety per cent fewer context tokens. That is the same lesson again from the opposite end: less, but better-organised, context wins. And on the question of what context should contain, the evidence is that freshness and provenance are first-class concerns rather than polish. Models struggle systematically with fast-changing facts and with false premises they are meant to debunk; injecting up-to-date, source-attributed evidence raises accuracy materially, and both the number and the ordering of retrieved evidence affect whether the answer is correct. Newer evaluation work goes further, separating whether an answer is correct from whether it is faithfully grounded in the source it cites, which is why we treat provenance as linking a specific claim to the specific shard that supports it, not as a tidy source list at the bottom.

What we are learning

Holding our own building up against that evidence, a thesis has formed. The constraint that matters is not the size of the window but the quality of what occupies it, and quality is a function of curation, versioning and ownership. Curation, because context is a finite resource with diminishing returns and the practitioner's real job is choosing what enters it each turn rather than pre-loading a corpus; the same source-attributed evidence work prescribes just-in-time retrieval, where an agent holds lightweight identifiers and loads data at runtime, plus note-taking persisted outside the window. Versioning, because an agent that spans sessions needs explicit persisted state, and structured artefacts committed with descriptive history give you something auditable and revertible instead of prose that the next run quietly overwrites. Ownership, because context you control is context you can keep fresh, evict when stale, and trace back to a source you trust.

More context is not more intelligence. An agent does not need everything we know; it needs the right shard, in the right shape, at the right moment, and a record of where it came from.

Said plainly, we have come to see this less as prompt-stuffing and more as memory architecture. The strongest precedent we lean on borrows from operating systems: a tiered hierarchy of an active window, a searchable warm store and a cold archive, with material moved between tiers on demand to give effectively unbounded reach within a fixed window. That framing matches what we keep rediscovering by hand, and it is the spine of the work we describe in The Memory Hierarchy of an Agentic Team: Choosing Between Convention Files, Retrieval, Graphs and Fine-Tuning.

Why this sits at the Experiment stage

We are deliberately honest that this is an experiment, not a product and not a proven method. The field is young, our track record in it is short, and several questions we cannot yet answer well would have to be answered before we would call any of this mature. We do not have a clean way to measure the cost of a curation regime against the messy alternative of dumping everything, and we suspect the answer differs by task. Eviction policy is mostly judgement rather than rule: knowing when a shard has gone stale is easier to assert than to automate. Provenance that traces each claim to its evidence is something we can build but not yet verify at scale, and faithfulness can fail even when correctness holds. We are encouraged that the wider community is formalising exactly these axes; recent retrieval evaluation now treats relevance, completeness and attribution as standard measures rather than afterthoughts, which gives us a yardstick we did not have a year ago. But formalising the question is not the same as having answered it, and we would rather say so than overclaim.

Where this goes next

The direction of travel is from intuition toward measurement. We want to put numbers on the trade we currently make on instinct: how much curated context, tiered how, beats how much raw context, and where the curve turns. We want eviction and freshness to become testable policies rather than habits, and provenance to be something we can audit claim by claim. None of that will make us experts overnight, and we are comfortable building in the open while the answers are still forming. If you are wrestling with the same problem, the adjacent work on The Acceptance Gap and the The AI Engineering Maturity Model is where we connect context discipline to whether teams actually trust and adopt what these systems produce. For now, the working belief stands, and we keep testing it against real building: own your context, shape it, version it, and let the window stay mostly empty.