Silent but Deadly — Context Rot Problems in Legal
We have mentioned “ceilings” in LLM performance before, and we are hearing more people talk about them.
We are having conversations with firms who have adopted the major Gen AI platforms, rolled them out across their lawyers, and are now beginning to explore alternatives.
The Gen AI platforms work for a great number of tasks, but they are not working for everyone and not for everything. In specific, important ways, the performance is not meeting expectations.
Two kinds of lawyers facing one shared reliability ceiling
Inside law firms, the lawyers have split into roughly two camps:
The first camp are the laggards. They use the AI assistant like a search engine. They have not figured out how to use tabular review. They need training, and they know it.
The second camp are the advanced users. They know how to prompt. They know how to structure a query. They have built workflows and habits. They are getting value.
But both groups are reporting that they are hitting the same reliability ceiling.
That reliability ceiling has two dimensions:
The first is factual accuracy — did the AI actually identify and correctly extract the relevant information in the source documents? For legal work, this must be 100%. Anything built on top of incomplete or incorrect facts is worthless, regardless of how good the reasoning sounds.
The second is analytical quality — did the AI apply the right reasoning, flag the right issues, draw the right conclusions? This is where most lawyers land at something like “75%” and call it “good enough”. The AI has given them a solid starting point.
For many tasks, especially for long documents, these numbers are much lower than most folks admit.
The law firm as quality assurance
Before we dive into the technical research, it is worth examining how law firms are built to ensure the quality of legal work.
The leverage model — juniors, mid-levels, and partners — is not just a revenue structure. The leverage model is a quality assurance architecture. We, as a profession, have calibrated this QA architecture over decades and centuries of legal practice.
The junior lawyer reads every single word of the source documents. They put together a first draft — an agreement, an advice, a review memo. Their factual accuracy is, say, around 90%. Their analytical quality is, say, maybe 50% — they are inexperienced, so they do not always know what matters.
That draft goes to the mid-level lawyer. The mid-level has done this before. They know the client, they know the matter, they can distinguish what is standard from what is specific to this situation. They lift the analytical quality from, say, 50% to 80%. They also catch the factual gaps the junior missed, lifting that from 90% to 95%.
Then the work goes to the partner. The partner has seen this type of document a hundred times. They have a market view. They know what is soft, what is aggressive, what the client can push for that they have not even asked for yet. They lift the analytical quality again, maybe to 99%, and they catch the things neither the junior nor the mid-level knew to look for, lifting the factual accuracy to 99%.
Every layer catches different failures. Together they form our familiar QA system — each pass improving both the factual foundation and the analytical superstructure. This has worked extraordinarily well. It is, in our opinion, one of the most effective quality assurance systems ever built for knowledge work.
The problem is that the law firm apprentice system was built to catch human errors. More specifically, it was built on an assumption that the work product handed up the chain is a complete representation of the source material. A junior may misunderstand clause 14.3. They will not silently forget to include it.
AI doesn't fail like humans fail
When a junior lawyer makes a mistake, it looks like a junior lawyer mistake. They missed something because they did not know to look for it. They misread a provision because they lack experience. They dropped a point because they ran out of time.
The mid-level knows what juniors get wrong.
The partner knows what mid-levels miss.
The whole system is calibrated to these failure signatures.
AI fails differently.
For short documents — something under roughly 100 pages — a well-deployed language model is arguably more reliable than a junior lawyer on factual accuracy. It reads every word. It does not get tired. On a short, bounded task, AI might produce a work product that has a 95% factual accuracy where a junior gives you 90%. For longer documents, the picture changes. LLMs processing long documents suffer from context rot, and context rot is silent.
Context rot is more than just one thing. It is a family of related failures, each producing the same symptom: output that looks complete but isn't.
(Positional) The model stops paying attention in the middle. Language models read documents the way an exhausted person does — they focus hard at the start, drift through the middle, and snap back to attention at the end. Stanford researchers measured accuracy dropping from 70% to 55% when the relevant information moved from the beginning to the middle of the document (Liu et al., Lost in the Middle, TACL 2024). In a 200-page legal document, the text that matters is almost never at the start or the end.
(Volume) More words means worse thinking. A 2025 research paper (Du et al.) gave models perfect access to all the information they needed and then varied how much surrounding text was in the document. Reasoning accuracy fell by up to 85% as the document got longer — not because anything was missing, not because the model couldn't find the answer, but simply because of volume. A longer document makes the model a worse “thinker”, regardless of what is in it.
(Capacity) The effective context window is smaller than advertised. NVIDIA's RULER benchmark (2024) tested models on real reasoning tasks — tracing how provisions interact, following cross-references, aggregating across clauses — and found that despite near-perfect scores on simple retrieval, performance dropped sharply as context grew. The effective usable context length for most models sits at 50 to 65% of the token limit. The number on the box is not the number that matters for legal work.
(Overflow) When the context is too long, the instructions get thrown out first. Most legal AI tools operate with a set of instructions telling the model what to do — apply this jurisdiction, flag these issues, follow these rules, etc. When a document exceeds the model's context window, some systems quietly delete the oldest content to make room. That oldest content is typically the instructions. The model keeps going, producing confident output, with no indication that it has forgotten what it was instructed to do.
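To make this failure mode concrete, here is a minimal sketch of oldest-first token eviction. The function name and the 8,000-token budget are our own illustration, not any vendor's actual implementation, but the shape of the logic is the point: the instructions arrive first, so they are the first thing silently discarded.

```python
# Illustrative sketch only: names and numbers are hypothetical, not a real API.

def truncate_oldest_first(tokens, budget):
    """Keep only the most recent `budget` tokens, evicting the oldest."""
    return tokens[-budget:]

instructions = ["INSTR"] * 500   # "apply this jurisdiction, flag these issues..."
document = ["DOC"] * 9500        # a long contract, tokenised

# Instructions come first in the prompt, so they are the oldest tokens.
prompt = instructions + document
kept = truncate_oldest_first(prompt, budget=8000)

print("instruction tokens kept:", kept.count("INSTR"))  # 0: silently discarded
print("document tokens kept:", kept.count("DOC"))       # 8000
```

The model receives 8,000 tokens of contract and zero tokens of instructions, and nothing in the output signals that anything was dropped.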
(Compression) Summarising a long document destroys the parts that matter most to lawyers. One common workaround to the context window limitation is to compress or summarise a document before the model reads it. The problem is that compression works by removing language that looks repetitive or formulaic — and legal documents are full of words that look formulaic but are legally critical. "Subject to," "notwithstanding the foregoing," "except as provided in Section K" get stripped out. Research found that over 80% of compressed summaries contained factual errors, concentrated precisely in this kind of conditional language (Pagnoni et al., FRANK, NAACL 2021). The summary keeps the headline, but it may lose the carve-out that changes what the headline means.
So on a long document, instead of working at 95% factual accuracy, the AI might give you something that is 60% factually accurate. The model does not tell you it has suffered context rot. The output does not look 60% complete. It looks confident, well-formatted, and internally coherent.
Now, imagine the law firm QA machine again — the junior hands the AI output to the mid-level, and the mid-level hands it up to the partner. The partner applies their pattern-matching, their market knowledge, their client instinct… and they lift the analytical quality, as they always have. However, the partner is the furthest from the actual words of the documents. They are working on a piece of legal work built from a 60% factually accurate base, where the statistical predictions of the LLM make the output feel complete and coherent, because the model fills the gaps with confident-sounding invented content drawn from its training, not from the source document.
The law firm QA machine is running, but it is working on broken inputs.
The “footnote trap”
We noticed an analogous problem back in 2023.
When Gen AI tools first came to legal, many of them inserted citations — hyperlinks, footnotes, source references — that seemed to show their reasoning. “We concluded X because we found this text in this document.”
The citations looked authoritative. They looked like rigour. So people took them as truth and often did not click through to check. The appearance of a citation became a substitute for verifying it, and in practice this led to higher error rates, because the appearance of authority suppressed the instinct to check. As of March 2026, there are over 1,200 documented cases worldwide of AI-generated hallucinated content entering court proceedings (Charlotin, AI Hallucination Cases Database, 2026).
Only through hard experience — hallucinated cases, fabricated precedents, citations that led nowhere — did the legal profession roll out training that taught lawyers to verify everything, regardless of how legitimate the footnote looked.
Context rot is the footnote trap at a deeper level.
Context rot is not a cosmetic failure — a missing citation, a wrong date. It is a structural failure in the factual foundation of the work. And it produces output that is much harder to interrogate than a suspicious footnote, because there is nothing suspicious about it. It does not look wrong. It looks finished.
Stanford's RegLab found that when asked direct, verifiable questions about federal court cases, language models hallucinate between 69% and 88% of the time (Dahl et al., Large Legal Fictions, Journal of Legal Analysis, 2024).
Commercial RAG-based legal tools — the products specifically built to address this problem — still hallucinate on 17 to 33% of queries (Magesh et al., Hallucination-Free?, Journal of Empirical Legal Studies, 2025).
For multi-clause compliance reasoning across real contracts — tracking how an obligation in clause 7 is qualified by a carve-out in Schedule 3 which incorporates a definition from Article 1 — state-of-the-art models achieve 34 to 57% accuracy (Singh et al., COMPACT/ACE benchmark, EACL 2026).
These are not fringe failures. They are the norm.
Today’s QA machine cannot fix this
The law firm's verification infrastructure is sophisticated and largely functional. It is also completely blind to context rot.
Partners know what juniors get wrong. Mid-levels know where to look when reviewing a junior's draft. These instincts are calibrated to human error patterns — omissions from inexperience, misjudgements from lack of context, gaps from time pressure.
AI error signatures are different. The failure is positional — it falls in the middle of documents, where attention is thinnest. It is structural — it drops when the context window fills and oldest tokens are silently discarded. It is architectural — reasoning degrades under token pressure independent of whether the right information was retrieved. None of these failure modes look like anything a partner has been trained to catch, because none of them existed before 2023.
There is also a bypass problem. AI output enters the review workflow at the wrong level. It looks finished — polished, confident, well-structured — so it gets treated as finished. The junior-level review that was supposed to catch line-by-line completeness issues gets skipped. The partner reviews the analysis. Nobody re-reads the source document against the AI's output the way a mid-level would have reviewed a junior's draft. The machine is running. The input bypassed the first two filters.
The alarm system was not wired for this.
The electric motor, not the steam engine
We believe there is a way through this, but it requires a different mental model for where Gen AI belongs in legal workflows.
A lot of the current conversation treats AI like electricity — a universal layer that powers everything once you plug it in. In our opinion, this is the wrong analogy. AI is more like the electric motor.
When the electric motor was first invented, some factories tried to replace their steam engines directly with electric motors. It did not work well. The factory was optimised for steam. The workflows, the layout, the entire production logic had been built around what steam could do. Replacing the steam engine with an electric motor and expecting transformation produced marginal gains at best and new failure modes at worst.
The factories that got it right eventually did something different. They kept the steam engine for the bulk work it was good at, and inserted electric motors precisely where they were needed — small, discrete tasks where the motor's specific characteristics made a difference. The transformation came not from replacing the old engine, but from redesigning the workflow around what the new tool actually did well.
Gen AI is not a steam engine replacement. It is extraordinarily good at reasoning over bounded, high-signal, well-structured contexts. It degrades systematically when asked to navigate large, unstructured documents under token pressure.
The legal workflows that will get this right are the ones that understand the failure modes precisely enough to deploy the tool where it belongs — and to redesign the inputs accordingly:
Feed the model precisely assembled, task-relevant text rather than raw documents.
Make each question discrete.
Break large documents into their component parts before the model reads them, so it reasons over exactly what is relevant rather than navigating everything at once.
Stitch the analysis back together at the workflow level rather than trusting the model to hold 200 pages in reliable attention.
The law firm QA machine needs to change.
The law firm’s goal has not changed
Lawyers give legal advice. They help clients navigate legal problems. That has not changed, and it probably will not change.
What has changed is that tools are being inserted into the workflows — new tools that fail in ways that are different, that are silent, systematic, and invisible to the verification infrastructure built to catch failures in legal work.
We think that, in 2026, the question has moved on from whether or not to use these Gen AI tools. The question is whether the lawyers and the builders understand Gen AI well enough to use it safely.
The firms chatting with us — the ones running into these quality ceilings — are not failing because they adopted AI. They are running into design problems: constraints that were not apparent until stress-tested in reality.
Silent but deadly is not just a fart joke… it is a key failure that needs a system overhaul.
