It has always been a chunking problem (and it always will be)

Over the past couple of months, we have had a lot of conversations with law firms who are increasingly impressed with their AI platforms, and a handful of conversations with law firms who are growing quietly frustrated with theirs. Opinion is diverging.

For those who are frustrated, “quietly” is the right word. Nobody is putting out press releases about this. The frustration tends to surface in private conversations, and always after the initial deployment and rollout period has given way to the reality of everyday use. The Gen AI platforms work. They are impressive for many things. But lawyers keep running into ceilings.

We wrote about the 60% problem last year. The number seems to have moved. The foundation models have gotten better. But the ceiling has not gone away, and for top-tier firms doing complex transactional work, it is now perhaps a “75% problem”. And 75% is still the difference between a useful starting point and something you simply cannot send to a client. If a lawyer cannot trust that the AI has caught everything, or that what it caught is correct in context, they might as well start from scratch. And that is exactly what many lawyers end up doing.

The specific task that crystallised this for us was the “issues list”.

What is the problem with issues lists?

For those outside the transactional world, an issues list is what a lawyer prepares after reviewing deal contracts — a structured summary of everything that needs a decision before the deal can move forward. It sounds like a summary, but an issues list is really a synthesis of three distinct types of analysis happening simultaneously.

  1. The first is whether the document reflects the client's instructions. Does this draft actually do what the client told us they wanted? This requires keeping the client's position in mind while reading every provision — which means the client's instructions need to be in the lawyer's head, or in the LLM’s context, while the review happens.

  2. The second is internal consistency. Does the document make sense on its own terms? Are there provisions that contradict each other, defined terms used inconsistently, cross-references that point to the wrong clause? LLMs are actually quite good at this, provided the document is not too long. For shorter agreements — NDAs, simple commercial contracts — AI platforms handle internal consistency questions reasonably well these days.

  3. The third is reference comparison. How does this draft compare to the last version? How does it compare to what the firm has seen in similar deals? This is where things get harder, because it requires having done the comparison before the issues list can be written, and it requires understanding not just what changed but whether the changes matter — whether they represent acceptable movement or genuine problems.

About two months ago, a firm we work with came to us with exactly this problem. They had been trying to use a general AI platform to produce issues lists from blacklined agreements. They found that Gen AI worked reasonably well for shorter documents, but on anything long, say a 100-page document, the platform would start to miss things. It did not fail completely, but it failed in specific and risky ways: it would produce a plausible-looking issues list that missed changes, or flagged terms that weren’t changed, or identified a change without understanding what the change meant because it did not have the context to interpret it. Lawyers could not use the output, because they could not trust the output, and they preferred to read the whole document themselves.

We got pulled into this conversation, and we are now working on a solution together — one where we break down the contracts into their component provisions first, and that structured data is then fed into the AI platform alongside the firm's own knowledge and prompting.
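To make the idea of “component provisions as structured data” concrete, here is a minimal sketch of what such a record might look like. All field names, clause numbers, and text are invented for illustration; this is not an actual production schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record for one provision extracted from a contract.
@dataclass
class Provision:
    clause_number: str                 # e.g. "7.2"
    heading: str                       # e.g. "No MAE"
    text: str                          # the provision's plain text
    parent: Optional[str] = None       # enclosing clause, e.g. "7"
    cross_refs: list = field(default_factory=list)  # clauses it cites

# A contract becomes a list of provisions rather than one text blob,
# so retrieval can pull exactly the clauses a question implicates.
contract = [
    Provision("1.1", "Definitions", '"Material Adverse Effect" means ...'),
    Provision("7", "Warranties", "The Seller warrants that ..."),
    Provision("7.2", "No MAE",
              "Since the Accounts Date, no Material Adverse Effect ...",
              parent="7", cross_refs=["1.1"]),
]

# Retrieval by structure, not by page: fetch a clause plus what it cites.
def with_cross_refs(clauses, target):
    by_number = {p.clause_number: p for p in clauses}
    hit = by_number[target]
    return [hit] + [by_number[r] for r in hit.cross_refs if r in by_number]

print([p.clause_number for p in with_cross_refs(contract, "7.2")])  # ['7.2', '1.1']
```

The point of the sketch is the retrieval step: asking about clause 7.2 automatically brings the definition it depends on into scope.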

Working through this problem helped us realize something about the market that we think has been missed.

Solving Retrieval and Generation

The market has been trying to solve problems with LLMs by improving either “retrieval” or “generation”.

  • On one side, there is the generation camp. The bet here is that better models solve everything. More capable reasoning. Better fine-tuning on legal language. More sophisticated understanding of legal concepts buried in the model's weights. This has been real progress, and we do not want to understate it — the models have gotten remarkably better.

  • On the other side, there is the retrieval camp. Better ways to find relevant documents. Vector databases. Hybrid search. Reranking. Graph databases. Sophisticated RAG pipelines that retrieve relevant results and present them to the model. This is also real engineering, and some of it is genuinely impressive.

The problem is that both camps are treating retrieval and generation as separate problems to be optimized independently. They are not separate. For specific, context-heavy legal tasks, they are one problem.

The quality of what a model generates is determined entirely by what is in its context window at the moment of generation. You can have the most capable model in the world, but if you feed it the wrong information — too much, too little, the wrong unit, stripped of the context that gives it meaning — the output will be wrong. Not randomly wrong. Confidently, plausibly, specifically wrong in ways that are hard to detect.

This is why the framing of retrieval and generation as separate camps is so misleading. Retrieval determines what goes into the context window. The context window determines what the model “knows”. The model's output is a function of that context window. You cannot improve the output by optimising the model in isolation from what you are feeding it. No amount of prompting can save you from not having the right information in the context window. We are seeing this directly, in production, with the issues list.

The problem was not the model. The problem was that there was no good way to feed the model the right chunks of the documents at the right time.

How Do Lawyers Actually Read?

There is a thing that every transactional lawyer does automatically and invisibly, and that we think many of the Gen AI platforms and builders are missing… When a lawyer reads a contract, they do not read it from start to finish. Contracts are not written to be read from start to finish. They are written as an ensemble of interdependent provisions, each of which is triggered under certain conditions, each of which qualifies or is qualified by others. The lawyer navigates this system. They jump around. They follow cross-references. They hold defined terms in memory and apply them as they encounter subsequent provisions. They “unfurl” the contract as they go — not reading the words sequentially, but assembling the logic from its component pieces.

When a lawyer comes across "Material Adverse Effect" in clause 7.2, they flip back to the definitions article and remind themselves of exactly what that term means in this specific agreement, negotiated by these specific parties, for this specific transaction. Then, they apply that precise meaning to clause 7.2 and understand what it actually does.

This is not a “sophisticated” skill you learn from three years of law school. It is just what reading a contract means. This type of page-flipping reading is so fundamental to lawyers that it doesn’t really even have a name (though, internally, we call it “unfurling”).
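Mechanically, “unfurling” amounts to resolving a clause’s dependencies before reading it. A toy sketch of that flip-back-to-the-definitions move, with invented clause numbers and text:

```python
# Toy sketch of "unfurling": when building the context for one clause,
# pull in the definitions its text relies on. All content is invented.
definitions = {
    "Material Adverse Effect": (
        "1.1",
        '"Material Adverse Effect" means any event that materially ...',
    ),
}
clauses = {
    "7.2": "Since the Accounts Date there has been no Material Adverse Effect.",
}

def unfurl(clause_number):
    """Return the clause text plus the definitions it depends on."""
    text = clauses[clause_number]
    context = [f"Clause {clause_number}: {text}"]
    for term, (def_clause, def_text) in definitions.items():
        if term in text:  # naive term spotting, for illustration only
            context.append(f"Clause {def_clause}: {def_text}")
    return "\n".join(context)

print(unfurl("7.2"))
```

A real system would need smarter term spotting than substring matching, but the shape is the same: the context for clause 7.2 is the clause plus the negotiated meaning of every defined term it uses.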

We think a lot of legal AI platforms skip this step.

We think they skipped this step because it was invisible and assumed. The industry extracts text from PDFs. The industry computes embeddings for the text. The industry finds similar chunks in retrieval pipelines. The industry feeds search results to LLMs. But it seems most people do not replicate the thing lawyers do with documents — deconstruct the document into its logical building blocks, and then retrieve and assemble only the right blocks for answering a specific question.

Contracts were never meant to be fed to a model whole. They were never meant to be chunked by page number or token count either. When you feed a 100-page agreement to a model as a block of text, you are asking the model to navigate a system that was designed to be navigated by someone who already understands its structure. When you chunk it by page, you are cutting across the logical boundaries of provisions in ways that destroy meaning. The deletion of a cross-reference to clause 4.3 is consequential — but you cannot understand why unless you know what clause 4.3 says.
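The difference between cutting on logical boundaries and cutting on arbitrary ones can be shown in a few lines. A hedged sketch, with invented contract text:

```python
# Contrast fixed-size chunking with provision-boundary chunking.
# The sample provisions are invented for illustration.
provisions = [
    "4.3 Limitation of Liability. The aggregate liability of the Seller "
    "shall not exceed the Cap.",
    "9.1 Termination. Either party may terminate subject to clause 4.3.",
]
document = " ".join(provisions)

# Naive chunking: cut every N characters, regardless of meaning.
def chunk_by_size(text, size=60):
    return [text[i:i + size] for i in range(0, len(text), size)]

# Structural chunking: one chunk per provision, boundaries preserved.
def chunk_by_provision(provs):
    return list(provs)

naive = chunk_by_size(document)
structured = chunk_by_provision(provisions)

# The naive chunks split clause 4.3 mid-sentence; the structural chunks
# keep each provision, and its clause number, intact.
print(len(naive), len(structured))
```

With fixed-size chunks, a retriever that matches the words “liability” or “Cap” may surface a fragment with no clause number attached; with provision chunks, every hit carries its own identity and boundaries.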

Contracts were never written to be read from start to finish. They were written as an ensemble of provisions that connect and qualify each other. Legal AI needs to read them the same way.

The Context Window and the Provision

It is worth being emphatic about this.

The context window is not a technical parameter. It is the entirety of what a language model can see at any given moment. Every inference, every answer, every issues list, every due diligence finding — all of it is produced from what is in the context window. The model cannot reason about what is not in there. It cannot retrieve missing context from somewhere else. It works with what it is given.

This means that the decision of what to put in the context window — what to retrieve, in what form, carrying what structural information — is the most important design decision in the entire stack. More important than which model you use. More important than how you prompt it. If you get the context wrong, a better model just produces a more confident wrong answer.

The question of what belongs in the context window for a given legal task does not have a fixed answer.

  • For the issues list, you need the changed provisions plus the parent clauses that give them their structural meaning plus the cross-referenced provisions that they implicate. You need enough context to understand why each change matters, but not so much surrounding text that the signal drowns.

  • For due diligence on change of control clauses, you need every provision of that type across the entire portfolio, each with enough context to be correctly interpreted, none with so much surrounding agreement text that the model loses focus.

  • For benchmarking a liability cap, you need comparable provisions from comparable transactions — the right unit at the right granularity, stripped of everything irrelevant.

Different task. Different context. Different assembly of provisions. What stays constant is that the unit of assembly is the provision — not the page, not the token window, not the paragraph… it is all about the provision: the discrete piece of legal meaning that lawyers have used as their working unit for as long as contracts have existed.
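One way to read the examples above is that context assembly becomes a function of the task over a shared provision store. A sketch with invented task names and filtering rules:

```python
# Sketch: the same provision store, assembled differently per task.
# Task names, provision types, and rules are invented for illustration.
provisions = [
    {"clause": "7.2",  "type": "warranty",          "changed": True,  "text": "..."},
    {"clause": "12.1", "type": "change_of_control", "changed": False, "text": "..."},
    {"clause": "14.3", "type": "liability_cap",     "changed": True,  "text": "..."},
]

def assemble_context(task, provs):
    """Pick the provisions that belong in the context window for a task."""
    if task == "issues_list":
        return [p for p in provs if p["changed"]]
    if task == "change_of_control_diligence":
        return [p for p in provs if p["type"] == "change_of_control"]
    if task == "liability_cap_benchmark":
        return [p for p in provs if p["type"] == "liability_cap"]
    raise ValueError(f"unknown task: {task}")

print([p["clause"] for p in assemble_context("issues_list", provisions)])  # ['7.2', '14.3']
```

A production system would also pull in parents and cross-references for each selected provision; the point here is only that the selection logic varies per task while the unit being selected stays the same.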

Nothing Plain about Plain Text

We have thought carefully about what the right data structure is for storing these provisions so they can be retrieved correctly and assembled into the right context windows.

Our answer is plain text, hierarchically structured, with metadata that preserves each provision's position in the document, its type, its cross-references, and its location in the source. Not a proprietary format. Not a specialised embedding model. Not a knowledge graph.

Plain text.

  • Plain text is usable by any model, any retrieval system, any search technology — now and with whatever comes next.

  • It is readable by a human lawyer who needs to check what the system retrieved.

  • It is auditable: you can see exactly what went into the context window and trace any output back to its source.

  • It is reversible: a provision extracted from a document as structured plain text can be operated on and then reassembled back into a correctly formatted document.
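As a concrete illustration of those properties, here is our own toy rendering of one provision as structured plain text with metadata, together with a round trip that demonstrates reversibility. The field layout is invented, not an actual format.

```python
# Toy rendering of one provision as plain text with metadata headers.
# The layout and field names are illustrative only.
record = """\
[provision]
document: spa_draft_v4.docx
path: 7 > 7.2
type: warranty
cross_refs: 1.1
---
Since the Accounts Date, no Material Adverse Effect has occurred.
"""

def parse(text):
    """Split a record into a metadata dict and the provision body."""
    header, body = text.split("---\n", 1)
    meta = dict(
        line.split(": ", 1)
        for line in header.splitlines()
        if ": " in line
    )
    return meta, body.strip()

def serialize(meta, body):
    """Re-emit the record from its parsed parts."""
    header = "[provision]\n" + "".join(f"{k}: {v}\n" for k, v in meta.items())
    return header + "---\n" + body + "\n"

meta, body = parse(record)
# Round trip: parse then re-serialize reproduces the record exactly.
assert serialize(meta, body) == record
print(meta["path"])  # 7 > 7.2
```

Everything in the record is readable by a lawyer, searchable by any tool, and traceable back to its position in the source document — which is the whole argument for plain text.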

The model can only do as well as what you put in its context window. And what you put in its context window is only as good as how the documents were deconstructed and stored in the first place.

This is the design decision in context engineering that has been most underinvested in. Everyone has been so focused on retrieval versus generation that they have missed the thing that sits underneath both — the data structure that makes retrieval meaningful and generation reliable.

Solving the Problem at the Root

Someone said something to us at the beginning of this Gen AI wave, and we keep coming back to it:

This has always been a chunking problem. And it will always be a chunking problem.

Context windows will get larger. Models will get smarter. Retrieval techniques will get more sophisticated. None of that changes the fundamental requirement: what are the right pieces, for this task, assembled in what order, carrying what context?

That question does not go away with scale.

Next

What Does Syntheia Sell? Why Not Gen AI?