The Case for Deterministic Data Before AI
(The following is the first in a series of posts written by one of our lead engineers, Peter Naoum. Peter has over a decade of experience building data infrastructure and backend systems across fintech, IoT, and legal tech. He's been working on text data pipelines since before LLMs were a household name.)
The Data Layer
Every week there's a new model, a new benchmark, or a new claim. Somewhere in all this noise, many businesses are doing something risky: they're throwing their data at AI and expecting it to magically answer questions, summarize accurately, rewrite documents, spot risks, and flag issues.
At Syntheia, we took a deliberate step back from the noise a few years ago. Anticipating the hype, we asked ourselves: what are we actually good at?
The answer was the same as it's always been — data. Specifically, making text structured, reliable, and meaningful before humans or AI ever touch it. In our opinion, that's where the real leverage is, and that's what this post is about.
Pilot vs Co-Pilot
Imagine you're about to fly. The airline is reputable and the pilot is experienced and highly capable. Then, just before you board, someone mentions the pilot's been having health issues — nothing serious, just a known small chance of a stroke mid-flight, say 1%. You are almost certainly not getting on that plane. And if somehow you did, you would have your eyes glued to the cockpit the entire flight.
And what if it isn’t the pilot with the health issues, but the co-pilot?
That's exactly how we think about AI at Syntheia — and how we think every business should think about it.
If AI is operating an airplane, it should be the co-pilot, not the pilot. The pilot should give it clear instructions, set the right context, and then thoroughly review the outputs.
These operating principles matter as much for legal teams as they do in engineering. We know AI makes mistakes: hallucination rates have improved, but hallucinations haven't gone away. If AI is guiding a business decision or flagging a legal risk, you need to know exactly what to trust and what to scrutinize.
“Mostly right, most of the time” isn't good enough when the stakes carry real consequences.
If You Torture Data, It Will Confess
Legal documents are written by humans, reviewed by humans, and expected to be read by humans. This means they're full of inconsistency, implicit context, and formatting that doesn’t make sense unless you already know what you're looking at.
Push messy data into an AI and it will give you an answer. It just might not be the *right* answer.
This is where Syntheia decided to play. Over years of working with contracts, we have built an understanding of how these documents are structured — and how to handle them when they get weird. What we call ingestion in our system isn't simply parsing text. For Syntheia, “ingestion” comprises three distinct and important steps:
Extracting the text — word for word, preserving fidelity to the source
Chunking it into meaningful atomic blocks — sections, clauses, schedules, annexures, etc.
Enriching each chunk with metadata — keywords, jurisdictions, dates, parties, clause types, etc.
Critically, this pipeline is built to be deterministic. We deliberately do not use any generative AI in our ingestion pipeline. The same document produces the same output every single time. No guessing, no variance, no surprises.
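The three steps above, and the determinism property, can be sketched in a few lines. This is our own illustrative Python, not Syntheia's code: the chunking rule and metadata fields are deliberately simplistic stand-ins for what a production pipeline would do.

```python
import hashlib
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    ref: str                  # clause/section reference, e.g. "3.1"
    text: str
    metadata: dict = field(default_factory=dict)

def extract_text(raw: str) -> str:
    # Step 1: preserve the wording; only collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", raw).strip()

def chunk(text: str) -> list[Chunk]:
    # Step 2: split on numbered headings (a toy rule; real documents need
    # handling for schedules, annexures, definitions, and the weird cases).
    parts = re.split(r"(?m)^(\d+(?:\.\d+)*)\s+", text)
    return [Chunk(ref=parts[i], text=parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

def enrich(c: Chunk) -> Chunk:
    # Step 3: rule-based metadata only, with no generative model anywhere,
    # so the same input always yields identical output.
    c.metadata["keywords"] = sorted(
        {w.lower() for w in re.findall(r"[A-Za-z]{6,}", c.text)})
    c.metadata["fingerprint"] = hashlib.sha256(c.text.encode()).hexdigest()[:12]
    return c

def ingest(raw: str) -> list[Chunk]:
    return [enrich(c) for c in chunk(extract_text(raw))]
```

Because every step is a pure function of its input, running `ingest` twice on the same document produces the same chunks, keywords, and fingerprints, which is the property the paragraph above is describing.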
Every case we handle was painstakingly built around actual documents — not assumptions about what legal documents might look like, but what they actually look like in practice, including the messy ones.
Less is More
We started working on contract ingestion back in 2018. When generative AI took the limelight, we discovered that our deterministic ingestion pipeline unlocked something surprising: instead of loading an entire contract into an LLM, battling both context window limits and context rot, and hoping the model finds the right text, we can pass the model very little text (just the right piece) and get better performance.
This is what drove the development of our “Document Index” product — documents are transformed into a compact, structured index file that describes the architecture and content of a document. It captures clause references, keywords, titles, parent-child relationships between sections, and other signals that tell a model exactly where to look without needing to read everything.
The benefits are concrete:
More flexible and accurate retrieval. The LLM decides the exact target text to retrieve — along with its children and related clauses — rather than scanning a sea of text. This matters even more at scale, where performance tends to degrade as context windows fill up.
Dramatically lower token usage. You're not passing the full document set into context. You're passing what's relevant. That's faster, cheaper, and more reliable.
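In flow terms, retrieval against an index looks roughly like the toy sketch below. The dicts and clause text are invented for illustration, and the model call itself is elided; the model would be shown only the index and asked to name a target clause.

```python
# A miniature index and clause store. In practice the model is shown only
# the index (refs, titles, keywords) and asked to name the target clause.
index = {
    "12":   {"title": "Termination", "children": ["12.1", "12.2"]},
    "12.1": {"title": "Termination for cause", "children": []},
    "12.2": {"title": "Termination for convenience", "children": []},
    "30":   {"title": "Governing law", "children": []},
}
clauses = {
    "12":   "Either party may terminate this Agreement as set out below.",
    "12.1": "A party may terminate immediately on material breach.",
    "12.2": "The Customer may terminate on thirty days' written notice.",
    "30":   "This Agreement is governed by the laws of England and Wales.",
}

def context_for(target: str) -> str:
    # Pass only the chosen clause and its children into the model's context.
    refs = [target] + index[target]["children"]
    return "\n".join(f"{r} {index[r]['title']}: {clauses[r]}" for r in refs)

# Suppose the model, reading only the index, picks "12" for a termination
# question: the context it then receives is three clauses, not the contract.
snippet = context_for("12")
```

On a four-clause toy this saves one clause; on a 200-page contract set, retrieving one clause and its children instead of the full text is where the token and accuracy gains come from.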
We don’t offer a bigger model or a better prompt. We did the hard, unglamorous work of understanding and preparing the data before the AI touches it. The insight is that small, well-structured, relevant input into an LLM beats large, noisy input every time.
Every. Single. Time.
Certainty Is the Product
Most of the AI conversation right now is about what's possible. New capabilities, new benchmarks, new frontiers. That's exciting, and we're watching it closely.
Our focus is on what's provable. Because in legal, certainty isn't an optional nice-to-have.
Our bet is that the businesses and legal teams who will get the most from AI aren't the ones who throw the most data at the best model. They will be the ones who invest in the infrastructure underneath — structured ingestion, deterministic outputs, documents that are not just AI-ready but AI-friendly.
That's the foundation we've been building since 2018 — and it's what our Document Index product is built on. If you want to see what certainty looks like in practice, please get in touch with us.
