How AI extraction works

The philosophy and engineering behind minting data from documents.

This started with a problem that has nothing to do with AI.

Ask any statistician and they’ll tell you: a lot of published research has surprisingly basic statistical errors. Statistics is one of those crafts where the early skill is knowing what you don’t yet know — and most researchers don’t get taught that part. We hope peer review catches it. But there just aren’t enough statistical reviewers for the volume of research being published. And the task itself is grand in scope — well-defined only for experienced reviewers who’ve spent years inside a particular kind of study design.

After getting tenure in 2017, I started thinking about this. It’s the kind of problem that’s sticky for statisticians — because you can’t crack it with more mathematics. It’s a problem about communication. About instructions. About what happens when you ask someone to “statistically review this paper” without specifying what that means.

Reliable extraction is a solved problem — in about three dozen pieces.

When hallucinations stop being the problem, disagreements become the signal.

Researchers have worked out a few dozen techniques for reliable extraction — layout preservation, evidence grounding, structured schemas, multi-reader consensus, and more. We’ve studied all of them and crafted what we believe is the most rigorous combination available, engineered to work together across text, visual documents, and degraded scans.
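At its simplest, multi-reader consensus means accepting a value only when independent readers converge on it, and flagging the field when they don't. Here is a toy sketch in Python; the field names, reader outputs, and agreement threshold are all invented for illustration, not the mint's actual machinery:

```python
from collections import Counter

def consensus(readings, min_agreement=2):
    """Accept a field's value only when at least `min_agreement`
    readers agree; otherwise flag the field for review."""
    resolved, flagged = {}, {}
    fields = {f for r in readings for f in r}
    for fld in sorted(fields):
        votes = Counter(r[fld] for r in readings if fld in r)
        value, count = votes.most_common(1)[0]
        if count >= min_agreement:
            resolved[fld] = value
        else:
            flagged[fld] = dict(votes)
    return resolved, flagged

# Three hypothetical readers, same document, same codebook:
readers = [
    {"sample_size": 120, "design": "RCT"},
    {"sample_size": 120, "design": "cohort"},
    {"sample_size": 120, "design": "case-control"},
]
agreed, disputed = consensus(readers)
# agreed keeps sample_size; the split vote on design lands in disputed
```

The useful part is the second return value: what the readers couldn't agree on is exactly what gets surfaced rather than silently averaged away.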

Once you’ve done all that — once hallucinations are no longer the problem — you discover what the real challenge was all along. Two AI readers disagree. Same codebook, same document. And neither is wrong. They just read it differently — because the codebook left room to.

Validation is a relationship you build with your codebook.

Most tools treat validation as a step at the end — run a check, get a score, move on. A score tells you something, but it doesn’t tell you where the codebook is strong and where it’s fragile.

The minting process does. You draft a codebook. You mint. You inspect the results — not just whether values are right, but where multiple AI readers disagreed. Each reader writes down its reasoning. An arbiter reads those rationales side by side and traces the disagreement back to the piece of the codebook that caused it. The disagreement becomes a diagnosis of the codebook.
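The arbiter step can be sketched in miniature: if each reader records not just a value but its reasoning and the codebook rule it applied, then conflicting values can be grouped by the rule that was cited, turning a disagreement into a pointer back at the codebook. Everything below (field names, rules, rationales) is hypothetical:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Reading:
    field: str
    value: str
    rationale: str   # the reader's written reasoning
    rule: str        # the codebook rule the reader says it applied

def diagnose(readings):
    """Toy arbiter: for each field where readers disagree, report
    which codebook rules were cited, i.e. where the ambiguity lives."""
    by_field = defaultdict(list)
    for r in readings:
        by_field[r.field].append(r)
    report = {}
    for fld, rs in by_field.items():
        if len({r.value for r in rs}) > 1:   # readers disagree
            report[fld] = sorted({r.rule for r in rs})
    return report

readings = [
    Reading("outcome", "mortality", "abstract names mortality first", "Q4: primary outcome"),
    Reading("outcome", "readmission", "methods pre-registered readmission", "Q4: primary outcome"),
    Reading("year", "1998", "from the title page", "Q1: publication year"),
    Reading("year", "1998", "from the copyright line", "Q1: publication year"),
]
# diagnose(readings) points the outcome disagreement at rule Q4
```

A real arbiter would reason over the rationales themselves rather than just group them, but the shape of the output is the same: not "one reader was wrong," but "this rule left room to read it two ways."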

You refine. You re-mint. Some ambiguities resolve. New ones surface. You learn which questions your codebook handles cleanly and which ones it struggles with. Then you refine again. And again — as many rounds as the work still rewards. Each pass sharpens what the codebook handles, and exposes what it still doesn’t.

Over time, you come to know your codebook the way you know any tool you’ve worked with carefully. You know its reach and its limits. You know when to trust it and when to double-check. That knowledge — earned through use, not granted by a metric — is validation.

This is why we don’t show you a single accuracy number and call it done. A number collapses everything into one dimension. The relationship you build with your codebook is richer than that. It’s knowing that questions 1 through 8 are rock solid, that question 9 needs a human eye on historical documents, and that question 10 works beautifully now but didn’t until the third round of refinement. That’s what it means to trust your data.

Minted data carries its provenance — the codebook that shaped it, the process that produced it. And once you trust your codebook, that’s something you can share. A codebook you’ve refined through use carries that refinement with it. A colleague can pick it up, run it on their own collection of documents, and either the codebook handles that new material well — which is replication — or it surfaces new ambiguities that your documents never triggered, which makes the codebook stronger. Trust compounds through sharing. That’s a layer of validation you can’t produce alone.

The codebook is your interface to the mint.

The mint itself is intricate. Routing logic, dozens of model calls per document, context windows, layout reconstruction, verification passes, consensus machinery, deterministic validators — a lot of scaffolding between your documents and your answers. You don’t see any of it.

The codebook is the interface. You work on your research questions — what you want to extract, how to define it, how to handle the edge cases. The mint handles everything else. That separation is the point. Research questions are what you know deeply; AI engineering is what we know deeply. We built the mint so you never have to switch between them.

The mint has many layers of scaffolding. One of them — bloom — is deliberately visible.

Scientific documents often carry their argument in flow charts, diagrams, and tables — visual elements that matter but that language models can’t read directly. When you run a scientific mint, the mint blooms your documents first: a vision model reads each page, extracts the visual elements, and re-expresses them as structured plain text. The readers then reason over the bloomed text alongside the rest of the document. Vision models keep improving, so this piece keeps getting better.

You don’t have to think about bloom — it’s part of the scientific-mint process. But its output is inspectable, page by page. You can see what the vision model pulled out of each figure before the readers run, and confirm the conversion was faithful. When a diagram is carrying a key variable in your research, that visibility matters.
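One way to picture what page-by-page inspectable bloom output could look like: each page carries a list of visual elements, each re-expressed as plain text, rendered in order for review. The structure and names below are invented for illustration, not the mint's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class BloomedElement:
    kind: str       # e.g. "table", "flow_chart", "diagram"
    caption: str
    as_text: str    # the vision model's plain-text re-expression

@dataclass
class BloomedPage:
    page: int
    elements: list = field(default_factory=list)

def inspect(pages):
    """Render bloomed output page by page, so the conversion can be
    checked against the original figures before the readers run."""
    lines = []
    for p in sorted(pages, key=lambda pg: pg.page):
        lines.append(f"-- page {p.page} --")
        for e in p.elements:
            lines.append(f"[{e.kind}] {e.caption}")
            lines.append(e.as_text)
    return "\n".join(lines)

pages = [BloomedPage(2, [BloomedElement(
    "flow_chart", "Participant flow",
    "screened: 412 -> excluded: 61 -> randomized: 351")])]
print(inspect(pages))
```

The point of the flat-text rendering is that a human (or a deterministic check) can read it next to the original figure and confirm nothing was invented or dropped.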

What we kept, and what we didn’t.

No single technique is sufficient. The challenge is combining them into a coherent process that works across clean text, scanned pages, degraded historical records, and visual documents.

And the literature is less settled than it looks. A technique that helps on clean English text can hurt on scanned historical records. A trick that worked with smaller models can be dead weight in larger ones. Move from text to vision and half the assumptions shift. Part of building the mint has been figuring out which techniques generalize, which need to be parameterized per task, and which turn out to make things worse in our context.

A fuller catalog — eight categories of technique and what each one does — lives on its own page.

See the techniques catalog →

The path forward

Documents resist. Codebooks improve. The data gets sharper. And a codebook refined through use embodies everything you learned — the decisions you made, the ambiguities you resolved, the edge cases you learned to handle. Share it, and another researcher inherits your clarity — and refines it further for their documents and their questions.

Reproducibility as a principle and a practice.