The philosophy and engineering behind minting data from documents.
Doug Altman saw a version of this problem decades ago with clinical trials. He couldn’t make researchers run better studies, but he could try to ensure that at least everything was reported — that the methods, the outcomes, the key details were all on the page. That’s what CONSORT is: a checklist, widely adopted across clinical and biomedical research, of twenty-five specific items that every clinical trial should report. Not “is this study good?” but “did they report enough about their process to know how blinding was or wasn’t maintained?” A responsibility shared by authors, journals, editors, and reviewers. He championed it for years, got the major journals on board, and it’s still hard. Reporting quality improved, but not nearly enough.
Altman’s insight shaped how my PhD student Auden Krauska and I thought about the problem. We taught a course at UW-Madison on what we called statistical reading comprehension — not how to write about statistics, but how to read them. The receiving end of scientific communication. And we discovered something: undergrads could do surprisingly rigorous work when given specific, structured instructions. Turn judgment into reading comprehension, and the quality transforms. Clear instructions, applied consistently, produce reliable results. From undergrads. From trained reviewers. From anyone.
Then large language models arrived, and we realized: they’re exactly like good undergrads. Excellent at reading comprehension. Not trustworthy for judgment. The structured instructions we’d been developing for students were already codebooks — and codebooks scale in ways that undergrads never could.
Statistical review was our coding task — the one we knew deeply enough to write real instructions for. But the insight isn’t specific to statistics. Every field has its version: systematic reviewers extracting study characteristics, political scientists coding survey responses, legal teams classifying contract clauses, historians transcribing degraded records. Different domains, same structure. You have documents. You have questions. You need the answers to be consistent, grounded, and auditable.
And clear instructions do something else: they make the method reproducible. A codebook isn’t a prompt you typed into a chatbox and forgot. It isn’t the unwritten instructions, style, and culture implicitly shared with your human coders. It’s the complete specification of what you asked for — shareable, inspectable, testable. Another researcher can read it, run it on their own documents, and know exactly what they’re replicating.
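To make "complete specification" concrete: a codebook can be ordinary structured data. DataMint's actual format isn't shown here; this is a hypothetical sketch of what a shareable, inspectable codebook might look like, with every name (`Question`, `Codebook`, `to_spec`) invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """One item in a codebook: what to extract and how to decide."""
    name: str
    prompt: str                       # the instruction a reader follows
    answer_type: str                  # e.g. "category" or "free_text"
    options: list[str] = field(default_factory=list)       # allowed categories, if any
    edge_case_rules: list[str] = field(default_factory=list)

@dataclass
class Codebook:
    """A complete, shareable specification of what was asked for."""
    title: str
    version: str
    questions: list[Question]

    def to_spec(self) -> dict:
        """Serialize so another researcher can inspect, share, or rerun it."""
        return {
            "title": self.title,
            "version": self.version,
            "questions": [vars(q) for q in self.questions],
        }

cb = Codebook(
    title="Trial reporting checklist",
    version="0.1",
    questions=[
        Question(
            name="blinding",
            prompt="Did the paper report how blinding was maintained?",
            answer_type="category",
            options=["yes", "no", "unclear"],
            edge_case_rules=["If sections contradict, code 'unclear'."],
        )
    ],
)
```

Because the specification is plain data, "exactly what you asked for" is a file you can version, diff between refinement rounds, and hand to a colleague.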
That’s the foundation DataMint is built on. Not “AI can read documents” — everyone knows that now. The foundation is: clear instructions transform performance, for humans and machines alike. The codebook is the method. Everything else follows from how seriously you take it.
Ambiguity is endemic to reality. Every field knows this in its own way.
The job of research — at least the kind that needs a spreadsheet at the end — is to take this mess and impose enough structure that we can have simplified conversations about complex things. That’s what a codebook does — it draws the grid. And the places where careful readers disagree are the places where reality is resisting your grid.
Sometimes the right response to that resistance is to sharpen the grid. And sometimes it’s to refuse the grid entirely — to ask for free text narrative instead of a category, to sit with the complexity for longer rather than forcing a premature answer. A codebook can do both. That’s a research decision, not an extraction limitation.
The first time you mint, the readers will disagree. Often the fix is straightforward — a description that could be read two ways, a category that needs one more option, or a complex idea that is more amenable to a short plain text summary than a long list of categorical codes. You refine the codebook. You re-mint. The obvious ambiguities resolve.
But many disagreements are more interesting than that. The ambiguities are themselves discoveries — rough edges of your thought process, brought to the surface to be refined.
A reader that follows the codebook literally might miss something a human would catch. A reader that infers your intent might “help” in ways you didn’t ask for. Both are reasonable readings of the same instructions. The disagreement between them shows you where the codebook’s explicit instructions and its implicit intent diverge. The fix isn’t telling the readers “don’t infer” — that’s as useless as telling them “don’t hallucinate.” The fix is making your intent explicit enough that inference becomes unnecessary. That’s what refinement is.
Even when you don’t know which way you want the codebook to go, you still want to resolve the ambiguity — as long as it’ll show up often enough to matter. The codebook needs to take a position so the readers can be consistent.
Here’s what we’ve learned from watching this process across hundreds of codebooks: specificity shifts disagreements — it doesn’t eliminate them. You tighten a definition and the readers stop arguing about what counts. Now they argue about how many to include. You specify scope and they agree on scope. Now they disagree about a boundary case your new scope language created. The disagreements get more interesting as the codebook gets sharper. That’s not failure — that’s the codebook maturing.
It’s sometimes helpful to ask: is this disagreement coming more from the document or more from the codebook? A codebook that asks for “the intervention” when the study tested three — that’s mostly a codebook problem. One round of refinement fixes it. A document that contradicts itself across sections — that looks like a document problem. But even that can become a codebook decision: you add a “contradictory” category, or you specify which section takes precedence. The document side and the codebook side play off each other. Refinement is figuring out how much of the gap you can close.
What remains after refinement is the endemic layer. The long tail of reality being weird. Rare and idiosyncratic individually, but collectively always there.
How do you handle these rigorously? When do you override a value? When do you flag it and move on? When do you decide the ambiguity itself is worth reporting? This is craft. Qualitative researchers have always known it. Survey designers have always known it. We built a tool that makes it visible in a new way.
That’s how trust forms. Not from a claimed accuracy number. Not from a dashboard that says 98%. From working with the data — refining the codebook, inspecting the disagreements, handling the edge cases — enough times that you know what right looks like. What wrong looks like. And what you decided to do about the rest.
Most tools treat validation as a step at the end — run a check, get a score, move on. A score tells you something, but it doesn’t tell you where the codebook is strong and where it’s fragile.
The minting process does. You draft a codebook. You mint. You inspect the results — not just whether values are right, but where multiple AI readers disagreed. Each reader wrote down its reasoning. An arbiter reads those reasonings side by side and traces the disagreement back to the piece of the codebook that caused it. The disagreement becomes a diagnosis of the codebook.
You refine. You re-mint. Some ambiguities resolve. New ones surface. You learn which questions your codebook handles cleanly and which ones it struggles with. Then you refine again. And again — as many rounds as the work still rewards. Each pass sharpens what the codebook handles, and exposes what it still doesn’t.
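The loop above can be sketched in miniature. This is not DataMint's implementation — the real readers and arbiter are model calls — but a minimal stand-in showing the shape of the idea: several independent readers answer the same question, their reasonings travel with their answers, and disagreement is surfaced rather than averaged away. All names here (`mint_question`, the stub readers) are hypothetical.

```python
from collections import Counter

def mint_question(readers, document, question):
    """Run several independent readers; surface disagreement instead of hiding it.

    `readers` is a list of callables standing in for separate model calls.
    Each returns (answer, reasoning). A real arbiter would read the
    reasonings side by side; this stub only reports whether one is needed.
    """
    results = [reader(document, question) for reader in readers]
    answers = [answer for answer, _ in results]
    consensus, votes = Counter(answers).most_common(1)[0]
    return {
        "consensus": consensus,
        "unanimous": votes == len(readers),
        # reasonings are kept so a disagreement can be traced back to the
        # piece of codebook language that caused it
        "reasonings": [reasoning for _, reasoning in results],
        "needs_arbiter": votes < len(readers),
    }

# hypothetical readers that split on a boundary case
reader_literal = lambda doc, q: ("no", "codebook says count only named interventions")
reader_inferring = lambda doc, q: ("yes", "the dosage table implies a second intervention")

verdict = mint_question([reader_literal, reader_literal, reader_inferring],
                        "some-document", "how many interventions?")
```

The point of the sketch is the return value: a consensus answer alone would hide exactly the signal that refinement feeds on.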
Over time, you come to know your codebook the way you know any tool you’ve worked with carefully. You know its reach and its limits. You know when to trust it and when to double-check. That knowledge — earned through use, not granted by a metric — is validation.
This is why we don’t show you a single accuracy number and call it done. A number collapses everything into one dimension. The relationship you build with your codebook is richer than that. It’s knowing that questions 1 through 8 are rock solid, that question 9 needs a human eye on historical documents, and that question 10 works beautifully now but didn’t until the third round of refinement. That’s what it means to trust your data.
Minted data carries its provenance — the codebook that shaped it, the process that produced it. And once you trust your codebook, that’s something you can share. A codebook you’ve refined through use carries that refinement with it. A colleague can pick it up, run it on their own collection of documents, and either the codebook handles that new material well — which is replication — or it surfaces new ambiguities that your documents never triggered, which makes the codebook stronger. Trust compounds through sharing. That’s a layer of validation you can’t produce alone.
The mint itself is intricate. Routing logic, dozens of model calls per document, context windows, layout reconstruction, verification passes, consensus machinery, deterministic validators — a lot of scaffolding between your documents and your answers. You don’t see any of it.
The codebook is the interface. You work on your research questions — what you want to extract, how to define it, how to handle the edge cases. The mint handles everything else. That separation is the point. Research questions are what you know deeply; AI engineering is what we know deeply. We built the mint so you never have to switch between them.
The mint has many layers of scaffolding. One of them — bloom — is deliberately visible.
Scientific documents often carry their argument in flow charts, diagrams, and tables — visual elements that matter but that language models can’t read directly. When you run a scientific mint, the mint blooms your documents first: a vision model reads each page, extracts the visual elements, and re-expresses them as structured plain text. The readers then reason over the bloomed text alongside the rest of the document. Vision models keep improving, so this piece keeps getting better.
You don’t have to think about bloom — it’s part of the scientific-mint process. But its output is inspectable, page by page. You can see what the vision model pulled out of each figure before the readers run, and confirm the conversion was faithful. When a diagram is carrying a key variable in your research, that visibility matters.
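The bloom step has a simple shape, even though the vision model behind it does not. A minimal sketch, assuming a vision-capable model wrapped as a callable — `bloom` and `fake_vision` are invented names for illustration, not DataMint's API:

```python
def bloom(pages, vision_model):
    """Re-express each page's visual elements as structured plain text.

    `vision_model` stands in for a real vision-capable model call: it takes
    a page image and returns text describing the figures and tables found
    there. Output is kept per page so every conversion can be inspected.
    """
    bloomed = []
    for page_number, page in enumerate(pages, start=1):
        extracted = vision_model(page)   # e.g. "Table 1: arm A n=40, arm B n=38"
        bloomed.append({"page": page_number, "visual_text": extracted})
    return bloomed

# a stub vision model for illustration
fake_vision = lambda page: f"[flow chart on {page}: two arms, 12 excluded]"

out = bloom(["page-1.png", "page-2.png"], fake_vision)
```

Keeping the output per page is what makes the pass inspectable: you can line up each figure against the text it became and confirm the conversion was faithful before the readers run.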
No single technique is sufficient. The challenge is combining them into a coherent process that works across clean text, scanned pages, degraded historical records, and visual documents.
And the literature is less settled than it looks. A technique that helps on clean English text can hurt on scanned historical records. A trick that worked with smaller models can be dead weight in larger ones. Move from text to vision and half the assumptions shift. Part of building the mint has been figuring out which techniques generalize, which need to be parameterized per task, and which turn out to make things worse in our context.
A fuller catalog — eight categories of technique and what each one does — lives on its own page.
See the techniques catalog →

Documents resist. Codebooks improve. The data gets sharper. And a codebook refined through use embodies everything you learned — the decisions you made, the ambiguities you resolved, the edge cases you learned to handle. Share it, and another researcher inherits your clarity — and refines it further for their documents and their questions.