How do we know whether the extracted data is valid and reliable?

This has nothing to do with AI. Validating data extracted from documents has been the hard problem for decades. The AI is a new reader. The craft is old.

1850

Textual criticism. Karl Lachmann's stemmatic method for reconstructing a text from many flawed documentary witnesses — compare, track shared errors, reconstruct the most defensible reading. The deep ancestor of extraction with an audit trail.

1927–1952

Content analysis. Harold Lasswell makes political text into analyzable evidence. Bernard Berelson's Content Analysis in Communication Research (1952) formalizes the codebook — coding texts into variables by explicit rules.

1955–1960

The math of agreement. William A. Scott's π (1955), then Jacob Cohen's κ (1960): the first chance-corrected measures of whether two readers extracted the same thing — beyond luck.
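For concreteness, here is a minimal sketch of the two statistics for a pair of readers who each coded the same items. The function names `scott_pi` and `cohen_kappa` and the toy labels are illustrative assumptions, not taken from the original papers. Both statistics have the form (observed agreement − chance agreement) / (1 − chance agreement); they differ only in how chance agreement is estimated.

```python
import math
from collections import Counter


def _observed_agreement(a, b):
    """Fraction of items on which the two readers gave the same code."""
    if len(a) != len(b) or not a:
        raise ValueError("both readers must code the same non-empty set of items")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def _chance_corrected(p_o, p_e):
    """Shared form of both statistics; undefined when chance agreement is 1."""
    if p_e == 1.0:
        return math.nan
    return (p_o - p_e) / (1.0 - p_e)


def cohen_kappa(a, b):
    """Cohen's kappa: chance agreement from each reader's own label frequencies."""
    n = len(a)
    p_o = _observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    return _chance_corrected(p_o, p_e)


def scott_pi(a, b):
    """Scott's pi: chance agreement from label frequencies pooled over both readers."""
    n = len(a)
    p_o = _observed_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    p_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return _chance_corrected(p_o, p_e)


# Toy example (made-up labels): the readers agree on 3 of 4 items.
reader_1 = ["yes", "yes", "no", "no"]
reader_2 = ["yes", "no", "no", "no"]
kappa = cohen_kappa(reader_1, reader_2)
pi = scott_pi(reader_1, reader_2)
```

Raw agreement in the toy example is 0.75, but both statistics come out well below that, because two readers who each say "no" often would agree on many items by luck alone.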

1979

Noisy readers, no answer key. Dawid & Skene show how to estimate the true label from multiple error-prone readers — even without a gold standard. The direct ancestor of ensemble-disagreement reliability.
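The idea can be sketched as a small expectation-maximisation loop. This is a simplified illustration under stated assumptions, not the paper's exact procedure: the function name `dawid_skene`, the additive `smoothing` constant, the fixed iteration count, and the toy data are all hypothetical choices made here. Each reader gets a confusion matrix (how often they report each label given each true label), and the loop alternates between estimating those matrices and re-estimating the probability of each item's true label.

```python
def dawid_skene(items, n_classes, n_iter=50, smoothing=0.01):
    """EM sketch. `items` is a list; each item is a list of (reader, label) pairs,
    with labels as integers in range(n_classes).
    Returns (posteriors, confusion), where confusion[reader][true][reported]."""
    readers = sorted({r for item in items for r, _ in item})

    # Initialise each item's posterior from its vote shares (soft majority vote).
    post = []
    for item in items:
        counts = [smoothing] * n_classes
        for _, lab in item:
            counts[lab] += 1.0
        total = sum(counts)
        post.append([c / total for c in counts])

    conf = {}
    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per reader.
        prior = [sum(p[k] for p in post) / len(post) for k in range(n_classes)]
        conf = {r: [[smoothing] * n_classes for _ in range(n_classes)] for r in readers}
        for item, p in zip(items, post):
            for r, lab in item:
                for k in range(n_classes):
                    conf[r][k][lab] += p[k]
        for r in readers:
            for k in range(n_classes):
                row_total = sum(conf[r][k])
                conf[r][k] = [v / row_total for v in conf[r][k]]

        # E-step: posterior over each item's true label given all reports.
        new_post = []
        for item in items:
            weights = list(prior)
            for r, lab in item:
                for k in range(n_classes):
                    weights[k] *= conf[r][k][lab]
            total = sum(weights)
            new_post.append([w / total for w in weights])
        post = new_post

    return post, conf


# Toy data (made up): readers "a" and "b" always agree; reader "c" is erratic.
toy_items = [
    [("a", 0), ("b", 0), ("c", 0)],
    [("a", 0), ("b", 0), ("c", 1)],
    [("a", 1), ("b", 1), ("c", 0)],
    [("a", 1), ("b", 1), ("c", 1)],
    [("a", 0), ("b", 0), ("c", 1)],
    [("a", 1), ("b", 1), ("c", 0)],
]
posteriors, confusion = dawid_skene(toy_items, n_classes=2)
estimated = [max(range(2), key=lambda k: p[k]) for p in posteriors]
```

No answer key is supplied anywhere: the estimated labels and each reader's estimated accuracy both come out of the pattern of agreement and disagreement among the readers. In the toy data, reader "c" ends up with a much weaker confusion matrix than "a" or "b", so its reports carry little weight.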

1993

The protocol. Iain Chalmers founds the Cochrane Collaboration. Duplicate extraction, standardised critical appraisal, audit trails — extraction from documents becomes a regulated craft.

2022+

A new reader. Large language models make the codebook executable at scale. Everything upstream — construct definition, codebook design, adjudication, agreement, provenance — unchanged.

What the mint grade adds to the 1979 math

Dawid & Skene modelled each reader with fixed error rates (one confusion matrix per reader). A modern AI reader's error rates shift with the prompt, the context, and the model family. The mint grade extends the 1979 framework to that heterogeneous regime: it is calibrated across multiple reader signals and (soon) against human-labelled ground truth.

The reader is new. The craft of checking the read is older than the Turing test.

@karlrohe