What disagreement cannot tell you.

High agreement doesn't mean you measured the right thing.

Three AI readers. One question. One document.

Reader A

Yes

Reader B

Yes

Reader C

Yes

3 / 3 agreed

But they're all wrong.And across many documents, the same story can play out: high inter-rater reliability doesn't mean you measured the right thing.

Correlated errors. The three readers were trained on overlapping corpora. They share the same misconception, confidently. Agreement is not independence.

Construct drift. The codebook's wording captures a related-but-different idea. The readers execute it faithfully. Cronbach's α won't notice.

Language blind spots. All three readers recognise "double-blind" but miss "single-blind, outcome-masked." The corpus shapes their agreement more than your codebook does.

a practical minimum

for stress-testing correlated error — not a settled standard

≥ 3 readers
≥ 2 base-model families (Data Mint runs Gemini alongside open-weights — Kimi, GPT-OSS, Llama — via DeepInfra)
same codebook, same document, separate runs
self-consistency from one model ≠ multiple readers

cf.Cronbach & Meehl (1955) construct validity · Hacking (1995) looping effects · Krippendorff (2004) · Lakshminarayanan et al. (2017) ensemble diversity

Disagreement is a useful variance diagnostic,
not a validity oracle.

One layer of validation. Built through use, over time.

@karlrohe