What disagreement cannot tell you.
High agreement doesn't mean you measured the right thing.
Three AI readers. One question. One document.
Reader A
Yes
Reader B
Yes
Reader C
Yes
3 / 3 agreed
But they're all wrong.And across many documents, the same story can play out: high inter-rater reliability doesn't mean you measured the right thing.
Correlated errors. The three readers were trained on overlapping corpora. They share the same misconception, confidently. Agreement is not independence.
Construct drift. The codebook's wording captures a related-but-different idea. The readers execute it faithfully. Cronbach's α won't notice.
Language blind spots. All three readers recognise "double-blind" but miss "single-blind, outcome-masked." The corpus shapes their agreement more than your codebook does.
a practical minimum
for stress-testing correlated error — not a settled standard
- ≥ 3 readers
- ≥ 2 base-model families (Data Mint runs Gemini alongside open-weights — Kimi, GPT-OSS, Llama — via DeepInfra)
- same codebook, same document, separate runs
- self-consistency from one model ≠ multiple readers
cf.Cronbach & Meehl (1955) construct validity · Hacking (1995) looping effects · Krippendorff (2004) · Lakshminarayanan et al. (2017) ensemble diversity
Disagreement is a useful variance diagnostic,
not a validity oracle.
One layer of validation. Built through use, over time.