The contamination check runs pairwise across the training pool and six public physics evals; pseudocode is in audit/.
Stage 1 — n-gram Jaccard ≥ 0.4
Tokenize each problem statement with a unicode word tokenizer, build the 5-gram shingle set, flag pairs by Jaccard. Catches verbatim duplication; misses paraphrase-class contamination entirely.
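A minimal sketch of the Stage-1 check, using `\w+` as a stand-in for the unicode word tokenizer (the actual tokenizer and the audit/ pseudocode may differ):

```python
import re

def shingles(text: str, n: int = 5) -> set:
    """Build the n-gram word-shingle set of a problem statement.

    Assumption: a simple `\\w+` unicode tokenizer; the pipeline's
    actual tokenizer may normalize differently.
    """
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def stage1_flag(p1: str, p2: str, threshold: float = 0.4) -> bool:
    """Flag a pair when its 5-gram shingle Jaccard clears the threshold."""
    return jaccard(shingles(p1), shingles(p2)) >= threshold
```

A verbatim duplicate scores Jaccard 1.0, while a paraphrase that shares no 5-token run scores near 0, which is exactly the stage's stated blind spot.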
Stage 2 — embedding cosine ≥ 0.85
Encode each statement with mxbai-embed-large-v1 (1024-dim, L2-normalized); flag pairs by cosine similarity. High recall over close-content pairs; also flags same-topic-but-distinct-problem pairs (false positives).
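Because the embeddings are L2-normalized, cosine similarity reduces to a dot product and one matrix product scores every pair at once. A sketch that assumes the mxbai-embed-large-v1 calls happen upstream and the resulting unit-norm vectors arrive as rows of a matrix:

```python
import numpy as np

def candidate_pairs(emb: np.ndarray, threshold: float = 0.85):
    """Stage-2 candidate pairs from L2-normalized embedding rows.

    With unit-norm rows, emb @ emb.T is the full cosine matrix.
    Returns (i, j, cosine) for each upper-triangle pair at or above
    the threshold.
    """
    sims = emb @ emb.T                        # (N, N) cosine matrix
    iu, ju = np.triu_indices(len(emb), k=1)   # each unordered pair once
    mask = sims[iu, ju] >= threshold
    return list(zip(iu[mask].tolist(), ju[mask].tolist(),
                    sims[iu, ju][mask].tolist()))
```

The O(N²) matrix is fine at training-pool scale for a one-off audit; an approximate-nearest-neighbor index would be the usual swap-in if N grows.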
Stage 3 — Haiku-4.5 LLM-judge precision filter
For each Stage-2 candidate, a Haiku-4.5 judge classifies the pair as a close duplicate (paraphrase / numeric variation of the same problem) or a same-topic neighbor (related physics, distinct setup). Only Stage-3 close-duplicates are removed. Cosine-bucketed precision: 100% close-dup at cos ≥ 0.95; 1.5% at cos ∈ [0.85, 0.87).
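The Stage-3 filter and the cosine-bucketed precision readout can be sketched with the judge abstracted behind a plain function; the actual Haiku-4.5 prompt and API call are elided, and the 0.02-wide buckets are an assumption matching the [0.85, 0.87) interval quoted above:

```python
from collections import defaultdict

# Labels the judge returns; the Haiku-4.5 classification call itself
# is stubbed out as `judge(i, j)` in this sketch.
CLOSE_DUP, NEIGHBOR = "close_duplicate", "same_topic_neighbor"

def stage3_filter(candidates, judge):
    """Keep only judge-confirmed close duplicates for removal."""
    return [(i, j, cos) for (i, j, cos) in candidates
            if judge(i, j) == CLOSE_DUP]

def bucketed_precision(candidates, judge, width=0.02, lo=0.85):
    """Per-cosine-bucket fraction of candidates judged close-dup.

    This is how figures like 100% at cos >= 0.95 vs. 1.5% at
    cos in [0.85, 0.87) are read off; keys are bucket left edges.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for i, j, cos in candidates:
        b = round(lo + width * int((cos - lo) / width), 2)
        totals[b] += 1
        hits[b] += judge(i, j) == CLOSE_DUP
    return {b: hits[b] / totals[b] for b in totals}
```

Running only Stage-2 survivors through the judge keeps the LLM bill proportional to the candidate set, not the full pairwise space.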
Robustness checks — alternate embedder and cross-judge agreement
Re-running Stage 2 with text-embedding-3-large yields a candidate set that is a strict subset of mxbai's at every threshold. Cross-judge agreement on the Sonnet-as-judge protocol over a 50-problem PhysOlym-A subset: Cohen's κ = 0.44 vs. GPT-4o, with GPT-4o the more lenient judge (i.e., the self-grading direction is opposite to the feared bias).
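The κ = 0.44 figure is the standard chance-corrected agreement, κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement rate and p_e the agreement expected from each judge's marginal label frequencies; a minimal implementation over the two judges' label vectors:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two judges' labels over the same items:
    (p_o - p_e) / (1 - p_e), with p_e derived from the two judges'
    marginal label frequencies."""
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement gives κ = 1, chance-level agreement gives κ ≈ 0, so 0.44 sits in the moderate range despite the judges' differing leniency.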