Physics-R1

An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

TL;DR

Multimodal physics evaluation is distorted by three construction practices that standard checks fail to detect: train–eval contamination (single-stage 5-gram-Jaccard audits report zero hits where a three-stage audit surfaces 134 SciInstruct near-duplicates), translation drift (Sonnet 4.5 attains 30.5% on Estonian-original olympiad problems vs. 13.6% on English translations of the same problems), and MCQ saturation (a 46-pp same-model score gradient between PhyX 4-way MCQ and open-ended olympiad evaluation). We release four artifacts addressing these gaps: PhysCorp-A (a 6,432-record, three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8%-novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe that lifts PhysOlym-A liberal accuracy by +18.3 pp (3-seed mean) over the Qwen3-VL-8B-Thinking base.

Three Findings

134 SciInstruct near-duplicates · single-stage audit reports clean

UGPhysics-Train, SciInstruct, and MMK12 pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals. A three-stage audit (Jaccard → mxbai-embed-large cosine → Haiku-4.5 LLM-judge) surfaces 4,846 paraphrase candidates and 134 close-duplicates in SciInstruct alone.

17 pp ET → EN translation delta · translation underestimates ability

On 59 paired Estonian/English olympiad problems, Sonnet 4.5 attains 30.5% strict on Estonian originals vs. 13.6% on English translations of the same problems (sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp).
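A minimal sketch of the paired bootstrap behind the reported CI, resampling the 59 problem pairs with replacement (array names and the replicate count are illustrative, not the released protocol):

import numpy as np

def paired_bootstrap_ci(et_correct, en_correct, n_boot=10_000, seed=0):
    # et_correct / en_correct: 0/1 strict-grading outcomes for the same
    # 59 problems, in the same order (paired by problem).
    et, en = np.asarray(et_correct), np.asarray(en_correct)
    rng = np.random.default_rng(seed)
    n = len(et)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample problem pairs, not halves
        deltas[b] = et[idx].mean() - en[idx].mean()
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return 100 * lo, 100 * hi  # ET-minus-EN delta in percentage points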

46 pp format–novelty gradient · same model, three benchmarks

On identical Sonnet 4.5 weights: 79.7% on PhyX (4-way MCQ) → 50.4% on OlympiadBench-Physics (open-ended) → 33.4% on PhysOlym-A (open-ended, novel-source, audited). Format and novelty alone move the score 46 points on fixed weights.

The Three-Stage Audit Pipeline

Pairwise across the training pool and six public physics evals; pseudocode in audit/.

Stage 1 — n-gram Jaccard ≥ 0.4

Tokenize each problem statement with a unicode word tokenizer, build the 5-gram shingle set, flag pairs by Jaccard. Catches verbatim duplication; misses paraphrase-class contamination entirely.
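A minimal sketch of the Stage-1 check (function names are illustrative; the released audit/ code is the reference implementation):

import re

def shingles(text: str, n: int = 5) -> set:
    # Unicode word tokenizer: \w+ matches word characters across scripts.
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def stage1_flags(train: list, evals: list, thr: float = 0.4):
    # Flag train/eval pairs whose 5-gram shingle sets overlap at Jaccard >= thr.
    eval_sets = [shingles(e) for e in evals]
    for i, t in enumerate(train):
        t_set = shingles(t)
        for j, e_set in enumerate(eval_sets):
            score = jaccard(t_set, e_set)
            if score >= thr:
                yield i, j, score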

Stage 2 — embedding cosine ≥ 0.85

Encode each statement with mxbai-embed-large-v1 (1024-dim, L2-normalized); flag pairs by cosine similarity. High recall over close-content pairs; also flags same-topic-but-distinct-problem pairs (false positives).
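A sketch of the Stage-2 pass via sentence-transformers (loading the embedder through its HuggingFace ID is an assumption about tooling, not the released code path):

import numpy as np
from sentence_transformers import SentenceTransformer

# mxbai-embed-large-v1: 1024-dim embeddings; normalized so dot product = cosine.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

def stage2_candidates(train_texts, eval_texts, thr=0.85):
    E_train = model.encode(train_texts, normalize_embeddings=True)
    E_eval = model.encode(eval_texts, normalize_embeddings=True)
    sims = E_train @ E_eval.T  # (n_train, n_eval) cosine matrix
    pairs = np.argwhere(sims >= thr)
    return [(int(i), int(j), float(sims[i, j])) for i, j in pairs]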

Stage 3 — Haiku-4.5 LLM-judge precision filter

For each Stage-2 candidate, a Haiku-4.5 judge classifies the pair as a close duplicate (paraphrase / numeric variation of the same problem) or a same-topic neighbor (related physics, distinct setup). Only Stage-3 close-duplicates are removed. Cosine-bucketed precision: 100% close-dup at cos ≥ 0.95; 1.5% at cos ∈ [0.85, 0.87).
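A sketch of the Stage-3 judge call using the Anthropic SDK (the model string and prompt wording are assumptions, not the released protocol):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "Are these two physics problems the same problem (a paraphrase or numeric "
    "variation) or merely same-topic neighbors (related physics, distinct setup)?\n"
    "Problem A:\n{a}\n\nProblem B:\n{b}\n\n"
    "Answer with exactly one word: DUPLICATE or NEIGHBOR."
)

def stage3_is_duplicate(a: str, b: str, model: str = "claude-haiku-4-5") -> bool:
    # Only DUPLICATE verdicts trigger removal; NEIGHBOR pairs are kept.
    msg = client.messages.create(
        model=model,  # assumed model ID; substitute the current Haiku alias
        max_tokens=8,
        messages=[{"role": "user", "content": PROMPT.format(a=a, b=b)}],
    )
    return msg.content[0].text.strip().upper().startswith("DUPLICATE")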

Robustness: embedder Spearman ρ = 0.78 vs. OpenAI text-embedding-3-large, whose candidate set is a strict subset of mxbai's at every threshold. Cross-judge agreement on the Sonnet-as-judge protocol over a 50-problem PhysOlym-A subset: Cohen's κ = 0.44 vs. GPT-4o, with GPT-4o the more lenient judge, so any self-grading bias runs opposite to the feared direction.
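A sketch of both agreement statistics, with toy arrays standing in for the real verdicts and best-overlap scores (all data below is illustrative):

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Cross-judge agreement: per-problem correct/incorrect verdicts (0/1)
# from two judges over the same 50-problem subset.
sonnet_verdicts = rng.integers(0, 2, size=50)
gpt4o_verdicts = rng.integers(0, 2, size=50)
print("Cohen's kappa:", cohen_kappa_score(sonnet_verdicts, gpt4o_verdicts))

# Embedder agreement: rank correlation between per-record best-overlap
# similarity scores from the two embedders.
mxbai_scores = rng.random(1_000)
openai_scores = mxbai_scores + 0.1 * rng.normal(size=1_000)  # correlated toy scores
rho, p = spearmanr(mxbai_scores, openai_scores)
print("Spearman rho:", rho)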

Released Artifacts

| Artifact | Size | Purpose | Hosting |
|---|---|---|---|
| PhysCorp-A | 6,432 records | Audited multimodal physics corpus (fully Stage-3 clean against six public evals) | 🤗 HuggingFace |
| PhysR1Corp | 2,268 records | Closed-form RL training pool (numeric / MCQ-gradable carve-out) | 🤗 HuggingFace |
| PhysOlym-A | 500 problems | Held-out olympiad eval: 99.8% novel-source, EN/ET bilingual, native difficulty labels | 🤗 HuggingFace |
| PhysCorp-pre-audit | 14,294 records | Raw pre-audit pool, released so users can re-run the audit | 🤗 HuggingFace |
| Physics-R1 (recipe) | code + config | GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking with binary correctness reward | GitHub |
| Audit pipeline | Python | Three-stage contamination audit (audit/) + saved best-overlap scores + judge labels | GitHub |

Dataset Compositions

[Composition charts] PhysCorp-A: 6,432 records by source family · PhysR1Corp: 2,268 records, numeric vs. MCQ-gradable · PhysOlym-A: 500 problems by olympiad source.

PhysCorp-A preserves in full its 1,609 first-to-ML olympiad records (Estonian PhO, Zhou, IPhO+NBPhO+EuPhO, APhO+USAPhO+INPhO); the remaining 4,823 come from repackaged sources after the three-stage audit. Per-family counts shown are approximate, allocated proportionally from the pre-audit pool; exact counts are available in the released dataset card.

Physics-R1 Results

Liberal accuracy (Sonnet-as-judge); parenthesized deltas are Δ vs. the Qwen3-VL-8B-Thinking base.

| Model | PhyX-1k | PhyX-3k | PhysReason | PUB-OE | OlymBench | PhysOlym-A |
|---|---|---|---|---|---|---|
| *Closed-source frontier* | | | | | | |
| Claude Sonnet 4.5 | 79.7 | 80.6 | 36.6 | 37.7 | 50.4 | 33.4 |
| Gemini 2.5 Pro | 75.1 | 49.8 | 38.8 | 33.4 | 37.4 | 12.2 |
| GPT-4o | 70.4 | 53.6 | 48.9 | 31.5 | 19.7 | 8.0 |
| *Open-source bases* | | | | | | |
| Qwen3-VL-32B-Thinking | 73.8 | 84.2 | 25.1 | 32.8 | 53.9 | 13.2 |
| Qwen3-VL-8B-Thinking (base) | 73.7 | 74.4 | 23.9 | 35.3 | 39.3 | 8.0 |
| InternVL3-8B | 46.8 | 43.1 | 13.3 | 23.5 | 10.4 | 4.0 |
| Physics-R1 (binary, 3-seed mean ± σ) | 77.8 ± 0.3 (+4.1) | 76.9 ± 0.3 (+2.5) | 39.6 ± 6.4 (+15.7) | 34.8 ± 3.3 (−0.5) | 46.2 ± 1.5 (+6.9) | 26.3 ± 1.7 (+18.3) |
| Physics-R1 (dense, ablation) | 78.3 (+4.6) | 77.5 (+3.1) | 23.3 (−0.6) | 36.7 (+1.4) | 40.5 (+1.2) | 19.2 (+11.2) |

Physics-R1 (binary) lifts PhysOlym-A liberal accuracy by +18.3 pp over the 8B base, still 7.1 pp below Sonnet 4.5. The recipe is reported as evidence that the audited corpus is trainable, not as a SOTA capability claim.
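A minimal sketch of a binary correctness reward of the kind the recipe names (the grading logic and tolerance are assumptions; the released config is the reference):

def binary_reward(pred: str, gold: str, rel_tol: float = 1e-2) -> float:
    # Numeric answers: relative-tolerance match; MCQ answers: letter match.
    try:
        p, g = float(pred), float(gold)
        correct = abs(p - g) <= rel_tol * max(abs(g), 1e-12)
    except ValueError:
        correct = pred.strip().lower() == gold.strip().lower()
    return 1.0 if correct else 0.0  # all-or-nothing, unlike the dense ablation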

3-seed sweep on PhysOlym-A liberal, trained on the audited PhysR1Corp: seed 42 = 25.6, seed 17 = 25.0, seed 23 = 28.2 (mean 26.3 ± 1.7).

Quick start

Run the audit pipeline

git clone https://github.com/shanyang-me/physics-r1-code
cd physics-r1-code
pip install -r requirements.txt

python audit/audit_two_stage.py \
    --train_jsonl your_pool.jsonl \
    --eval_jsonl  data/physolym_a.jsonl \
    --jaccard_thr 0.4 \
    --cosine_thr 0.85 \
    --emit report.json

Load PhysOlym-A

from datasets import load_dataset
ds = load_dataset("shanyangmie/physolym-a", split="test")
print(ds[0])  # {source, messages, solution, metadata}

BibTeX

@misc{yang2026physicsr1,
  title  = {Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning},
  author = {Yang, Shan},
  year   = {2026}
}