Physics-R1

An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

TL;DR

Multimodal physics evaluation is distorted by three construction practices that standard checks fail to detect: train–eval contamination (single-stage 5-gram-Jaccard audits report zero hits where a three-stage audit surfaces 134 SciInstruct near-duplicates), translation drift (Sonnet 4.5 attains 30.5% on Estonian-original olympiad problems vs. 13.6% on English translations of the same problems), and MCQ saturation (a 46-pp same-model score gradient between PhyX 4-way MCQ and open-ended olympiad evaluation). We release four artifacts addressing these gaps: PhysCorp-A (a 6,432-record, three-stage-audited multimodal corpus), PhysR1Corp (a 2,268-record closed-form RL pool), PhysOlym-A (a 500-problem, 99.8%-novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe that lifts PhysOlym-A liberal accuracy by +18.3 pp (3-seed mean) over the Qwen3-VL-8B-Thinking base.

Three Findings

134 SciInstruct near-duplicates · single-stage audit reports clean

UGPhysics-Train, SciInstruct, and MMK12 pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals. A three-stage audit (Jaccard → mxbai-embed-large cosine → Haiku-4.5 LLM-judge) surfaces 4,846 paraphrase candidates and 134 close-duplicates in SciInstruct alone.

17 pp ET → EN translation delta · translation underestimates ability

On 59 paired Estonian/English olympiad problems, Sonnet 4.5 attains 30.5% strict on Estonian originals vs. 13.6% on English translations of the same problems (sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp).
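A minimal sketch of the paired bootstrap behind the reported CI, resampling the 59 problem pairs with replacement (array names and the replicate count are illustrative, not the released protocol):

import numpy as np

def paired_bootstrap_ci(et_correct, en_correct, n_boot=10_000, seed=0):
    # et_correct / en_correct: 0/1 strict-grading outcomes for the same
    # 59 problems, in the same order (paired by problem).
    et, en = np.asarray(et_correct), np.asarray(en_correct)
    rng = np.random.default_rng(seed)
    n = len(et)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample problem pairs, not halves
        deltas[b] = et[idx].mean() - en[idx].mean()
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return 100 * lo, 100 * hi  # ET-minus-EN delta in percentage points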

46 pp format–novelty gradient · same model, three benchmarks

On identical Sonnet 4.5 weights: 79.7% on PhyX (4-way MCQ) → 50.4% on OlympiadBench-Physics (open-ended) → 33.4% on PhysOlym-A (open-ended, novel-source, audited). Format and novelty alone move the score 46 points on fixed weights.

The Three-Stage Audit Pipeline

Pairwise across the training pool and six public physics evals; pseudocode in audit/.

Stage 1 — n-gram Jaccard ≥ 0.4

Tokenize each problem statement with a unicode word tokenizer, build the 5-gram shingle set, flag pairs by Jaccard. Catches verbatim duplication; misses paraphrase-class contamination entirely.
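A minimal sketch of the Stage-1 check (function names are illustrative; the released audit/ code is the reference implementation):

import re

def shingles(text: str, n: int = 5) -> set:
    # Unicode word tokenizer: \w+ matches word characters across scripts.
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def stage1_flags(train: list, evals: list, thr: float = 0.4):
    # Flag train/eval pairs whose 5-gram shingle sets overlap at Jaccard >= thr.
    eval_sets = [shingles(e) for e in evals]
    for i, t in enumerate(train):
        t_set = shingles(t)
        for j, e_set in enumerate(eval_sets):
            score = jaccard(t_set, e_set)
            if score >= thr:
                yield i, j, score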

Stage 2 — embedding cosine ≥ 0.85

Encode each statement with mxbai-embed-large-v1 (1024-dim, L2-normalized); flag pairs by cosine similarity. High recall over close-content pairs; also flags same-topic-but-distinct-problem pairs (false positives).
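A sketch of the Stage-2 pass via sentence-transformers (loading the embedder through its HuggingFace ID is an assumption about tooling, not the released code path):

import numpy as np
from sentence_transformers import SentenceTransformer

# mxbai-embed-large-v1: 1024-dim embeddings; normalized so dot product = cosine.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

def stage2_candidates(train_texts, eval_texts, thr=0.85):
    E_train = model.encode(train_texts, normalize_embeddings=True)
    E_eval = model.encode(eval_texts, normalize_embeddings=True)
    sims = E_train @ E_eval.T  # (n_train, n_eval) cosine matrix
    pairs = np.argwhere(sims >= thr)
    return [(int(i), int(j), float(sims[i, j])) for i, j in pairs]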

Stage 3 — Haiku-4.5 LLM-judge precision filter

For each Stage-2 candidate, a Haiku-4.5 judge classifies the pair as a close duplicate (paraphrase / numeric variation of the same problem) or a same-topic neighbor (related physics, distinct setup). Only Stage-3 close-duplicates are removed. Cosine-bucketed precision: 100% close-dup at cos ≥ 0.95; 1.5% at cos ∈ [0.85, 0.87).
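A sketch of the Stage-3 judge call using the Anthropic SDK (the model string and prompt wording are assumptions, not the released protocol):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "Are these two physics problems the same problem (a paraphrase or numeric "
    "variation) or merely same-topic neighbors (related physics, distinct setup)?\n"
    "Problem A:\n{a}\n\nProblem B:\n{b}\n\n"
    "Answer with exactly one word: DUPLICATE or NEIGHBOR."
)

def stage3_is_duplicate(a: str, b: str, model: str = "claude-haiku-4-5") -> bool:
    # Only DUPLICATE verdicts trigger removal; NEIGHBOR pairs are kept.
    msg = client.messages.create(
        model=model,  # assumed model ID; substitute the current Haiku alias
        max_tokens=8,
        messages=[{"role": "user", "content": PROMPT.format(a=a, b=b)}],
    )
    return msg.content[0].text.strip().upper().startswith("DUPLICATE")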

Robustness: embedder Spearman ρ = 0.78 vs. OpenAI text-embedding-3-large, whose candidate set is a strict subset of mxbai's at every threshold. Cross-judge agreement on the Sonnet-as-judge protocol over a 50-problem PhysOlym-A subset: Cohen's κ = 0.44 vs. GPT-4o, with GPT-4o the more lenient judge, so any self-grading bias runs opposite to the feared direction.
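A sketch of both agreement statistics, with toy arrays standing in for the real verdicts and best-overlap scores (all data below is illustrative):

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Cross-judge agreement: per-problem correct/incorrect verdicts (0/1)
# from two judges over the same 50-problem subset.
sonnet_verdicts = rng.integers(0, 2, size=50)
gpt4o_verdicts = rng.integers(0, 2, size=50)
print("Cohen's kappa:", cohen_kappa_score(sonnet_verdicts, gpt4o_verdicts))

# Embedder agreement: rank correlation between per-record best-overlap
# similarity scores from the two embedders.
mxbai_scores = rng.random(1_000)
openai_scores = mxbai_scores + 0.1 * rng.normal(size=1_000)  # correlated toy scores
rho, p = spearmanr(mxbai_scores, openai_scores)
print("Spearman rho:", rho)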

Released Artifacts

| Artifact | Size | Purpose | Hosting |
|---|---|---|---|
| PhysCorp-A | 6,432 records | Audited multimodal physics corpus (fully Stage-3 clean against six public evals) | 🤗 HuggingFace |
| PhysR1Corp | 2,268 records | Closed-form RL training pool (numeric / MCQ-gradable carve-out) | 🤗 HuggingFace |
| PhysOlym-A | 500 problems | Held-out olympiad eval: 99.8% novel-source, EN/ET bilingual, native difficulty labels | 🤗 HuggingFace |
| PhysCorp-pre-audit | 14,294 records | Raw pre-audit pool, released so users can re-run the audit | 🤗 HuggingFace |
| Physics-R1 (recipe) | code + config | GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking with binary correctness reward | GitHub |
| Audit pipeline | Python | Three-stage contamination audit (audit/) + saved best-overlap scores + judge labels | GitHub |

Dataset Compositions

[Composition charts] PhysCorp-A: 6,432 records by source family · PhysR1Corp: 2,268 records, numeric vs. MCQ-gradable · PhysOlym-A: 500 problems by olympiad source.

PhysCorp-A preserves in full its 1,609 first-to-ML olympiad records (Estonian PhO, Zhou, IPhO+NBPhO+EuPhO, APhO+USAPhO+INPhO); the remaining 4,823 come from repackaged sources after the three-stage audit. Per-family counts shown are approximate, allocated proportionally from the pre-audit pool; exact counts are available in the released dataset card.

Physics-R1 Results

Liberal accuracy (Sonnet-as-judge); parenthesized deltas are Δ vs. the Qwen3-VL-8B-Thinking base.

| Model | PhyX-1k | PhyX-3k | PhysReason | PUB-OE | OlymBench | PhysOlym-A |
|---|---|---|---|---|---|---|
| *Closed-source frontier* | | | | | | |
| Claude Sonnet 4.5 | 79.7 | 80.6 | 36.6 | 37.7 | 50.4 | 33.4 |
| Gemini 2.5 Pro | 75.1 | 49.8 | 38.8 | 33.4 | 37.4 | 12.2 |
| GPT-4o | 70.4 | 53.6 | 48.9 | 31.5 | 19.7 | 8.0 |
| *Open-source bases* | | | | | | |
| Qwen3-VL-32B-Thinking | 73.8 | 84.2 | 25.1 | 32.8 | 53.9 | 13.2 |
| Qwen3-VL-8B-Thinking (base) | 73.7 | 74.4 | 23.9 | 35.3 | 39.3 | 8.0 |
| InternVL3-8B | 46.8 | 43.1 | 13.3 | 23.5 | 10.4 | 4.0 |
| Physics-R1 (binary, 3-seed mean ± σ) | 77.8 ± 0.3 (+4.1) | 76.9 ± 0.3 (+2.5) | 39.6 ± 6.4 (+15.7) | 34.8 ± 3.3 (−0.5) | 46.2 ± 1.5 (+6.9) | 26.3 ± 1.7 (+18.3) |
| Physics-R1 (dense, ablation) | 78.3 (+4.6) | 77.5 (+3.1) | 23.3 (−0.6) | 36.7 (+1.4) | 40.5 (+1.2) | 19.2 (+11.2) |

Physics-R1 (binary) lifts PhysOlym-A liberal accuracy by +18.3 pp over the 8B base, still 7.1 pp below Sonnet 4.5. The recipe is reported as evidence that the audited corpus is trainable, not as a SOTA capability claim.
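A minimal sketch of a binary correctness reward of the kind the recipe names (the grading logic and tolerance are assumptions; the released config is the reference):

def binary_reward(pred: str, gold: str, rel_tol: float = 1e-2) -> float:
    # Numeric answers: relative-tolerance match; MCQ answers: letter match.
    try:
        p, g = float(pred), float(gold)
        correct = abs(p - g) <= rel_tol * max(abs(g), 1e-12)
    except ValueError:
        correct = pred.strip().lower() == gold.strip().lower()
    return 1.0 if correct else 0.0  # all-or-nothing, unlike the dense ablation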

3-seed sweep on PhysOlym-A liberal, trained on the audited PhysR1Corp: seed 42 = 25.6, seed 17 = 25.0, seed 23 = 28.2 (mean 26.3 ± 1.7).

Quick start

Run the audit pipeline

git clone https://github.com/shanyang-me/physics-r1-code
cd physics-r1-code
pip install -r requirements.txt

python audit/audit_two_stage.py \
    --train_jsonl your_pool.jsonl \
    --eval_jsonl  data/physolym_a.jsonl \
    --jaccard_thr 0.4 \
    --cosine_thr 0.85 \
    --emit report.json

Load PhysOlym-A

from datasets import load_dataset
ds = load_dataset("shanyangmie/physolym-a", split="test")
print(ds[0])  # {source, messages, solution, metadata}

BibTeX

@misc{yang2026physicsr1,
  title  = {Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning},
  author = {Yang, Shan},
  year   = {2026}
}