Idea: One-Step RLVR: What If Reinforcement Learning Only Needed One Gradient Step?

April 12, 2026

Note: see research proposal slide deck

Core idea

Key observations motivating the idea

The proposed shift

Method

1. Sample a massive rollout batch from the base policy

2. Score rollouts with a verifier

3. Compute one policy gradient step

4. Line-search over step size
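
The four steps above can be sketched on a toy problem. This is a minimal illustration, not the proposal's actual implementation. It assumes a softmax policy over a handful of discrete answers, a REINFORCE estimator with a mean-reward baseline, and a line search that scores each candidate step size on freshly sampled rollouts. The names `one_step_rlvr`, `verifier`, and `step_sizes` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()


def verifier(answers, correct_answer):
    """Toy verifiable reward (assumption): 1.0 if the answer matches, else 0.0."""
    return (answers == correct_answer).astype(float)


def one_step_rlvr(theta0, correct_answer, n_rollouts, step_sizes, rng):
    """Take a single policy-gradient step from theta0, with a line search."""
    k = theta0.shape[0]
    probs = softmax(theta0)

    # 1. Sample a large rollout batch from the base policy.
    answers = rng.choice(k, size=n_rollouts, p=probs)

    # 2. Score every rollout with the verifier.
    rewards = verifier(answers, correct_answer)

    # 3. One REINFORCE gradient estimate with a mean-reward baseline.
    #    For a softmax policy, grad log p(a) = one_hot(a) - probs.
    advantages = rewards - rewards.mean()
    one_hot = np.eye(k)[answers]
    grad = ((one_hot - probs) * advantages[:, None]).mean(axis=0)

    # 4. Line search: keep the step size with the best verifier score
    #    on fresh rollouts sampled from each candidate policy.
    best_eta, best_score = 0.0, -np.inf
    for eta in step_sizes:
        cand_probs = softmax(theta0 + eta * grad)
        val_answers = rng.choice(k, size=n_rollouts, p=cand_probs)
        score = verifier(val_answers, correct_answer).mean()
        if score > best_score:
            best_eta, best_score = eta, score

    return theta0 + best_eta * grad, grad, best_eta


rng = np.random.default_rng(0)
theta0 = np.zeros(4)  # uniform base policy over 4 candidate answers
theta1, grad, eta = one_step_rlvr(theta0, 2, 20000, [0.0, 1.0, 10.0, 100.0], rng)
print("chosen step size:", eta)
print("P(correct) before:", softmax(theta0)[2], "after:", softmax(theta1)[2])
```

In this toy setting the gradient is computed once, at the base policy, and only the scalar step size is tuned afterwards. A real language-model version would replace the logit vector with model parameters and the toy verifier with a task-specific checker.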

Why this might work

Overshoot and stability

How to test the hypothesis

Useful diagnostics

Cosine similarity to a real RL trajectory
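
One possible reading of this diagnostic, offered as an assumption rather than the author's definition: compare the one-step update direction with the net parameter displacement of an ordinary multi-step RL run started from the same base policy. The helper names `cosine_similarity` and `direction_agreement` are hypothetical, and parameters are treated as flat NumPy arrays.

```python
import numpy as np


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two arrays, flattened to vectors."""
    u = np.ravel(u)
    v = np.ravel(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def direction_agreement(theta_base, theta_rl, one_step_update):
    """Compare the one-step update with the net displacement of a full RL run.

    theta_base:      parameters of the base policy
    theta_rl:        parameters after a conventional multi-step RL run (assumed available)
    one_step_update: the single-step update direction to be evaluated
    """
    displacement = np.asarray(theta_rl) - np.asarray(theta_base)
    return cosine_similarity(displacement, one_step_update)


# Usage with made-up vectors: a parallel update and an orthogonal one.
base = np.zeros(3)
rl_final = np.array([1.0, 2.0, 3.0])
print(direction_agreement(base, rl_final, np.array([2.0, 4.0, 6.0])))
print(direction_agreement(base, rl_final, np.array([3.0, 0.0, -1.0])))
```

A value near 1 would suggest the single step already points where the full trajectory ends up, while a value near 0 would suggest the multi-step run moves in directions the first gradient does not capture.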

Cross-subset consistency

Correct vs. incorrect decomposition

Relation to no-RL alternatives

Failure modes

Fallback outcome

The broader claim

Closing