April 12, 2026
Note: see research proposal slide deck
Reinforcement learning with verifiable rewards, or RLVR, has become a key ingredient in language-model post-training.
It works especially well in domains like:
But RLVR is expensive.
Standard pipelines often require:
The default assumption is that this long iterative process is necessary.
The core question of this proposal is:
One-Shot RLVR is based on a simple hypothesis:
In that regime, RLVR may not be doing extended search.
It may just be:
Recent work suggests the RLVR training trajectory is often surprisingly linear.
Across algorithms and model sizes:
This suggests the optimization direction may not change much over training.
Other work suggests pretrained models are surrounded by many nearby task-expert solutions in parameter space.
These solutions appear to be:
They are specialists rather than generalists, but they may still be close enough to reach with minimal movement.
Put together, these observations suggest:
That leads to the main question:
Standard RLVR allocates compute to depth:
One-Shot RLVR allocates compute to breadth:
The idea is to spend compute estimating the first step extremely well, rather than repeatedly re-estimating similar steps over time.
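A minimal sketch of why breadth might help, under toy assumptions that are not part of the proposal: a softmax policy over 50 discrete "answers", a binary verifier that accepts 5 of them, and a REINFORCE-style estimate with a batch-mean baseline. All names (`exact_gradient`, `estimated_gradient`, `mean_cosine`) are hypothetical. The sketch measures how well a single-batch gradient estimate aligns with the exact gradient as the number of rollouts grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def exact_gradient(logits, reward):
    # Gradient of expected reward w.r.t. the logits of a softmax policy:
    # p_a * (r_a - E[r]).
    p = softmax(logits)
    return p * (reward - p @ reward)

def estimated_gradient(logits, reward, n_rollouts, rng):
    # REINFORCE estimate from sampled rollouts with a batch-mean baseline.
    p = softmax(logits)
    actions = rng.choice(len(p), size=n_rollouts, p=p)
    r = reward[actions]                  # binary verifier labels (0 or 1)
    adv = r - r.mean()                   # advantage relative to batch mean
    onehot = np.eye(len(p))[actions]     # grad of log pi(a) is onehot(a) - p
    return ((onehot - p) * adv[:, None]).mean(axis=0)

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0                       # no signal (e.g. no correct rollouts)
    return float(a @ b / (na * nb))

def mean_cosine(n_rollouts, trials=20, seed=0):
    # Toy setup (assumption): uniform base policy, 5 of 50 answers verified correct.
    rng = np.random.default_rng(seed)
    logits = np.zeros(50)
    reward = np.zeros(50)
    reward[:5] = 1.0
    g_true = exact_gradient(logits, reward)
    sims = [cosine(estimated_gradient(logits, reward, n_rollouts, rng), g_true)
            for _ in range(trials)]
    return float(np.mean(sims))

if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, round(mean_cosine(n), 3))
```

In this toy setting the alignment with the exact gradient improves as the batch grows, which is the sense in which a very large first batch could pin down the initial direction. Whether the same holds for a real language model is exactly what the proposal sets out to test.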
Start from the base model
Sample a very large number of rollouts
Use relatively high temperature to increase coverage and diversity
Use a binary verifier to label trajectories as correct or incorrect
Examples:
The update direction may be good, but the correct step size is unknown
A step that is too small wastes signal
A step that is too large overshoots the useful region
So instead of committing to one learning rate:
Standard RLVR may be wasting compute re-estimating the same direction over and over
If the training trajectory is mostly linear, then later steps are not discovering new behavior
They are mostly continuing along the direction already found near the beginning
In that view, standard RLVR looks like:
One-Shot RLVR instead says:
So the real tradeoff may be:
The main risk is overshooting
A large update could move the model past the good region and into a worse one
That is why line search is central
We explicitly evaluate different update magnitudes and choose the best one
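The line search could be sketched as follows. This is an illustration under stated assumptions rather than the proposal's implementation: `line_search`, the step-size grid, and the toy `evaluate` function (a stand-in for held-out verifier accuracy) are all hypothetical.

```python
import numpy as np

def line_search(theta0, direction, evaluate, step_sizes):
    """Score theta0 + alpha * direction for each alpha and return the best alpha."""
    scores = [evaluate(theta0 + alpha * direction) for alpha in step_sizes]
    best = int(np.argmax(scores))
    return step_sizes[best], scores

# Toy stand-in for held-out accuracy (assumption): the score peaks at
# theta_star, and the update direction is only roughly aligned with it.
theta_star = np.array([1.0, 0.2])

def evaluate(theta):
    return -float(np.sum((theta - theta_star) ** 2))

theta0 = np.zeros(2)
direction = np.array([1.0, 0.0])
step_sizes = [0.25, 0.5, 1.0, 2.0, 4.0]

best_alpha, scores = line_search(theta0, direction, evaluate, step_sizes)
```

In this toy, the smallest steps leave most of the improvement unused and the largest steps score worse than not moving far at all, so the best step size sits in the middle of the grid. The direction is computed once and only cheap evaluations are repeated.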
Another useful free diagnostic is entropy
If entropy collapses sharply as step size increases:
The expected pattern is:
The cleanest experiment is to compare one-shot training with standard RLVR at a fixed rollout budget
Example:
Both use the same total amount of rollout data
The only difference is how that compute is allocated
If one-shot RLVR matches or approaches the performance of standard RLVR, that would suggest much of the sequential training was unnecessary
This proposal also connects to methods that skip RL entirely, such as:
These baselines are useful because they test whether policy gradient machinery is even needed
Still, One-Shot RLVR preserves something important:
The method may fail when the base model pass rate is too low
If the large rollout batch contains almost no correct examples:
Code may be harder than math because rewards are often sparser
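A back-of-the-envelope sketch of the sparsity concern, assuming rollouts on a prompt are independent and share one pass rate (a simplification). The function names and the example numbers are hypothetical, not figures from the proposal.

```python
import math

def expected_correct(pass_rate, n_rollouts):
    # Expected number of verified-correct rollouts in the batch.
    return pass_rate * n_rollouts

def prob_no_correct(pass_rate, n_rollouts):
    # Probability that the whole batch contains zero correct rollouts.
    return (1.0 - pass_rate) ** n_rollouts

def rollouts_needed(pass_rate, confidence=0.95):
    # Smallest batch size that yields at least one correct rollout
    # with probability >= confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - pass_rate))

if __name__ == "__main__":
    print(expected_correct(0.001, 1000))
    print(prob_no_correct(0.001, 1000))
    print(rollouts_needed(0.01))
```

Under these assumptions, a prompt with a 0.1% pass rate still has better than a one-in-three chance of producing no correct rollout in a batch of 1000, which is where a single large batch would carry no learning signal for that prompt.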
Very small models may also be poor candidates because:
Another risk is overfitting to the sampled batch rather than learning a robust generalizable direction
Even if a single step proves too aggressive or too brittle, the idea may still hold in a softened form
For example:
That would still be a meaningful simplification
So the main scientific object is not just whether k = 1 (a single update step) works
It is the broader Pareto frontier between batch size and number of steps
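One way to lay out the points on that frontier, as a sketch: hold the total number of rollouts fixed and enumerate how it can be split between the number of steps k and the per-step batch size. The helper name `allocations` and the budget used below are hypothetical, not values from the proposal.

```python
def allocations(total_rollouts, step_counts):
    """Return (k, batch_size) pairs with k * batch_size == total_rollouts."""
    pairs = []
    for k in step_counts:
        if total_rollouts % k == 0:      # keep only exact splits of the budget
            pairs.append((k, total_rollouts // k))
    return pairs

if __name__ == "__main__":
    # Hypothetical budget of 65,536 rollouts, from k = 1 (one-shot) up to 256 steps.
    for k, batch in allocations(65536, [1, 4, 16, 64, 256]):
        print(f"k={k:>3}  batch_size={batch}")
```

Each pair would then be trained and evaluated under the same protocol, so that any difference in final accuracy reflects the allocation rather than the amount of data.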
The provocative takeaway is:
If:
then standard RLVR may be solving the wrong optimization problem
It may be treating learning as a long sequential search problem when in reality the main challenge is just:
The central message of One-Shot RLVR is simple: