
Desirable Ingredients of an RLVR Algorithm

April 1, 2026

When designing a new algorithm for reinforcement learning from verifiable rewards (RLVR), there are certain properties we would ideally want it to satisfy. This write-up lists those ingredients.

1. Dense Reward Signal

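The problem: Binary pass/fail is maximally sparse. A solution that fails one edge case gets the same reward as total nonsense. With a group-relative baseline (as in GRPO-style methods), when all rollouts for a prompt pass or all fail, every advantage is zero, so that prompt contributes exactly zero gradient.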
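To make the zero-gradient case concrete, here is a minimal sketch. It assumes a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation); the function names, the `eps` constant, and the fraction-of-tests-passed reward are hypothetical choices for illustration, not a specific algorithm's implementation.

```python
from statistics import mean, pstdev


def binary_reward(test_results):
    """1.0 only if every test passes, else 0.0 (sparse pass/fail)."""
    return 1.0 if all(test_results) else 0.0


def fraction_reward(test_results):
    """Hypothetical denser alternative: fraction of tests passed."""
    return sum(test_results) / len(test_results)


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of rollouts for a prompt.

    Assumed GRPO-style normalization; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A near-miss (fails one edge case) and total nonsense (fails everything).
near_miss = [True, True, True, False]
nonsense = [False, False, False, False]

# Under binary reward the two are indistinguishable.
same_binary = binary_reward(near_miss) == binary_reward(nonsense)

# Under the denser reward the near-miss scores higher.
dense_gap = fraction_reward(near_miss) - fraction_reward(nonsense)

# If every rollout in the group fails (or every one passes), all
# advantages are zero, so the policy-gradient update vanishes.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A mixed group is the only case that yields a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

In this sketch, `all_fail` and `all_pass` are lists of zeros, while `mixed` gives the single passing rollout a positive advantage and the failures a negative one. The denser reward separates the near-miss from nonsense even when neither passes outright.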

2. On-Policy Training

3. Credit Assignment Beyond Sequence-Level

4. Resistance to Reward Hacking

5. Exploration / Support Expansion

6. No (or Minimal) External Dependencies

7. Efficient Use of Environment Feedback

8. Generalization Beyond Verifiable Domains