Self-Critic Policy Optimization
Transforming binary verifier signals into dense, on-policy rewards.
RL post-training for large language models in verifiable domains — math,
code, formal proofs — has converged on a remarkably simple recipe: roll
out the policy, hand the rollout to a checker, and use the resulting
binary reward to drive a group-relative objective (GRPO).
GRPO works, scales, and is the de facto standard. But it leaves nearly
every signal the environment gives us on the table. This note proposes
Self-Critic Policy Optimization (SCPO): a family of methods that
re-conditions the same policy on hindsight to turn a one-bit verifier
signal into a dense, on-policy reward — without training a separate
reward model and without imitating a teacher.
How do you transform a binary verifier signal into a dense, on-policy
reward — using only the policy itself?
Where we are
Group Relative Policy Optimization (GRPO)
dropped the critic, the value function, and the learned reward model — all
of which are unstable at LLM scale — and replaced them with a single trick:
normalize advantages within a group of rollouts against an outcome
reward \(r \in \{0, 1\}\). It is simple, stable, and the entire RLVR
ecosystem is built on top of it.
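For reference, the group-relative advantage is usually written as follows. The group size \(G\), the rollout index \(i\), and the small stabilizer \(\epsilon\) are notation introduced here for illustration; they do not appear elsewhere in this note.

$$
\widehat{A}_i \;=\;
\frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
     {\operatorname{std}(r_1, \dots, r_G) + \epsilon},
\qquad i = 1, \dots, G.
$$

Every rollout in the group is compared only against its siblings, which is what lets GRPO drop the learned value function.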
The trade-off. Killing the critic means the only supervision left is
a single bit per rollout. Process quality, near-misses, trajectory structure,
compiler errors, partial test passes, proof-checker traces — everything
else the environment hands us is compressed to \(0/1\) and discarded.
The 1-bit problem
What goes wrong when the only signal is \(r \in \{0, 1\}\)?
- Advantage collapse. If every rollout in a group scores 0 (or every one scores 1), the group-relative advantage is identically zero. No gradient, no learning.
- Sparsity. A 300-token chain-of-thought receives a single scalar. The model cannot tell which part of the trajectory helped.
- Near-miss equals catastrophic failure. A solution that was correct except for an arithmetic slip looks identical, gradient-wise, to incoherent output.
- Environment feedback discarded. Compiler errors, partial test passes, and proof-checker traces are all collapsed to \(0/1\).
Figure 1. Advantage collapse under a binary verifier. When all
rollouts in a group fail (or all succeed), the group-relative advantage
is identically zero. The hardest problems — exactly the ones we'd most
like to learn from — produce no gradient at all.
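Advantage collapse is easy to reproduce numerically. The sketch below is a minimal illustration, assuming the common mean/std normalization with a small `eps`; the function name `group_advantages` and the example groups are ours, not part of any particular GRPO implementation.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Center rewards by the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

mixed = group_advantages([1, 0, 0, 1])     # some rollouts pass, some fail
all_fail = group_advantages([0, 0, 0, 0])  # hardest prompts: every rollout fails
all_pass = group_advantages([1, 1, 1, 1])  # easiest prompts: every rollout passes

print(mixed)     # two positive and two negative advantages
print(all_fail)  # all zeros: the group contributes no gradient
print(all_pass)  # all zeros: the group contributes no gradient
```

Only the mixed group produces a learning signal; the uniform groups are silently skipped by the update.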
The self-distillation wave
A cluster of papers in early 2026 (OPSD, SDPO, SDFT, and several others)
converged on a shared recipe within weeks of one another:
- Student pass. The policy generates a rollout normally, conditioned on the prompt \(x\) alone.
- Teacher pass. The same policy re-runs with privileged context: the verifier result \(r\), the execution trace, and environment feedback \(\phi\).
- Distill teacher into student via a KL objective.
The teacher is better informed than the student because it has hindsight.
Distillation gives token-level supervision without an external PRM. The
rest of this note builds on that two-pass architecture but argues the
distillation step is the wrong way to use it.
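To make the distillation step concrete, here is a minimal sketch of a token-level KL loss between the teacher and student next-token distributions. It assumes both passes expose logits over the same vocabulary at each position of the rollout; the function names and the toy logits are illustrative, and which KL direction a given paper uses is not specified here, so both are shown.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(student_logits, teacher_logits, reverse=False):
    """Mean per-token KL. Forward is KL(teacher || student);
    reverse is KL(student || teacher)."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        ps, pt = softmax(s), softmax(t)
        total += kl(ps, pt) if reverse else kl(pt, ps)
    return total / len(student_logits)

# Toy rollout of two positions over a three-token vocabulary.
student_logits = [[2.0, 0.5, 0.1], [0.3, 0.2, 0.1]]
teacher_logits = [[2.0, 0.5, 0.1], [3.0, 0.2, 0.1]]  # hindsight sharpens position 2

forward = distill_loss(student_logits, teacher_logits)
reverse = distill_loss(student_logits, teacher_logits, reverse=True)
print(forward, reverse)
```

Either direction pulls the student's distribution toward the teacher's at every token, which is the imitation behavior the next paragraph argues against.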
Why imitation isn't the right abstraction. A student who takes a
math test gets graded; the teacher does not solve the problem
for them. Imitation removes the need to reason. Concretely: KL-matching
a hindsight-conditioned teacher has been shown to degrade
reasoning by collapsing the
"aha moment" distribution. Reverse-KL also concentrates probability mass
on the teacher's preferred modes, hurting exploration.
Reframing: teacher as evaluator, not generator
Keep the two-pass architecture. Drop the KL-to-teacher objective. Use the
teacher pass purely to produce a rich scalar evaluation
\(\tilde{r}(\tau)\) of the student's rollout, then plug \(\tilde{r}\) into
a GRPO-style update. The teacher becomes a scoring function we get for
free from the same model, conditioned on privileged hindsight. This is the
within-policy, privileged-information sibling of Hindsight Experience
Replay (HER): HER relabels failed trajectories with achieved goals; we
relabel the reward using a hindsight-conditioned forward pass of the same
policy.
Figure 2. The supervision shift. GRPO sees a degenerate two-spike
distribution over reward; SCPO uses the teacher pass to produce a
continuous-valued \(\tilde{r}(\tau)\) that distinguishes a near-miss from
incoherent output and a clever solution from a lucky one.
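The practical consequence of the shift can be shown with the same group-relative normalization GRPO uses. In the sketch below the dense values are made-up stand-ins for \(\tilde{r}(\tau)\), chosen only to illustrate the effect; they are not measurements.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Center rewards by the group mean and scale by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group where every rollout fails the verifier.
binary_rewards = [0, 0, 0, 0]
# Hypothetical dense scores for the same four rollouts:
# the third is a near-miss, the others are far from correct.
dense_rewards = [0.05, 0.10, 0.80, 0.15]

binary_adv = group_advantages(binary_rewards)
dense_adv = group_advantages(dense_rewards)

print(binary_adv)  # all zeros: the group is wasted
print(dense_adv)   # the near-miss gets a positive advantage, the rest negative
```

With a dense score, an all-fail group still ranks its rollouts, so the hardest prompts contribute gradient instead of being dropped.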
The two-pass framework
Concretely, one optimization step looks like this:
1. First pass. Policy \(\pi_\theta\) generates rollout \(\tau\) given prompt \(x\).
2. Verifier / environment. Compute outcome \(r \in \{0, 1\}\) and any extra feedback \(\phi(\tau)\): trace, test results, error messages.
3. Second pass (teacher). Re-condition \(\pi_\theta\) on \((x, \tau, r, \phi)\): the rollout plus hindsight.
4. Self-critic mechanism. Extract a richer scalar \(\tilde{r}(\tau) \in \mathbb{R}\) from the teacher pass.
5. Policy update. Optimize \(\pi_\theta\) on first-pass rollouts using \(\tilde{r}\) in a GRPO-style objective:
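$$
\mathcal{L}_{\text{SCPO}}(\theta) \;=\;
-\,\mathbb{E}_{\tau \sim \pi_\theta}\!\left[
\sum_{t=0}^{|\tau|-1}
\log \pi_\theta(a_t \mid s_t) \cdot
\widehat{A}\bigl(\tilde{r}(\tau)\bigr)
\right],
$$
where \(\widehat{A}(\cdot)\) is the group-relative advantage computed from
the dense scalar \(\tilde{r}\) rather than the one-bit \(r\). The advantage
is treated as a constant when differentiating with respect to \(\theta\),
so \(\nabla_\theta \mathcal{L}_{\text{SCPO}}\) is the usual policy-gradient
estimator with \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) weighted by
\(\widehat{A}\). The five variants below all instantiate step (4),
extracting \(\tilde{r}\), but differ in where the scalar comes from.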
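One way to read the objective is as a surrogate loss: the advantage-weighted sum of token log-probabilities, averaged over the group. The sketch below computes that number in plain Python under several simplifying assumptions of ours: advantages are held constant, there is no clipping or KL penalty, and `token_logprobs` is a hypothetical list of per-token log-probabilities already gathered from the first pass.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Center rewards by the group mean and scale by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scpo_surrogate_loss(token_logprobs, dense_rewards):
    """Negative mean over rollouts of A_i * sum_t log pi(a_t | s_t).

    In an autodiff framework the advantages would be detached so that
    gradients flow only through the log-probabilities.
    """
    advantages = group_advantages(dense_rewards)
    per_rollout = [adv * sum(lps) for adv, lps in zip(advantages, token_logprobs)]
    return -mean(per_rollout)

# Toy group of three rollouts with different lengths.
token_logprobs = [
    [-0.2, -0.1, -0.3],  # rollout 0
    [-1.0, -0.9],        # rollout 1
    [-0.5, -0.4, -0.6],  # rollout 2
]
dense_rewards = [0.9, 0.1, 0.4]  # hypothetical r_tilde values

loss = scpo_surrogate_loss(token_logprobs, dense_rewards)
flat = scpo_surrogate_loss(token_logprobs, [0.5, 0.5, 0.5])
print(loss, flat)
```

The value of the loss is not meaningful on its own; what matters is its gradient, which raises the log-probability of rollouts with positive advantage and lowers it for the rest. When all dense rewards coincide the loss is zero, mirroring advantage collapse.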
Figure 3. The two-pass SCPO loop. The same policy \(\pi_\theta\)
runs once to generate a rollout, then again with the verifier result and
environment feedback as privileged context. The second pass produces a
dense scalar \(\tilde{r}\) used to compute the group-relative advantage.
No external reward model, no teacher to imitate.
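The loop can be sketched as a single data-collection function. Everything below is a schematic under stated assumptions: `generate`, `verify`, and `critic` are hypothetical callables standing in for the first pass, the verifier, and the hindsight-conditioned second pass, and the toy stand-ins exist only so the example runs end to end.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Center rewards by the group mean and scale by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scpo_collect(prompt, generate, verify, critic, group_size=4):
    """Run steps 1-4 for one prompt; return rollouts and dense advantages."""
    rollouts = [generate(prompt) for _ in range(group_size)]   # step 1
    outcomes = [verify(prompt, tau) for tau in rollouts]       # step 2: (r, phi)
    dense = [critic(prompt, tau, r, phi)                       # steps 3-4
             for tau, (r, phi) in zip(rollouts, outcomes)]
    return rollouts, group_advantages(dense)                   # input to step 5

# Toy stand-ins (not a real policy, verifier, or teacher pass).
candidates = iter(["3", "7", "4", "5"])
generate = lambda prompt: next(candidates)
# phi is the distance from the correct answer, a stand-in for richer feedback.
verify = lambda prompt, tau: (int(tau == "4"), abs(int(tau) - 4))
# Toy r_tilde in (0, 1]; a real critic would be a second forward pass.
critic = lambda prompt, tau, r, phi: 1.0 / (1.0 + phi)

rollouts, advantages = scpo_collect("What is 2 + 2?", generate, verify, critic)
for tau, adv in zip(rollouts, advantages):
    print(tau, round(adv, 3))
```

Keeping the critic behind a plain callable is a deliberate choice in this sketch: any of the mechanisms for extracting \(\tilde{r}\) can be swapped in without touching the rest of the loop.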
Five ways to extract the scalar
Step (4) is the design space. We see at least five candidate mechanisms,
none of which is obviously dominant. The table below sketches the
trade-offs; subsections expand each.
| Variant | Mechanism | Params | Risk |
| --- | --- | --- | --- |
| A — head | Scalar head \(h_\psi\) reads teacher hidden state | ~10⁶ | Needs an auxiliary objective |
| B — operator | Logit divergence / logprob ratio / hidden-state probe | 0 | May capture base-model artifacts, not task quality |
| C — judge | Teacher generates pairwise judgments, parsed via Bradley–Terry | 0 | Calibration depends on prompting |
| D — IRL | Discriminator on teacher vs student rollouts (GAIL-style) | ~10⁷ | Adversarial training instability |
| E — online critic | Separate model trained online from teacher signals | 10⁸–10¹⁰ | Cost; needs careful update schedule |
Variant A — self-critic head
Add a small scalar head \(h_\psi\) on top of the policy. During the teacher
pass, \(h_\psi\) reads the hidden state and outputs
\(\tilde{r}(\tau) \in \mathbb{R}\). The head is trained jointly with the
policy: an auxiliary loss matches \(\tilde{r}\) to the binary \(r\), while
the richer teacher context it reads acts as a regularizer. Tiny
parameter overhead; on-policy by construction.
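A minimal sketch of such a head, assuming a linear layer with a sigmoid trained by binary cross-entropy against \(r\). The sigmoid (which bounds the score to \((0, 1)\)), the loss choice, and the function names are our assumptions; `hidden` stands in for a teacher-pass hidden state and is a toy vector here.

```python
import math

def head_score(hidden, weights, bias):
    """Linear scalar head followed by a sigmoid, so the score lies in (0, 1)."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def bce(score, r):
    """Binary cross-entropy anchoring the head's score to the verifier bit r."""
    return -(r * math.log(score) + (1 - r) * math.log(1.0 - score))

def sgd_step(hidden, r, weights, bias, lr=0.1):
    """One gradient step on the BCE loss; d(loss)/dz = score - r."""
    err = head_score(hidden, weights, bias) - r
    new_weights = [w - lr * err * h for w, h in zip(weights, hidden)]
    return new_weights, bias - lr * err

hidden = [0.5, -1.0, 2.0]           # toy teacher hidden state
weights, bias = [0.0, 0.0, 0.0], 0.0
r = 1                               # the verifier accepted this rollout

loss_before = bce(head_score(hidden, weights, bias), r)
weights, bias = sgd_step(hidden, r, weights, bias)
loss_after = bce(head_score(hidden, weights, bias), r)
print(loss_before, loss_after)
```

Because the head outputs a continuous score, two rollouts with the same verifier bit can still receive different values of \(\tilde{r}\); whether those differences track quality is the open question.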
Variant B — implicit critic from the teacher pass
Zero new parameters. Derive \(\tilde{r}\) from quantities the teacher
forward pass already computes:
- Teacher–student logit divergence over \(\tau\) — where does hindsight sharpen the distribution?
- Teacher log-probability of \(\tau\) conditioned on \((r, \phi)\).
- Hidden-state geometry: last-layer norm, probe-based score.
- DPO-style preference framing: teacher's logprob ratio as implicit reward.
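As one illustration of the operator family, the sketch below computes a log-probability ratio between the teacher pass and the student pass over the same rollout tokens. The scale `beta`, the optional length normalization, and the function name are assumptions of ours; the per-token log-probabilities are toy values.

```python
def implicit_reward(teacher_logprobs, student_logprobs, beta=1.0,
                    length_normalize=True):
    """beta * sum_t [log pi(a_t | x, r, phi, a_<t) - log pi(a_t | x, a_<t)]."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("both passes must score the same rollout tokens")
    ratio = sum(t - s for t, s in zip(teacher_logprobs, student_logprobs))
    if length_normalize:
        ratio /= max(len(teacher_logprobs), 1)
    return beta * ratio

# Toy per-token log-probabilities of one rollout under each conditioning.
student = [-1.2, -0.7, -2.0]   # conditioned on the prompt only
teacher = [-0.9, -0.6, -1.1]   # conditioned on prompt plus hindsight

r_tilde = implicit_reward(teacher, student)
print(r_tilde)
```

A positive value means hindsight makes the rollout more likely under the policy. Whether that correlates with task quality rather than base-model artifacts is exactly the risk listed for this variant.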
Variant C — LLM-as-judge teacher pass
Prompt the teacher pass to produce a reward in natural language,
then parse it. Pairwise comparisons across a group of \(k\) rollouts
(\(\binom{k}{2}\) pairs) aggregated via Bradley–Terry give better calibration
than pointwise scoring. The
open scaling question: does more compute on the evaluator side — longer
teacher chain-of-thought, more pairwise comparisons — translate into a
better policy?
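Aggregating pairwise judgments into per-rollout scores can be sketched with the standard minorization-maximization update for the Bradley–Terry model. The pseudo-count `prior`, the iteration count, and the use of log-strengths as \(\tilde{r}\) are our assumptions; the win matrix is a toy example rather than real judge output.

```python
import math

def bradley_terry(wins, iters=200, prior=0.5):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of times rollout i was preferred over rollout j.
    `prior` adds a pseudo-count to every ordered pair so strengths stay finite
    even when a rollout never wins or never loses.
    """
    k = len(wins)
    w = [[(wins[i][j] + prior) if i != j else 0.0 for j in range(k)]
         for i in range(k)]
    p = [1.0] * k
    for _ in range(iters):
        new_p = []
        for i in range(k):
            total_wins = sum(w[i])
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j])
                        for j in range(k) if j != i)
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [x * k / norm for x in new_p]   # fix the overall scale
    return [math.log(x) for x in p]         # log-strengths as dense rewards

# Toy judgments over k = 3 rollouts: 0 beats 1 and 2, and 1 beats 2.
wins = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
scores = bradley_terry(wins)
print(scores)
```

The resulting scores are only defined up to an additive constant, which is harmless here because the group-relative advantage subtracts the group mean anyway.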
Variant D — IRL / GAIL-style from teacher rollouts
Frame this as inverse RL without external expert demonstrations.
The teacher pass produces its own rollouts — effectively the model's "best
guess with hindsight." Treat those as expert demonstrations, train a
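discriminator \(D_\omega\) to distinguish teacher from student rollouts,
and use \(\log D_\omega(\tau)\) as the reward for the GRPO update on the
student. (The discriminator parameters are written \(\omega\) because
\(\phi\) already denotes environment feedback.)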
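A deliberately small sketch of the discriminator idea, assuming each rollout has already been mapped to a fixed-length feature vector (a hypothetical preprocessing step) and using logistic regression in place of a neural discriminator. The feature values and hyperparameters are illustrative only.

```python
import math

def discriminator_prob(features, weights, bias):
    """D(tau): estimated probability that a rollout came from the teacher pass."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def discriminator_step(teacher_feats, student_feats, weights, bias, lr=0.5):
    """One logistic-regression step: teacher rollouts are labeled 1, student 0."""
    batch = [(f, 1.0) for f in teacher_feats] + [(f, 0.0) for f in student_feats]
    grad_w, grad_b = [0.0] * len(weights), 0.0
    for feats, label in batch:
        err = discriminator_prob(feats, weights, bias) - label
        grad_w = [g + err * f for g, f in zip(grad_w, feats)]
        grad_b += err
    n = len(batch)
    return [w - lr * g / n for w, g in zip(weights, grad_w)], bias - lr * grad_b / n

def gail_reward(features, weights, bias):
    """Reward log D(tau): closer to 0 when the rollout looks teacher-like."""
    return math.log(discriminator_prob(features, weights, bias))

teacher_feats = [[1.0, 0.2], [0.9, 0.1]]   # toy features of teacher rollouts
student_feats = [[0.1, 0.8], [0.2, 0.9]]   # toy features of student rollouts

weights, bias = [0.0, 0.0], 0.0
for _ in range(50):
    weights, bias = discriminator_step(teacher_feats, student_feats, weights, bias)

teacher_like = gail_reward([0.8, 0.2], weights, bias)
student_like = gail_reward([0.1, 0.9], weights, bias)
print(teacher_like, student_like)
```

In a real run the discriminator and the policy are updated in alternation, which is where the instability noted in the table comes from.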
Variant E — separate online critic
Reintroduce a critic, but learn it online alongside the policy so
it cannot go stale. Trained on teacher-pass signals as labels; compatible
with any of A–D as the labelling source. The capacity is a knob — a tiny
head-on-policy (\(\sim\!10^6\) parameters), a separate small model
(\(\sim\!10^8\)), or a full reward model (\(\sim\!10^{10}\)) — which makes
a clean reward-model-scaling experiment fall out for free.
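The "online" part can be illustrated with a critic that is refit on fresh teacher-pass labels at every policy step. The sketch assumes a linear critic over hypothetical rollout features trained with squared error; the class name, feature vectors, and labels are all illustrative, and the policy update itself is omitted.

```python
class OnlineCritic:
    """Tiny linear critic updated by SGD on the latest teacher-pass labels."""

    def __init__(self, dim, lr=0.05):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict(self, feats):
        return sum(w * f for w, f in zip(self.w, feats)) + self.b

    def update(self, batch):
        """One SGD pass on squared error; batch is [(features, label), ...]."""
        for feats, label in batch:
            err = self.predict(feats) - label
            self.w = [w - self.lr * err * f for w, f in zip(self.w, feats)]
            self.b -= self.lr * err

critic = OnlineCritic(dim=2)
# Toy labels that a teacher pass might assign to two kinds of rollout.
batch = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.2)]

for step in range(200):
    critic.update(batch)   # refresh the critic on current-policy rollouts
    # ... the policy update using critic.predict(...) as r_tilde would go here

print(critic.predict([1.0, 0.0]), critic.predict([0.0, 1.0]))
```

Capacity enters only through the model behind `predict`, so the same loop would serve a small head, a separate small model, or a full reward model.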
A potential scaling law
Hypothesis. As we scale compute spent on the reward side —
teacher CoT length, number of pairwise comparisons, critic size — policy
performance improves in a predictable way. A reward-model scaling law
for RLVR.
Why this matters. If a model thinks for days on a hard problem, the
1-bit verifier is absurdly sparse. We need evaluators whose expressiveness
scales with the reasoning effort. A scaling law also gives a
principled story for compute allocation between generation and evaluation.
What we hope to see
- Reward-hacking resistance. An online, teacher-grounded critic should be harder to game than a static PRM.
- Mode diversity. Pass@\(k\) goes up, not just pass@1.
- Extended reasoning in the teacher pass. The self-teacher naturally spends more tokens when hindsight is informative.
- Performance scales with policy size. The teacher gets better as the policy gets better — a positive feedback loop, not a stale critic.
Experimental plan (sketch)
Baselines. GRPO as the primary comparison. The Jan 2026 self-distillation
wave (OPSD, SDPO, SDFT, SD-Zero, HDPO, π-Distill, FIPO, ERL). Classical RL
references: actor-critic, A2C / A3C, asymmetric actor-critic, PPO, DPO,
and LLM-as-a-judge applied separately (not in a teacher pass).
Benchmarks. GSM8K (math word problems), MBPP+ (Python coding),
LiveCodeBench v6 (contamination-resistant code).
Metrics. pass@1 (primary quality), pass@\(k\) (exploration / diversity
— central to our claim), training-time sample efficiency, reward-hacking
stress tests with adversarial verifier pairs, and compute-in → policy-out
scaling curves.
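For pass@\(k\), a commonly used choice is the unbiased estimator computed from \(n\) samples per problem of which \(c\) are correct. Whether the final evaluation uses this estimator is an assumption on our part; the sketch below only shows how it is computed.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 16 samples for one problem, 2 of them correct.
p1 = pass_at_k(16, 2, 1)
p8 = pass_at_k(16, 2, 8)
print(p1, p8)
```

Per-problem estimates are then averaged over the benchmark. A method that raises pass@1 by collapsing onto one mode would show little gain at larger \(k\), which is why both are tracked.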
Ablations. Self-critic parameterization (head / operator / separate
critic); reward source (logits / logprobs / hidden states / learned head);
granularity (outcome / span / token); teacher privilege (verifier result
only / + trace / + test cases / + errors); pairwise vs. pointwise judging;
teacher compute scaling; policy model size.
Initial results (projected targets)
Numbers below are projected targets: placeholders that anchor the
experimental plan above, not measurements. The baseline row reflects
published GRPO results at our planned model scale; SCPO entries are the
smallest improvements we'd consider publishable, not predictions. The
pass@8 column is the headline metric for our diversity / exploration
claim.
Figure 4. Sample training curves (placeholder, to be replaced with the
actual plot once a run lands).
| Method | GSM8K pass@1 | MBPP+ pass@1 | LCB-v6 pass@1 | GSM8K pass@8 | Reward hack ∆ |
| --- | --- | --- | --- | --- | --- |
| GRPO baseline | 78.4 | 64.2 | 22.1 | 87.5 | — |
| Self-distill (best) | 79.6 | 65.4 | 23.0 | 87.9 | +1.4 |
| SCPO-A (head) | 80.5 | 66.7 | 24.0 | 89.2 | +0.6 |
| SCPO-B (operator) | 80.1 | 66.1 | 23.6 | 88.8 | +0.5 |
| SCPO-C (judge, pairwise) | 81.3 | 67.5 | 24.8 | 90.4 | +0.4 |
Table 1. Projected targets at 7B-class scale, single-checkpoint,
matched-compute. Reward hack ∆: change in pass@1 on an
adversarial verifier-pair test set (lower is better). Numbers are
illustrative anchors — actual values pending training runs.
Open questions
- Granularity. Token-level (dense but noisy), outcome-level (clean but sparse), or segment / step / span? Likely an empirical question per domain.
- Parameterization. Pure mathematical operator on the teacher pass (no new parameters), small head on the policy, or separate critic / reward model?
- Reward source. Learned head, raw logits, logprobs, hidden states, teacher–student divergence, or a mixture? An ablation study candidate.
- Stretch. Does the self-critic scalar transfer to non-verifiable domains? Can we detect reward hacking from teacher–verifier disagreement?
Summary
The right abstraction is not to imitate a teacher, but to use the teacher
(same policy, privileged context) as an evaluator that turns the 1-bit
verifier signal into a dense, calibrated, on-policy reward. Five concrete
mechanisms instantiate this idea; an experimental plan and a candidate
scaling law are sketched above. The smallest version changes only
step (4), so it drops into any existing GRPO pipeline as a replacement for
the binary reward.