Self-Critic Policy Optimization

Transforming binary verifier signals into dense, on-policy rewards.

RL post-training for large language models in verifiable domains (math, code, formal proofs) has converged on a remarkably simple recipe: roll out the policy, hand the rollout to a checker, and use the resulting binary reward to drive a group-relative objective. GRPO (Group Relative Policy Optimization) works, scales, and is the de facto standard. But it leaves nearly every signal the environment gives us on the table. This note proposes Self-Critic Policy Optimization (SCPO): a family of methods that re-conditions the same policy on hindsight to turn a one-bit verifier signal into a dense, on-policy reward, without training a separate reward model and without imitating a teacher.

How do you transform a binary verifier signal into a dense, on-policy reward — using only the policy itself?

Where we are

Group Relative Policy Optimization dropped the critic, the value function, and the learned reward model — all of which are unstable at LLM scale — and replaced them with a single trick: normalize advantages within a group of rollouts against an outcome reward $r \in \{0, 1\}$. It is simple, stable, and the entire RLVR ecosystem is built on top of it.

The trade-off. Killing the critic means the only supervision left is a single bit per rollout. Process quality, near-misses, trajectory structure, compiler errors, partial test passes, proof-checker traces — everything else the environment hands us is compressed to $0/1$ and discarded.

The 1-bit problem

What goes wrong when the only signal is $r \in \{0, 1\}$?

Advantage collapse. If every rollout in a group scores 0 (or every one scores 1), the group-relative advantage is identically zero. No gradient, no learning.
Sparsity. A 300-token chain-of-thought receives a single scalar. The model cannot tell which part of the trajectory helped.
Near-miss equals catastrophic failure. A solution that was correct except for an arithmetic slip looks identical, gradient-wise, to incoherent output.
Environment feedback discarded. Compiler errors, partial test passes, proof-checker traces, all collapsed to $0/1$.
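Advantage collapse is easy to reproduce numerically. The sketch below assumes the common GRPO-style standardization (reward minus group mean, divided by group standard deviation plus a small constant); the function name and the `eps` value are illustrative choices, not taken from any particular implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group carries signal: successes get positive advantage, failures negative.
mixed = group_relative_advantages([1, 0, 0, 1])

# Uniform groups carry none: every advantage is exactly zero.
all_fail = group_relative_advantages([0, 0, 0, 0])
all_pass = group_relative_advantages([1, 1, 1, 1])
```

With a binary reward, any prompt the policy always fails (or always solves) contributes nothing to the update, whatever the group size.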

Figure 1. Advantage collapse under a binary verifier. When all rollouts in a group fail (or all succeed), the group-relative advantage is identically zero. The hardest problems — exactly the ones we'd most like to learn from — produce no gradient at all.

The self-distillation wave

A cluster of papers in early 2026 — OPSD, SDPO, SDFT, and several others — converged on a shared recipe within weeks of one another:

Student pass. The policy generates a rollout normally, conditioned on the prompt $x$ alone.
Teacher pass. The same policy re-runs with privileged context — the verifier result $r$, the execution trace, environment feedback $\phi$.
Distill teacher into student via a KL objective.

The teacher is better informed than the student because it has hindsight, so distillation gives token-level supervision without an external process reward model (PRM). The rest of this note builds on that two-pass architecture but argues the distillation step is the wrong way to use it.

Why imitation isn't the right abstraction. A student who takes a math test gets graded; the teacher does not solve the problem for them. Imitation removes the need to reason. Concretely: KL-matching a hindsight-conditioned teacher has been shown to degrade reasoning by collapsing the "aha moment" distribution. Reverse-KL also concentrates probability mass on the teacher's preferred modes, hurting exploration.
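The mode-concentration point can be illustrated on a toy next-token distribution. The sketch below is an assumption-laden example, not a result from any of the cited papers: the three-token distributions are made up, and "reverse KL" is taken to mean KL(student || teacher).

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical teacher with two plausible continuations and one unlikely one.
teacher = [0.49, 0.49, 0.02]

# Two hypothetical students: one commits to a single teacher mode,
# the other spreads mass over all three tokens.
single_mode = [0.96, 0.02, 0.02]
spread_out = [0.34, 0.33, 0.33]

# Reverse KL, KL(student || teacher), is lower for the single-mode student.
reverse_single = kl(single_mode, teacher)
reverse_spread = kl(spread_out, teacher)

# Forward KL, KL(teacher || student), is lower for the spread-out student.
forward_single = kl(teacher, single_mode)
forward_spread = kl(teacher, spread_out)
```

In this toy case a reverse-KL objective rewards dropping one of the teacher's two modes, which is the exploration cost described above.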

Reframing: teacher as evaluator, not generator

Keep the two-pass architecture. Drop the KL-to-teacher objective. Use the teacher pass purely to produce a rich scalar evaluation $\tilde{r}(\tau)$ of the student's rollout, then plug $\tilde{r}$ into a GRPO-style update. The teacher becomes a scoring function we get for free from the same model, conditioned on privileged hindsight.

Note. This is the within-policy, privileged-information sibling of Hindsight Experience Replay (HER). HER relabels failed trajectories with achieved goals; we relabel the reward using a hindsight-conditioned forward pass of the same policy.

Figure 2. The supervision shift. GRPO sees a degenerate two-spike distribution over reward; SCPO uses the teacher pass to produce a continuous-valued $\tilde{r}(\tau)$ that distinguishes a near-miss from incoherent output and a clever solution from a lucky one.

The two-pass framework

Concretely, one optimization step looks like this:

First pass. Policy $\pi_\theta$ generates rollout $\tau$ given prompt $x$.
Verifier / environment. Compute outcome $r \in \{0, 1\}$ and any extra feedback $\phi(\tau)$ — trace, test results, error messages.
Second pass (teacher). Re-condition $\pi_\theta$ on $(x, \tau, r, \phi)$: the rollout plus hindsight.
Self-critic mechanism. Extract a richer scalar $\tilde{r}(\tau) \in \mathbb{R}$ from the teacher pass.
Policy update. Optimize $\pi_\theta$ on first-pass rollouts using $\tilde{r}$ in a GRPO-style objective:

$$ \mathcal{L}_{\text{SCPO}}(\theta) \;=\; -\,\mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{|\tau|-1} \log \pi_\theta(a_t \mid s_t) \cdot \widehat{A}\bigl(\tilde{r}(\tau)\bigr) \right], $$

where $\widehat{A}(\cdot)$ is the group-relative advantage computed from the dense scalar $\tilde{r}$ rather than the one-bit $r$. The advantage is treated as a constant when differentiating, so with the sampled rollouts held fixed the gradient of this surrogate is the usual score-function estimator $-\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\widehat{A}$. The five variants below all instantiate step (4), extracting $\tilde{r}$, but differ in where the scalar comes from.
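For concreteness, one way to write the group-relative advantage over the dense scalar is the standardized form below. The group size $G$, the rollout index $i$, and the stabilizer $\epsilon$ are notation assumed here for illustration; other normalizations (for example mean-centering only) would fit the same framework.

$$ \widehat{A}_i \;=\; \frac{\tilde{r}(\tau_i) - \mu_G}{\sigma_G + \epsilon}, \qquad \mu_G = \frac{1}{G}\sum_{j=1}^{G} \tilde{r}(\tau_j), \qquad \sigma_G = \sqrt{\frac{1}{G}\sum_{j=1}^{G}\bigl(\tilde{r}(\tau_j) - \mu_G\bigr)^2}. $$

Setting $\tilde{r} = r \in \{0, 1\}$ recovers the binary case, where $\sigma_G = 0$ and every $\widehat{A}_i = 0$ whenever all rollouts in the group share the same outcome.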

Figure 3. The two-pass SCPO loop. The same policy $\pi_\theta$ runs once to generate a rollout, then again with the verifier result and environment feedback as privileged context. The second pass produces a dense scalar $\tilde{r}$ used to compute the group-relative advantage. No external reward model, no teacher to imitate.
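The loop can be sketched end to end with stub components. Everything below is a toy stand-in under stated assumptions: `generate`, `verify`, and `self_critic` are hypothetical callables standing in for the first pass, the verifier, and the teacher pass, and the partial-credit heuristic is invented purely so the example runs without a model.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple

def scpo_advantages(
    prompt: str,
    generate: Callable[[str], str],                   # first pass
    verify: Callable[[str, str], Tuple[int, str]],    # returns (r, phi)
    self_critic: Callable[[str, str, int, str], float],  # second pass
    group_size: int = 4,
    eps: float = 1e-6,
) -> List[Tuple[str, float]]:
    """Return (rollout, advantage) pairs for one group, using the dense scalar."""
    rollouts = [generate(prompt) for _ in range(group_size)]
    dense = []
    for tau in rollouts:
        r, phi = verify(prompt, tau)                  # binary outcome plus feedback
        dense.append(self_critic(prompt, tau, r, phi))
    mu, sigma = mean(dense), pstdev(dense)
    return [(tau, (s - mu) / (sigma + eps)) for tau, s in zip(rollouts, dense)]

# --- Toy stand-ins (hypothetical, for illustration only) ---
_canned = iter(["5", "3", "22", "banana"])

def toy_generate(prompt: str) -> str:
    return next(_canned)

def toy_verify(prompt: str, tau: str) -> Tuple[int, str]:
    ok = int(tau == "4")
    return ok, ("correct" if ok else "expected 4, got " + tau)

def toy_self_critic(prompt: str, tau: str, r: int, phi: str) -> float:
    # Placeholder for a teacher pass: partial credit for numeric near-misses.
    if r == 1:
        return 1.0
    if tau.isdigit():
        return 1.0 / (1.0 + abs(int(tau) - 4))
    return 0.0

result = scpo_advantages("2+2=?", toy_generate, toy_verify, toy_self_critic)
```

All four toy rollouts fail the verifier, so a binary reward would give zero advantage everywhere; the dense scalar still ranks the near-misses above the non-numeric answer.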

Five ways to extract the scalar

Step (4), the self-critic mechanism, is the design space. We see at least five candidate mechanisms, none of which is obviously dominant. The table below sketches the trade-offs; the subsections expand each.

Variant	Mechanism	Params	Risk
A — head	Scalar head h_ψ reads teacher hidden state	~10⁶	Needs an auxiliary objective
B — operator	Logit divergence / logprob ratio / hidden-state probe	0	May capture base-model artifacts, not task quality
C — judge	Teacher generates pairwise judgments, parsed via Bradley–Terry	0	Calibration depends on prompting
D — IRL	Discriminator on teacher vs student rollouts (GAIL-style)	~10⁷	Adversarial training instability
E — online critic	Separate model trained online from teacher signals	10⁸–10¹⁰	Cost; needs careful update schedule

Variant A — self-critic head

Add a small scalar head $h_\psi$ on top of the policy. During the teacher pass, $h_\psi$ reads the hidden state and outputs $\tilde{r}(\tau) \in \mathbb{R}$. The head is trained jointly with the policy to match the binary $r$, with the richer teacher context acting as a regularizer. Tiny parameter overhead; on-policy by construction.
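A minimal sketch of the head and its matching term, assuming a linear head over a fixed hidden-state vector and a binary cross-entropy fit to $r$. The two-dimensional hidden states, learning rate, and function names are hypothetical; the auxiliary or regularizing objective is left out because the note does not specify it.

```python
import math
from typing import List, Sequence, Tuple

def head_score(w: Sequence[float], b: float, h: Sequence[float]) -> float:
    """Raw scalar output of a linear head on a hidden state h."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def bce_step(
    w: Sequence[float], b: float, h: Sequence[float], r: int, lr: float = 0.1
) -> Tuple[List[float], float]:
    """One SGD step on binary cross-entropy between sigmoid(head) and r."""
    p = 1.0 / (1.0 + math.exp(-head_score(w, b, h)))
    g = p - r  # derivative of BCE with respect to the logit
    new_w = [wi - lr * g * hi for wi, hi in zip(w, h)]
    return new_w, b - lr * g

# Hypothetical teacher-pass hidden states for one passing and one failing rollout.
h_pass, h_fail = [1.0, 0.0], [0.0, 1.0]

w, b = [0.0, 0.0], 0.0
for _ in range(50):
    w, b = bce_step(w, b, h_pass, 1)
    w, b = bce_step(w, b, h_fail, 0)

score_pass = head_score(w, b, h_pass)
score_fail = head_score(w, b, h_fail)
```

The pre-sigmoid score is what would serve as $\tilde{r}$ here, since it stays continuous even when the head is confident about the binary label.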

Variant B — implicit critic from the teacher pass

Zero new parameters. Derive $\tilde{r}$ from quantities the teacher forward pass already computes:

Teacher–student logit divergence over $\tau$ — where does hindsight sharpen the distribution?
Teacher log-probability of $\tau$ conditioned on $(r, \phi)$.
Hidden-state geometry: last-layer norm, probe-based score.
DPO-style preference framing: teacher's logprob ratio as implicit reward.
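The logprob-ratio option can be sketched as follows, assuming per-token log-probabilities of the same rollout are available from both passes. The function names and the scale `beta` are hypothetical, and whether a larger ratio actually tracks task quality is exactly the open risk listed in the table.

```python
from typing import List, Sequence

def per_token_ratio(
    teacher_logprobs: Sequence[float], student_logprobs: Sequence[float]
) -> List[float]:
    """Token-wise log pi(a_t | hindsight context) - log pi(a_t | prompt only)."""
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("both passes must score the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

def logprob_ratio_reward(
    teacher_logprobs: Sequence[float],
    student_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Sequence-level implicit reward: beta times the summed log-ratio."""
    return beta * sum(per_token_ratio(teacher_logprobs, student_logprobs))

# Made-up log-probabilities for a three-token rollout.
reward = logprob_ratio_reward([-0.1, -0.2, -0.3], [-0.5, -0.4, -0.3])
```

The per-token list doubles as a candidate token-level signal, while the sum gives an outcome-level scalar for the group-relative advantage.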

Variant C — LLM-as-judge teacher pass

Prompt the teacher pass to produce a reward in natural language, then parse it. Pairwise comparisons across a group of $k$ rollouts ($\binom{k}{2}$ pairs) aggregated via Bradley–Terry give better calibration than pointwise scoring. The open scaling question: does more compute on the evaluator side (longer teacher chain-of-thought, more pairwise comparisons) translate into a better policy?
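The aggregation step can be sketched with the standard minorization-maximization update for Bradley–Terry strengths. The input format (a list of `(winner, loser)` index pairs parsed from the judge), the smoothing pseudo-count `prior`, and the choice to return log-strengths as $\tilde{r}$ are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

def bradley_terry_scores(
    k: int,
    comparisons: Sequence[Tuple[int, int]],  # (winner, loser) rollout indices
    iters: int = 200,
    prior: float = 0.5,  # virtual wins per ordered pair, keeps strengths finite
) -> List[float]:
    """Fit Bradley-Terry strengths by MM updates; return log-strengths."""
    if k < 2:
        raise ValueError("need at least two rollouts to compare")
    wins = [[prior if i != j else 0.0 for j in range(k)] for i in range(k)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0
    p = [1.0] * k
    for _ in range(iters):
        new_p = []
        for i in range(k):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(k) if j != i
            )
            new_p.append(total_wins / denom)
        scale = k / sum(new_p)  # fix the arbitrary scale: mean strength 1
        p = [x * scale for x in new_p]
    return [math.log(x) for x in p]

# Rollout 0 beats 1 and 2; rollout 1 beats 2.
scores = bradley_terry_scores(3, [(0, 1), (0, 2), (1, 2)])
```

The resulting scores are continuous and ordered by judged quality, so they can feed the group-relative advantage directly.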

Variant D — IRL / GAIL-style from teacher rollouts

Frame this as inverse RL without external expert demonstrations. The teacher pass produces its own rollouts, effectively the model's "best guess with hindsight." Treat those as expert demonstrations, train a discriminator $D_\omega$ to distinguish teacher vs. student, and use $\log D_\omega$ as the reward for the GRPO update on the student.
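A sketch of the two quantities involved, assuming the discriminator exposes a scalar logit per rollout and is trained with the usual binary cross-entropy (teacher rollouts labelled 1, student rollouts labelled 0). The function names are hypothetical, and the logits are made-up numbers rather than outputs of a trained model.

```python
import math
from typing import Sequence

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def discriminator_loss(
    teacher_logits: Sequence[float], student_logits: Sequence[float]
) -> float:
    """Binary cross-entropy: push D toward 1 on teacher, toward 0 on student."""
    teacher_term = -sum(log_sigmoid(z) for z in teacher_logits) / len(teacher_logits)
    # log(1 - sigmoid(z)) equals log_sigmoid(-z)
    student_term = -sum(log_sigmoid(-z) for z in student_logits) / len(student_logits)
    return teacher_term + student_term

def student_reward(logit: float) -> float:
    """Reward for a student rollout: log D, where D = sigmoid(logit)."""
    return log_sigmoid(logit)

# A discriminator that separates the two sources has low loss ...
separated = discriminator_loss([5.0], [-5.0])
# ... while an uninformative one sits at 2 * log(2).
uninformative = discriminator_loss([0.0], [0.0])
```

Student rollouts that the discriminator mistakes for teacher rollouts receive rewards near zero, and clearly student-like ones receive large negative rewards.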

Variant E — separate online critic

Reintroduce a critic, but learn it online alongside the policy so it cannot go stale. Trained on teacher-pass signals as labels; compatible with any of A–D as the labelling source. The capacity is a knob — a tiny head-on-policy ($\sim\!10^6$ parameters), a separate small model ($\sim\!10^8$), or a full reward model ($\sim\!10^{10}$) — which makes a clean reward-model-scaling experiment fall out for free.

A potential scaling law

Hypothesis. As we scale compute spent on the reward side — teacher CoT length, number of pairwise comparisons, critic size — policy performance improves in a predictable way. A reward-model scaling law for RLVR.
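One way such a law could be tested, assuming it takes a power-law form (error proportional to evaluator compute raised to a negative exponent), is an ordinary least-squares fit in log-log space. The power-law form is an assumption of this sketch, not a claim of the note, and the data points below are synthetic, generated from a known curve only to check the fit.

```python
import math
from typing import Sequence, Tuple

def fit_power_law(
    compute: Sequence[float], error: Sequence[float]
) -> Tuple[float, float]:
    """Fit error ~ a * compute ** (-b) by least squares on logs; return (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic points from error = 2 * compute ** -0.5 (not experimental data).
synthetic_compute = [1.0, 10.0, 100.0, 1000.0]
synthetic_error = [2.0 * c ** -0.5 for c in synthetic_compute]
a_hat, b_hat = fit_power_law(synthetic_compute, synthetic_error)
```

Here `error` would be something like one minus pass@1 and `compute` the evaluator-side budget (teacher tokens, comparison count, or critic size); a stable exponent across those axes is what the hypothesis predicts.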

Why this matters. If a model thinks for days on a hard problem, the 1-bit verifier is absurdly sparse. We need evaluators whose expressiveness scales with the reasoning effort. A scaling law also gives a principled story for compute allocation between generation and evaluation.

What we hope to see

Reward-hacking resistance. An online, teacher-grounded critic should be harder to game than a static PRM.
Mode diversity. Pass@$k$ goes up, not just pass@1.
Extended reasoning in the teacher pass. The self-teacher naturally spends more tokens when hindsight is informative.
Performance scales with policy size. The teacher gets better as the policy gets better — a positive feedback loop, not a stale critic.

Experimental plan (sketch)

Baselines. GRPO as the primary comparison. The early-2026 self-distillation wave (OPSD, SDPO, SDFT, SD-Zero, HDPO, π-Distill, FIPO, ERL). Classical RL references: actor-critic, A2C / A3C, asymmetric actor-critic, PPO, DPO, and LLM-as-a-judge applied separately (not in a teacher pass).

Benchmarks. GSM8K (math word problems), MBPP+ (Python coding), LiveCodeBench v6 (contamination-resistant code).

Metrics. pass@1 (primary quality), pass@$k$ (exploration / diversity, central to our claim), training-time sample efficiency, reward-hacking stress tests with adversarial verifier pairs, and compute-in → policy-out scaling curves.
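Since pass@$k$ carries the diversity claim, it helps to pin down the estimator. The sketch below uses the widely adopted unbiased estimator introduced alongside the HumanEval benchmark, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct; that this is the estimator the eventual experiments will use is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 5 correct: pass@1 is the plain success rate.
half = pass_at_k(10, 5, 1)
```

Per-problem estimates are then averaged over the benchmark. Sampling $n > k$ rollouts per problem lowers the variance relative to drawing exactly $k$.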

Ablations. Self-critic parameterization (head / operator / separate critic); reward source (logits / logprobs / hidden states / learned head); granularity (outcome / span / token); teacher privilege (verifier result only / + trace / + test cases / + errors); pairwise vs. pointwise judging; teacher compute scaling; policy model size.

Initial results

Numbers below are projected targets: placeholders that anchor the experimental plan above, not measured results. The baseline row reflects published GRPO results at our planned model scale; SCPO entries are the smallest improvements we would consider publishable, not predictions. The pass@8 column is the headline metric for our diversity / exploration claim.

Figure 4. Sample training curves (placeholder, pending an actual training run).

Method	GSM8K p@1	MBPP+ p@1	LCB-v6 p@1	GSM8K p@8	Hack ∆
GRPO baseline	78.4	64.2	22.1	87.5	—
Self-distill (best)	79.6	65.4	23.0	87.9	+1.4
SCPO-A (head)	80.5	66.7	24.0	89.2	+0.6
SCPO-B (operator)	80.1	66.1	23.6	88.8	+0.5
SCPO-C (judge, pairwise)	81.3	67.5	24.8	90.4	+0.4

Table 1. Projected targets at 7B-class scale, single-checkpoint, matched-compute. Reward hack ∆: change in pass@1 on an adversarial verifier-pair test set (lower is better). Numbers are illustrative anchors — actual values pending training runs.

Open questions

Granularity. Token-level (dense but noisy), outcome-level (clean but sparse), or segment / step / span? Likely an empirical question per domain.
Parameterization. Pure mathematical operator on the teacher pass (no new parameters), small head on the policy, or separate critic / reward model?
Reward source. Learned head, raw logits, logprobs, hidden states, teacher–student divergence, or a mixture? An ablation study candidate.
Stretch. Does the self-critic scalar transfer to non-verifiable domains? Can we detect reward hacking from teacher–verifier disagreement?

Summary

The right abstraction is not "imitate a teacher" but "use the teacher (same policy, privileged context) as an evaluator" that turns the 1-bit verifier signal into a dense, calibrated, on-policy reward. Five concrete mechanisms instantiate this idea; an experimental plan and a candidate scaling law are above. The smallest version drops in as a step-(4) replacement inside any existing GRPO pipeline.

Related work

Self-distillation wave (Jan–Apr 2026). On-Policy Self-Distillation (OPSD), Self-Distillation Policy Optimization (SDPO), Self-Distillation Fine-Tuning (SDFT), HDPO (hybrid distillation policy optimization), SD-Zero (self-revision turns binary rewards into dense supervision), SRPO (unifying GRPO and self-distillation via sample routing), and several others published within weeks of each other.

Classical RL connections. The actor-critic family (including A2C, A3C, and asymmetric actor-critic): the self-critic head is the within-policy, privileged-information version. PPO is the outer optimizer style the self-critic reward plugs into. DPO gives the preference-pair framing for Variant B. Hindsight Experience Replay is the direct spiritual ancestor of the hindsight / teacher-with-privileged-context setup. GAIL and AIRL are the backbone for Variant D. Inverse RL is the formal framing we inherit without needing external demonstrations.

Acknowledgments

Thanks to my advisor and lab mates for early discussions on the teacher-as-evaluator framing.