Research Ideas - Spring 2026

I’ll expand on these at some point.

Research ideas:

Self-distillation reward models
1. Self-distillation without imitation (SDRM, main idea)
  1. SDRM-MSE: regress on the log-ratio between student and teacher tokens/trajectories
  2. SDRM-BT: treat teacher tokens/trajectories as preferences -> induce DPO
2. Reward model scaling laws (sub idea)
3. Inference-time verification via self-distillation (extension of SDRM)
4. Using the reward model as a “guide”/“sampler”
5. Self-distillation world model (representation learning)
  - Some alternative names:
    - SDRM: Self-Distillation Reward Models (current)
    - SCPO: Self-Critic Policy Optimization
    - STPO: Student-Teacher Policy Optimization
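
Rough sketch of what the two SDRM variants (1.1 and 1.2) could look like as losses. This is only one possible reading of the notes, not a settled design: the function names, the `beta` parameter, and the choice of the student trajectory as the "rejected" sample are all assumptions. Log-probabilities are assumed to be precomputed floats; the BT variant follows the standard DPO loss form.

```python
import math


def sequence_log_ratio(teacher_logps, student_logps):
    """Sum over tokens of log pi_teacher - log pi_student for one trajectory.

    Both lists hold the log-probability each model assigns to the same tokens.
    """
    return sum(t - s for t, s in zip(teacher_logps, student_logps))


def sdrm_mse_loss(predicted_reward, teacher_logps, student_logps):
    """Assumed SDRM-MSE: squared error between a reward-model score
    and the teacher/student log-ratio used as the regression target."""
    target = sequence_log_ratio(teacher_logps, student_logps)
    return (predicted_reward - target) ** 2


def sdrm_bt_loss(policy_logp_teacher_traj, policy_logp_student_traj,
                 ref_logp_teacher_traj, ref_logp_student_traj, beta=0.1):
    """Assumed SDRM-BT: DPO-style Bradley-Terry loss where the teacher
    trajectory is 'chosen' and the student trajectory is 'rejected'."""
    margin = beta * ((policy_logp_teacher_traj - ref_logp_teacher_traj)
                     - (policy_logp_student_traj - ref_logp_student_traj))
    # -log(sigmoid(margin)) computed stably as softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the BT margin is zero and the loss is log 2, which is the usual DPO starting point.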
Learning from environment feedback (self-distillation and friends)
Teacher-guided self-distillation for self-play (similar to idea 1.4)
AlphaZero-style exploration in language model post-training
1. Right now, LMs just sample during policy gradient, so to solve a problem, the solution must already exist in distribution. But a solution is really just a correct sequence of tokens in a trajectory. Searching over all such sequences is combinatorially intractable, but what if we apply some AlphaGo/AlphaZero-style learning here? The “rules of the game” are an understanding of language. Without giving the model that understanding, we have AlphaGo. With it, we have AlphaZero.
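
To make the search idea a bit more concrete, here is a minimal sketch of a PUCT-style selection step over next tokens, in the spirit of AlphaZero's tree search. Everything here is an assumption for illustration: the `Node` class, `select_child`, `backup`, and the `c_puct` value are hypothetical, the priors are assumed to come from the LM policy, and the value is assumed to come from some verifier or reward model. The `+ 1` under the square root is a common variant so that priors still rank children before any visits; it is not the exact published formula.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                  # P(token | prefix), assumed to come from the LM
    visits: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)  # token -> Node

    def q(self):
        """Mean value of this node, 0.0 if it was never visited."""
        return self.value_sum / self.visits if self.visits else 0.0


def select_child(node, c_puct=1.5):
    """Pick the child token maximizing Q + U (PUCT-style exploration bonus)."""
    total = sum(child.visits for child in node.children.values())

    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + u

    return max(node.children.items(), key=score)[0]


def backup(path, value):
    """Propagate a rollout value (e.g. a verifier score) along the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += value
```

The point of the sketch: a low-prior token can still get selected once high-prior tokens have been tried and scored badly, which is the exploration behavior that plain sampling lacks.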