Research Ideas - Spring 2026
A laundry list of some research ideas, c. Spring 2026
I’ll expand on these at some point.
Research ideas:
- Self-distillation reward models
- Self-distillation without imitation (SDRM, main idea)
- SDRM-MSE: MSE on the log-ratio between student and teacher tokens/trajectories
- SDRM-BT (Bradley-Terry): treat teacher tokens/trajectories as preferences -> induces DPO
- Reward model scaling laws (sub idea)
- Inference-time verification via self-distillation (extension of SDRM)
- Using the reward model as a “guide”/“sampler”
- Self-distillation world model (representation learning)
- Some alternative names:
- SDRM: Self-Distillation Reward Models (current)
- SCPO: Self-Critic Policy Optimization
- STPO: Student-Teacher Policy Optimization
- Self-distillation without imitation (SDRM, main idea)
- Learning from environment feedback (self-distillation and friends)
- Teacher-guided self-distillation for self-play (similar to idea 1.4)
- AlphaZero-style exploration in language model post-training
- Right now, LMs just sample from their own policy during policy-gradient training, so for a problem to be solved, its solution must already exist in distribution. But if you think about it, a solution is really just a correct sequence of tokens in a trajectory. Searching over all such sequences is, of course, combinatorially intractable, but what if we could apply AlphaGo/AlphaZero-style learning here? The “rules of the game” would be an understanding of language. Without giving the model that understanding, we have AlphaGo. With it, we have AlphaZero.
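
To make the AlphaZero-style exploration idea a bit more concrete, here is a rough sketch of what token-level tree search could look like. It is only an illustration under stated assumptions: each tree edge is one next token, the prior on each edge is assumed to come from the policy LM, and the reward is assumed to come from some terminal check on the finished trajectory. The names `Node`, `select_child`, `backup`, and `c_puct` are hypothetical, and the selection rule is a PUCT-style variant (with a `+ 1` under the square root so unvisited nodes still get an exploration bonus), not a settled design.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One search-tree node; each child edge corresponds to one next token."""
    prior: float                 # assumed: P(token | prefix) supplied by the policy LM
    visit_count: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)  # token -> Node

    def mean_value(self) -> float:
        # Average backed-up reward; 0.0 for nodes that were never visited.
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def select_child(node: Node, c_puct: float = 1.5):
    """Return the (token, child) pair with the highest PUCT-style score Q + U."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(child: Node) -> float:
        # Exploration bonus: large for high-prior, rarely visited tokens.
        bonus = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
        return child.mean_value() + bonus

    return max(node.children.items(), key=lambda kv: score(kv[1]))


def backup(path, value: float) -> None:
    """Propagate a terminal reward (assumed: e.g. 1.0 if the answer checks out) along the path."""
    for node in path:
        node.visit_count += 1
        node.value_sum += value
```

The point of the exploration bonus is that a low-prior token eventually gets tried once the high-prior token has been visited many times without reward, which is exactly the out-of-distribution exploration that plain sampling does not provide.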