RL Research Notebook

Rohan Sikand

This page documents my reinforcement learning research and serves as a place to gather my thoughts. It links to notes, research ideas, open problems I’m thinking about solving, experiment logs, results, and so on.

The focus is specifically on post-training foundation models with RL.

Note: this notebook is a rough draft of thoughts and experiments, not a set of polished final writeups.

Some interests:

long-horizon reasoning
online reward learning
self-distillation
reward modeling and reward shaping
continual learning
recursive self-improvement
RL for non-verifiable domains

Writeups

Open problems in RL post-training — March 31 2026
Laundry list of ideas - April 2026
Desirable Ingredients of a post-training RL algorithm — April 1 2026
Why does RL not forget when SFT does? — April 4 2026
Why does Reverse KL encourage mode collapse in RL? — April 4 2026
What are the problems with self-distillation? — April 4 2026
Why can’t RL solve problems not in the base model’s support? — April 4 2026
KL-constrained RL — April 4 2026
Idea: Toward Self-Critic Policy Optimization (SCPO) for RL Post-Training — April 12 2026
What are Neural Thickets? — April 12 2026
Idea: One-Step is all you Need (one-step RL) — April 12 2026
Problems not yet solved — April 12 2026
The RL data and fine-tuning market — April 13 2026
Async RL — April 14 2026
RL cheatsheet — April 14 2026

Experiments

Policy gradient on GridWorld — March 27 2026