RL cheatsheet

April 14, 2026

This is an ongoing collection of index-card-style notes and questions on RL. It is designed to be reviewed the way one would use flashcards when studying for a test.

How do you mitigate off-policyness in async RL?

Why did GRPO get rid of the value function?