KL-constrained RL

April 4, 2026

See [1], [2], [3], [4].

References

[1]
L. C. Tiao, “Density Ratio Estimation for KL Divergence Minimization between Implicit Distributions,” tiao.io, 2018. Available: https://tiao.io/post/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/
[2]
X. Wang and S. Zhang, “Choosing KL estimators in RL: From value unbiasedness to gradient correctness.” Dec. 01, 2025. Available: https://xihuai18.github.io/reinforcement-learning/2025/12/01/kl-estimators-en.html
[3]
K. Ding, “HDPO: Hybrid distillation policy optimization via privileged self-distillation,” arXiv preprint arXiv:2603.23871, 2026.
[4]
W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin, “Learning beyond teacher: Generalized on-policy distillation with reward extrapolation,” arXiv preprint arXiv:2602.12125, 2026.