[3] K. Ding, “HDPO: Hybrid distillation policy optimization via privileged self-distillation,” arXiv preprint arXiv:2603.23871, 2026.
[4] W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin, “Learning beyond teacher: Generalized on-policy distillation with reward extrapolation,” arXiv preprint arXiv:2602.12125, 2026.