Papers with Code paper Jun 2

Self-Distilled Policy Gradient

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward ...