Papers with Code paper May 28

ESPO: Early-Stopping Proximal Policy Optimization

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum hor...