We fixed the worst prompt variant. It got better. That doesn't mean the fix worked.

A pattern I've seen on more than one team: weekly eval run finishes, someone sorts the leaderboard,...
