More eval traces will not stabilize your kappa. Stratify the ones you have

TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63...
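For readers who want the metric pinned down, here is a minimal sketch of Cohen's kappa computed overall and per stratum. The formula is the standard one, kappa = (p_o - p_e) / (1 - p_e). The function names, the `strata` argument, and the toy labels are assumptions made for illustration. The post's own stratification scheme is not shown here, so this should not be read as the author's method.

```python
import math
from collections import Counter, defaultdict


def cohen_kappa(judge, human):
    """Cohen's kappa between two equal-length label sequences."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    judge_counts, human_counts = Counter(judge), Counter(human)
    p_e = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)


def kappa_by_stratum(judge, human, strata):
    """Hypothetical helper: kappa computed separately within each stratum."""
    groups = defaultdict(lambda: ([], []))
    for j, h, s in zip(judge, human, strata):
        groups[s][0].append(j)
        groups[s][1].append(h)
    return {s: cohen_kappa(js, hs) for s, (js, hs) in groups.items()}


# Toy labels, invented for illustration only.
judge = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
human = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "fail"]
strata = ["a", "a", "a", "a", "b", "b", "b", "b"]

overall = cohen_kappa(judge, human)
per_stratum = kappa_by_stratum(judge, human, strata)
```

Reporting kappa per stratum alongside the overall figure makes it visible when a swing in the headline number comes from a shift in the mix of traces rather than from a change in judge quality.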
