We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"

We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of...
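One way to put an interval on a weekly kappa is a percentile bootstrap over the sampled items. The sketch below is an illustration only, not the method from the original post: the function names, the resample count, and the convention of returning 1.0 when both raters use a single identical label are all assumptions.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_e == 1.0:
        # Assumed convention: both raters used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap interval for kappa.

    Items are resampled with replacement as (judge, human) pairs, so the
    pairing between the two label sets is preserved in every resample.
    """
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return cohen_kappa(a, b), lo, hi


# Hypothetical labels for one week (1 = pass, 0 = fail); not real data.
judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
human = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1]
kappa, lo, hi = bootstrap_kappa_ci(judge, human)
print(f"kappa={kappa:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

With a sample this small the interval is wide, which is the point of the title: if two weeks' intervals overlap heavily, the difference between their point estimates is weak evidence of a trend.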
