TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63...
More eval traces will not stabilize your kappa. Stratify the ones you have
I run RektRadar, a real-time scam-token detector for Ethereum. This is an honest build-log of one...
The cloud spent fifteen years teaching architects to think in availability zones, regional...
The hard parts of robotics are supposed to be perception, planning, and control. So why does so much...