My eval harness paid for itself on the first run: 0.57 → 0.96, two bugs no unit test could catch

I almost shipped a RAG pipeline that, on certain questions, cited exactly the right document — and...

