I shipped a RAG chatbot without measurement, then built a proper eval harness. Hit@1 went from 60% to 80% and hallucination dropped from 41% to 28%, but two metrics still fail. Here's the whole story.
Building a RAG Evaluation Harness That Actually Catches Problems