How do you build an expert-grade research QA benchmark without hand-authoring a single question? Mine the survey articles...

How do you build an expert-grade research QA benchmark without hand-authoring a single question? Mine the survey articles. Their section structure already encodes what a field asks and what a complete answer must contain, so a new benchmark distills 21K queries and grading rubrics from surveys across 75 fields. It doubles as a stress test: even the best system tops out at 75% rubric coverage, fully addressing under 11% of needed citations. https://benjaminhan.net/posts/20260625-researchqa/?utm_source=mastodon&utm_medium=social #LLMs #Evaluation #AI
