Mastodon discussion, 5d ago

How do you build an expert-grade research QA benchmark without hand-authoring a single question? Mine the survey articles...

by Benjamin Han

How do you build an expert-grade research QA benchmark without hand-authoring a single question? Mine the survey articles. Their section structure already encodes what a field asks and what a complete answer must contain, so a new benchmark distills 21K queries and grading rubrics from surveys across 75 fields. It doubles as a stress test: even the best system tops out at 75% rubric coverage, fully addressing under 11% of needed citations.

https://benjaminhan.net/posts/20260625-researchqa/?utm_source=mastodon&utm_medium=social

#LLMs #Evaluation #AI


Benchmark

Metadata

Account: BenjaminHan@sigmoid.social

