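Erica (@ericavaneee) has released TERMS-Bench, a three-stage benchmark for evaluating LLM agents in real-world economic negotiations. It uses the environment itself as the verifier, with no LLM-as-judge or outcome-based rubric. Among frontier models, Claude Opus 4.6 was reported in first place and GLM 5.1 in second.
https://x.com/ericavaneee/status/2055868536099381638
#llm #agents #benchmark #evaluation #anthropic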