#Benchmark | AI Hub

Papers with Code paper 5d ago

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

As large language models and harness frameworks continue to advance, agents operating in terminals are increasingly capable of performing a broader range of general computer-use ta...

Benchmark

21

Papers with Code paper 5d ago

Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events i...

Benchmark

21

Mastodon discussion 5d ago

How do you build an expert-grade research QA benchmark without hand-authoring a single question? Mine the survey articles...

How do you build an expert-grade research QA benchmark without hand-authoring a single question? Mine the survey articles. Their section structure already encodes what a field asks ...

Benchmark

9

Papers with Code paper 6d ago

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling ea...

Benchmark

21

Mastodon discussion 6d ago

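[Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World] https://huggingface.co/blog/ffasr-leaderboard Note: AI-generated automated post (headline + link) #AI #生成AI #LLM #AIGe...

[Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World] https://huggingface.co/blog/ffasr-leaderboard Note: AI-generated automated post (headline + link) #AI #生成AI #LLM #AIGenerated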

Hugging Face LLM Benchmark

9

Dev.to tutorial 6d ago

Stratagems #1: Mark Johnson Walked Into an AI Audit. The Benchmark Had Everything Figured Out — Except the Truth.

Complete preparation breeds complacency. What is seen every day no longer raises suspicion. The...

Benchmark

38

Dev.to tutorial 6d ago

My eval harness paid for itself on the first run: 0.57 → 0.96, two bugs no unit test could catch

I almost shipped a RAG pipeline that, on certain questions, cited exactly the right document — and...

Benchmark

12

GNews news Jun 24

Zorgm Pro achieves 98.5% on NEET PG benchmark with source

Mumbai (Maharashtra) [India], June 24: In a major breakthrough for clinical artificial intelligence, the LaennecAI Clinical Research Team announced today that Zorgm Pro, an educati...

Benchmark

18

AI Blogs (RSS) news Jun 24

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Benchmark

24

Papers with Code paper Jun 24

The Galaxy's Guide to the Tokenizer: A Benchmark for Scientific Foundation Models

Tokenization is central to adapting scientific data for transformer-based foundation models, yet its impact on learned representations remains poorly understood. We compare four to...

Benchmark

21

YouTube video Jun 23

NVIDIA Blackwell crushes every AI training benchmark #ai #nvidia #blackwell #ainews #technews #tech

NVIDIA Benchmark

54

Mastodon discussion Jun 23

🧠 Researchers have developed the first benchmark for measuring AI performance in marketing tasks. The benchmark enables ...

🧠 Researchers have developed the first benchmark for measuring AI performance in marketing tasks. The benchmark enables systematic evaluation of how well AI systems handle marketin...

Benchmark

9

Dev.to tutorial Jun 23

Eval-Driven Agent Development: How I Stopped Tuning Prompts on Vibes

Series context: This is a follow-up to How I Automate Parts of My SDLC with AI Agents. Earlier posts...

Benchmark

12

Dev.to tutorial Jun 23

An AI Feature Has No "Tests Pass" Moment. So I Write the Eval First.

I was building an "Ask This Book" feature: readers can ask questions about a book while they're...

Benchmark

12

GitHub Trending repo Jun 23

Emmimal/context-graph-benchmark: A pure-Python structured memory benchmark for multi-agent LLM systems — context graph vs vector RAG vs raw history dump, five scenarios, 18 graded queries, zero API calls.

A pure-Python structured memory benchmark for multi-agent LLM systems — context graph vs vector RAG vs raw history dump, five scenarios, 18 graded queries, zero API calls.

LLM RAG Benchmark

45

Dev.to tutorial Jun 23

I built a Rust entropy monitor to route LLM inference — here's what the benchmark showed

Frontier LLM inference is expensive. I wanted to see how far a 4B local model could go before needing...

LLM Benchmark

12

Mastodon discussion Jun 23

🧠 #GLM 5.2 has reached 44% pass@1 on #DeepSWE, becoming the first open-source model on the leaderboard. 👉 Details...

🧠 #GLM 5.2 has reached 44% pass@1 on #DeepSWE, becoming the first open-source model on the leaderboard. 👉 Details: https://www.linkedin.com/posts/alessiopomaro_glm-deeps...

Open Source Benchmark

9

Papers with Code paper Jun 23

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Text-to-image (T2I) generation models have achieved remarkable progress in producing visually realistic images from natural language prompts. Yet it remains unclear whether their s...

Image Generation Benchmark

21

Papers with Code paper Jun 23

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating spa...

Agents Benchmark

21

Hacker News discussion Jun 23

Why eval startups fail (2025)

Benchmark

66

Mastodon discussion Jun 22

RELEASE!!! Red Team AI Benchmark v2.0: From 12 Questions to 60 — A Technical Deep Dive - A major evolution in LLM offens...

RELEASE!!! Red Team AI Benchmark v2.0: From 12 Questions to 60 — A Technical Deep Dive - A major evolution in LLM offensive-security evaluation, built in collaboration with POXEK A...

LLM Benchmark Safety/Alignment

18

Dev.to tutorial Jun 22

Red Team AI Benchmark v2.0: From 12 Questions to 60 — A Technical Deep Dive

A major evolution in LLM offensive-security evaluation, built in collaboration with POXEK...

Benchmark Safety/Alignment

20

Papers with Code paper Jun 22

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

With the rapid spread of retrieval-augmented generation and semantic search, choosing the right embedding and retrieval configuration is increasingly hard. Large retrieval benchmar...

Benchmark

21

Mastodon discussion Jun 22

📢 CTI Benchmark: Anthropic's Fable 5 judged counterproductive for cyber defenders 📝 ## 🔍 Context Published on June 17, 20...

📢 CTI Benchmark: Anthropic's Fable 5 judged counterproductive for cyber defenders 📝 ## 🔍 Context Published on June 17, 2026 by Graphistry on their official blog, this article consti...

Anthropic Benchmark

9