/// AI HUB

#Benchmark

327 articles tagged with Benchmark

Mastodon discussion May 6

📢 HackerOne: Benchmark of GPT-5.5 vs Claude Opus 4.7 vs Sonnet 4.6 for vulnerability validation

📝 📅 **Source and context**: Article published on May 6, 2026 on the HackerOne blog...

OpenAI Anthropic Benchmark
9
Mastodon discussion May 6

📰 DeepSeek V4 AI Tops 2026 AI Benchmarks - 98.2% on MMLU, Outperforms Llama 3

DeepSeek V4 AI has emerged as a dominant force in machine learning, outperforming leading models across...

Benchmark
9
Mastodon discussion May 6

📰 2026 LLM Debate Benchmark: GPT-5.5, DeepSeek V4 Pro, and GLM-5.1 Ranked - Grok 4.3 Drops

The latest LLM debate benchmark reveals GPT-5.5 enters at 1574, while DeepSeek V4 Pro surg...

OpenAI xAI LLM
9
Mastodon discussion May 6

📰 Qwen 3.6 27B Quantization in 2026: IQ4_XS Delivers 98% BF16 Accuracy on 16GB VRAM

A detailed benchmark of Qwen 3.6 27B quantizations reveals IQ4_XS as the optimal balance of accur...

Benchmark
9
Mastodon discussion May 6

[Adding Benchmaxxer Repellant to the Open ASR Leaderboard] https://huggingface.co/blog/open-asr-leaderboard-private-data

Note: AI-generated automated post (headline + link). #AI #生成AI #LLM #AIGenerated

Hugging Face Benchmark
18
Mastodon discussion May 6

[QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard] https://huggingface.co/blog/tiiuae/qimma-arabic-leaderboard

Note: AI-generated automated post (headline + link). #AI #生成AI #LLM #AIGenerated

Hugging Face LLM Benchmark
18
Mastodon discussion May 6

🧠 Trader.ai provides a leaderboard platform where users can view the performance metrics of AI trading bots. The platform allows individuals to study and learn from the strategies ...

Benchmark
18
AI Blogs (RSS) news May 6

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Benchmark
24
Papers with Code paper May 6

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We p...

LLM Benchmark AI Hardware
21
Papers with Code paper May 6

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing bench...

Benchmark
21
Dev.to tutorial May 5

Why I spun my benchmark into its own repo (and why every dev tool with a benchmark should)

This week I shipped a benchmark for code-intelligence MCP servers and posted the results — including...

Benchmark
12
Mastodon discussion May 5

New CASIA Benchmark Exposes Fragmented Face Swapping Evaluation

CASIA researchers released a face swapping survey and benchmark on April 27, 2026, aiming to standardize evaluation a...

Benchmark
18
Mastodon discussion May 5

ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models

ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civi...

LLM Benchmark
18
Mastodon discussion May 5

📰 Agentic AI Boosts Trip Planning Accuracy to 77.4% | Ground Truth Benchmark 2026

A groundbreaking agentic AI framework achieves 77.4% accuracy in trip planning optimization, solvin...

Agents Benchmark
9
Dev.to tutorial May 5

The 3-Layer Eval Stack: Ground Truth, Judgment Patterns, and Feedback Loops That Compound Over Time

One of Wall Street's Best Law Firms Shipped AI Hallucinations Into Federal Court. Your Agent...

Benchmark
12
Papers with Code paper May 5

A Benchmark for Interactive World Models with a Unified Action Generation Framework

Achieving Artificial General Intelligence (AGI) requires agents that learn and interact adaptively, with interactive world models providing scalable environments for perception, re...

Benchmark
21
Papers with Code paper May 5

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks...

Benchmark
21
Dev.to tutorial May 4

How to diagnose where your RAG agent fabricates: an open-source A/B eval workflow with cross-lab blind judges

TL;DR: I caught my own RAG agent telling a customer with a severe nut allergy which dishes were...

RAG Open Source Benchmark
12
Mastodon discussion May 4

For years we’ve been pushing the most advanced AI, highlighting benchmark wins and explaining why it’s the best. People listen, but real-world adoption still lags. The problem isn’...

Benchmark
18
NewsData.io news May 4

Tigo Energy Breaks Global Growth Benchmark; Boosts U.S. Energy Feature in Predict+

LOS GATOS, Calif.--(BUSINESS WIRE)--Tigo Energy, Inc. (NASDAQ: TYGO) (“Tigo” or “Company”), a leading provider of intelligent solar and energy solutions, today announced that the P...

Benchmark
21
GitHub Trending repo May 4

joshawome/chainreason: A benchmark for evaluating LLM reasoning on Ethereum and DeFi tasks

LLM Benchmark
66
Papers with Code paper May 4

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. ...

Benchmark Robotics
21
Mastodon discussion May 3

Do LLMs understand coordinates? 'A new #benchmark called #GPSBench evaluates 14 #LLM-s across 17 coordinate manipulation and reasoning tasks and finds that models handle real-worl...

LLM Benchmark
9
Mastodon discussion May 3

📰 China's AI Advancement in 2026: DeepSeek Shatters US Benchmark Claims

China's AI progress, led by DeepSeek, contradicts recent US government claims of an eight-month lag. Governme...

Benchmark
9