#Benchmark | AI Hub

Mastodon discussion Jun 22

📢 CTI benchmark: Anthropic's Fable 5 judged counterproductive for cyber defenders 📝 🔍 Context: Published on June 17, 2026 by Graphistry on their official blog, this article consti...

Anthropic Benchmark

9

Dev.to tutorial Jun 21

Agent = Model x Harness: Your Eval Layer Is Part of the Agent, Not a Tool Beside It

There's a formula I keep coming back to when people ask why their slick demo agent falls apart in...

Benchmark

20

GitHub Trending repo Jun 19

briliwar0/glm-5.2-free-desktop-app: glm 5.2 free z.ai llm chatbot free api access local llm unsloth dynamic gguf llama.cpp ollama run coding assistant autonomous agent zcode opencode cloudflare workers ai hugging face model weights 1m context window long horizon tasks token bypass benchmark swe bench pro terminal bench setup installation windows macos linux

Hugging Face LLM Agents

60

Mastodon discussion Jun 19

Gemini 3.5 Flash disappoints on the Android benchmark: outperformed by earlier models and with higher-than-expected costs. Google has updated the results of Android Bench, the benchmark dedi...

OpenAI Google Benchmark

24

Mastodon discussion Jun 19

The latest AA-Briefcase benchmark shows that artificial intelligence still cannot cope with complex office tasks. The best models correctly solve just 3% of multi-st...

Benchmark

9

Mastodon discussion Jun 19

RT @NeoAIForecast: Fable's biggest return benchmark will not be reasoning or coding. It will be surviving the first 24 hours before Pliny jailbreaks it again. Pliny the Li...

Benchmark

18

Dev.to tutorial Jun 19

DeepSeek V4 Pro vs GPT-4o: Real Benchmark Comparison (June 2026)

I ran both models through...

Multimodal Benchmark

25

Mastodon discussion Jun 19

🔥 We just published our Q4 local planner benchmark comparing local AI models for browser automation:
• DiffusionGemma-26B-A4B-it: 0.35s median, 84% accuracy (fastest!)
• Gemma 4 12B ...

Google Benchmark

18

Mastodon discussion Jun 19

[QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard] https://huggingface.co/blog/tiiuae/qimma-arabic-leaderboard (Note: AI-generated automated post, headline + link) #AI #生成AI #LLM #AIGenerated

Hugging Face LLM Benchmark

18

Mastodon discussion Jun 19

[Benchmaxxer Repellant added to the Open ASR Leaderboard] https://huggingface.co/blog/open-asr-leaderboard-private-data (Note: AI-generated automated post, headline + link) #AI #生成AI #LLM #AIGenerated

Hugging Face Benchmark

18

NewsData.io news Jun 18

The Evolving Role of the Data, AI, and Analytics Executive: 2026 Benchmark Survey

The Evolving Role of the Data, AI, and Analytics Executive: 2026 Benchmark Survey - CDO Magazine

Benchmark

21

Mastodon discussion Jun 18

The leaderboard, sorted by executive and the teams underneath them, has a feature that shows users which employees have not earned the badges. “click to see who 👀,” the leaderboard...

Benchmark

18

Mastodon discussion Jun 18

Salesforce’s Internal AI Leaderboard Has Teams Competing for Little Trophies https://fed.brid.gy/r/https://www.404media.co/salesforces-internal-ai-leaderboard-has-teams-competing-f...

Benchmark

24

Dev.to tutorial Jun 18

I built a Homebrew for AI skills: install flow and eval harness inside

SkillForge is a local-first OSS tool that turns a plain-English engineering need into installable SKILL.md, README.md, and config.yaml files. Six LLM providers, LLM-as-judge eval h...

Benchmark

12

Mastodon discussion Jun 18

GLM-5.2 tops the open-weights leaderboard with a score of 51, Anthropic faces export control controversy over Fable 5, and Midjourney pivots to medical ultrasound hardware. https://...

Anthropic Benchmark

9

Dev.to tutorial Jun 18

Is AI Getting Quietly Dumber? A 24/7 Benchmark That Catches LLM Degradation

You've probably hit this before — yesterday the AI felt sharp, fixed your bug without you even...

LLM Benchmark

12

Dev.to tutorial Jun 18

Your Agent Passed Every Eval and Still Cost $4,000 a Day

Here is a failure mode nobody puts on their roadmap: the agent works. It answers correctly. It passes...

Benchmark

12

Papers with Code paper Jun 18

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional...

Benchmark

21

Papers with Code paper Jun 18

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive bench...

Benchmark

21

Dev.to tutorial Jun 18

I cut an AI agent's input tokens by 71% and quality held — here's the 66-task benchmark

I cut a coding agent's input tokens by 71% — from 5.07M down to 1.46M across a 66-task run — and...

Agents Benchmark

12

Mastodon discussion Jun 17

inferbench: download, launch & benchmark local LLM engines (llama.cpp & more) from one desktop app. Real tokens/sec on YOUR hardware, no invented numbers. Now serves models over M...

LLM Benchmark

18

Mastodon discussion Jun 17

🤖 Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and de...

OpenAI Benchmark

18

Dev.to tutorial Jun 17

We stopped writing eval cases by hand. Now every prod incident becomes one.

TL;DR: Hand-written eval cases test the failures you already imagined, which are never the ones that...

Benchmark

12

GitHub Trending repo Jun 17

SantanderAI/sota-stressed-datasets: Open benchmark datasets republished in stressed form to evaluate ML/LLM robustness. Curated by Santander AI Lab.

LLM Benchmark

39