Papers

Papers with Code paper May 31

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually u...

OpenAI LLM

Papers with Code paper May 31

HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, n...

Benchmark

Papers with Code paper May 31

Trust Region On-Policy Distillation

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhanceme...

Papers with Code paper May 31

SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces

Large language models are increasingly deployed as coding agents, shifting safety from individual responses to action sequences. Existing benchmarks, however, primarily assess whet...

LLM

Papers with Code paper May 31

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fa...

Papers with Code paper May 31

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or pr...

Papers with Code paper May 31

τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

Robotic manipulation requires models that generate executable actions while anticipating and evaluating their future consequences before physical execution. We present τ_0-World Mo...

Robotics

Papers with Code paper May 30

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challeng...

Agents

Papers with Code paper May 30

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practi...

RAG

Papers with Code paper May 30

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively upda...

Papers with Code paper May 30

SDR: Set-Distance Rewards for Radiology Report Generation

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. However, for chest X-ray report generation, the standard rewards (i.e. exa...

Papers with Code paper May 30

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve thes...

OpenAI Google Agents

Papers with Code paper May 30

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with us...

LLM

Papers with Code paper May 30

Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain

Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source domain. The source domain typically contains semantically meaningful sa...

Papers with Code paper May 30

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehens...

Benchmark

Papers with Code paper May 30

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essent...

Papers with Code paper May 30

Confidence-Adaptive SwiGLU for Mixture-of-Experts

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness (the smoothness and selectivity of the gating function) is typically fixed thro...

Papers with Code paper May 29

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model d...

Papers with Code paper May 29

How can embedding models bind concepts?

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with ...

Papers with Code paper May 29

αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion

Accurately modeling soft boundaries, e.g., hair and defocus blur, is a fundamental challenge in stereo conversion due to the ambiguous blending of foreground and background. Existi...

Papers with Code paper May 29

Task-Focused Memorization for Multimodal Agents

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory...

Multimodal

Papers with Code paper May 29

dMoE: dLLMs with Learnable Block Experts

GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

The Distillation Game: Adaptive Attacks & Efficient Defenses