Papers

Papers with Code paper May 28

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcome...

Papers with Code paper May 28

OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-...

LLM

Papers with Code paper May 28

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train ...

LLM

Papers with Code paper May 28

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchma...

Anthropic Benchmark

Papers with Code paper May 28

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch...

LLM

Papers with Code paper May 28

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by th...

Papers with Code paper May 28

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irr...

Multimodal

Papers with Code paper May 28

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression add...

Multimodal

Papers with Code paper May 28

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories, which overlooks semantic or acoustic content. Prior work has exp...

Multimodal

Papers with Code paper May 28

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supe...

Papers with Code paper May 28

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into ...

Multimodal

Papers with Code paper May 28

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental tra...

LLM Benchmark

Papers with Code paper May 28

Colored Noise Diffusion Sampling

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures ea...

Papers with Code paper May 28

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intractability of the policy likeli...

Papers with Code paper May 28

PhoneWorld: Scaling Phone-Use Agent Environments

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks...

Papers with Code paper May 28

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger mo...

Papers with Code paper May 28

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stal...

Multimodal

Papers with Code paper May 28

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multim...

Multimodal

Papers with Code paper May 28

GenClaw: Code-Driven Agentic Image Generation

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existin...

Agents

Papers with Code paper May 28

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains ...

Open Source

Papers with Code paper May 28

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may over...

Multimodal

NeuROK: Generative 4D Neural Object Kinematics

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
