Papers

Papers with Code paper 3d ago

RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative...

Multimodal

Papers with Code paper 3d ago

LLM Agents Can See Code Repositories

Coding agents powered by large language models have demonstrated strong performance on software engineering tasks. Yet most agents consume repositories almost entirely as text, whi...

LLM

Papers with Code paper 3d ago

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pr...

Multimodal Robotics

Papers with Code paper 3d ago

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet to...

Papers with Code paper 3d ago

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world ...

Papers with Code paper 3d ago

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on ...

Benchmark

Papers with Code paper 3d ago

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-im...

Papers with Code paper 3d ago

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short cli...

Papers with Code paper 3d ago

VISTA: View-Consistent Self-Verified Training for GUI Grounding

When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult ...

Papers with Code paper 4d ago

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limi...

Robotics

Papers with Code paper 4d ago

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and itera...

Papers with Code paper 4d ago

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their archi...

Agents

Papers with Code paper 4d ago

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabili...

Papers with Code paper 4d ago

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read lit...

Multimodal

Papers with Code paper 4d ago

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. ...

Papers with Code paper 4d ago

Surflo: Consistent 3D Surface Flow Model with Global State

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this...

Papers with Code paper 4d ago

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-aug...

Agents

Papers with Code paper 4d ago

MiniMax Sparse Attention

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointl...

Papers with Code paper 4d ago

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the in...

Papers with Code paper 4d ago

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by t...

LLM

Papers with Code paper 4d ago

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Avatar V: Scaling Video-Reference Avatar Video Generation

μ_0: A Scalable 3D Interaction-Trace World Model

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages