Papers

Papers with Code paper 1d ago

ReDesign: Recovering Editable Design Structures from Images via Agentic Decomposition

Recovering an editable design file from a raster image is a common and costly bottleneck in modern design workflows, yet remains challenging since editability depends on recovering...

Agents

Papers with Code paper 1d ago

HiFi-UMI: Learning Deployable Manipulation Policies from High-Fidelity UMI Data Alone

Learning deployable manipulation policies is bottlenecked by the scarcity of data that is both high-fidelity and scalable. Real-robot teleoperation is accurate but costly to scale;...

Robotics

Papers with Code paper 1d ago

Visual prompt engineering for video models

In the age of foundation models, a model is only as good as its prompt. For this reason, prompt engineering has become an essential technique for improving language model performan...

Papers with Code paper 1d ago

OmniDelta: Skill-Driven Budget Allocation for Token Compression in OmniLLMs

Emerging Omni-modal Large Language Models (OmniLLMs) enable unified understanding of text, audio, and video, but their long audio-video token sequences introduce substantial memory...

Papers with Code paper 1d ago

Wonder: Video World Model Done Better

We present Wonder, a general-purpose video world model for real-time, camera-controllable world exploration. Given an image or a conditional video, Wonder constructs a playable wor...

Papers with Code paper 1d ago

Shieldstral

We introduce Shieldstral, a 3B-parameter policy-adaptive multimodal safety classifier that matches or outperforms models nearly 7× its size on text safety benchmarks and sets a...

Papers with Code paper 2d ago

DecoupleMix: Decoupled Ratio Search and Convex Allocation for Scalable VLM Data Recipes

While data curation for Vision Language Models (VLMs) is increasingly active, public practice for constructing pretraining mixtures remains largely heuristic: practitioners stack d...

Papers with Code paper 2d ago

From Proprietary to Open-Source: Bridging the Distribution Gap via Multi-Agent Protocol Distillation in Agentic Search

Agentic search enables large language models to solve knowledge-intensive tasks by interleaving multi-step reasoning with retrieval, yet optimizing this with outcome-based reinforc...

Agents Open Source

Papers with Code paper 2d ago

ClinFusion: A Vision-Centric Multimodal LLM System for Holistic Medical Understanding

Multimodal large language models (MLLMs) hold immense potential to revolutionize clinical practice, yet deploying them in the medical domain is fundamentally a vision-centric chall...

LLM Multimodal

Papers with Code paper 2d ago

Data Pyramid for Embodied Manipulation

Multimodal foundation models learned to see and to speak by consuming the whole internet. Embodied agents admit no such shortcut, since they require data that couple observations w...

Robotics

Papers with Code paper 2d ago

WorldDiT: A Unified Diffusion Architecture for World and Action Modeling

Many recent robot policies pursue stronger control by using large pretrained vision-language models (VLMs) as the action backbone. We introduce WorldDiT, a unified diffusion transf...

Papers with Code paper 2d ago

Rethinking Classifier-Free Guidance in On-Policy Diffusion Distillation

On-policy distillation (OPD) adapts diffusion models by querying a teacher along trajectories generated by the current student, but how it should behave under classifier-free guida...

Papers with Code paper 2d ago

Sol-Attn: Accelerating Video Generation Inference via On-the-Fly Attention Sparsification

Diffusion transformers are essential for high-fidelity video generation, but long token sequences make attention a dominant inference bottleneck. Training-free dynamic sparse atten...

Papers with Code paper 2d ago

FilmBench: A Film-Grade Benchmark for Cinematic Video Generation

Progress in video generation keeps narrowing the visual gap between AI-generated and professionally produced footage, yet most benchmarks still draw prompts from web sources or LLM...

Benchmark

Papers with Code paper 2d ago

Kimi K3: Open Frontier Intelligence

We introduce Kimi K3, a 2.8T parameter Mixture-of-Experts model with 104 billion activated parameters, native vision capabilities, and a 1-million-token context window. Kimi K3 is ...

Papers with Code paper 2d ago

Mage-VL: An Efficient Codec-Native Streaming Multimodal Foundation Model

Standard vision-language models (VLMs) suffer from Moravec's paradox: they excel at complex offline visual reasoning but struggle with simple streaming perception tasks and process...

LLM Multimodal

Papers with Code paper 2d ago

A New Role for Relevance: Guiding Corpus Interaction in Agentic Search

Relevance is a query-dependent estimate of whether a document or excerpt contains useful evidence. Existing retrieval agents use relevance to select top-k content, but document rel...

Agents

Papers with Code paper 2d ago

PerceptionBench: Evaluating Atomic Visual Perception in Multimodal Large Language Models

We introduce PerceptionBench, a benchmark specifically designed to evaluate the atomic visual perception capabilities of Multimodal Large Language Models (MLLMs). Existing benchmar...

Multimodal

Papers with Code paper 2d ago

Keep It InMind: Benchmarking the Implicit-Association Blind Spot in Agent Memory

Long-term memory systems store what a user says in an external store and retrieve it when a related query arrives. This interface rests on an assumption so natural that it is rarel...

Papers with Code paper 2d ago

Towards Robust Reinforcement Learning for Small-Scale Language Model Agents

The alignment of Small Language Models (SLMs) in the 70–500M parameter range using reinforcement learning is often considered unstable, though the underlying failure mechanisms ha...

LLM

Papers with Code paper 2d ago

The Physics of Multi-Turn Long-Horizon Planning: From Pre-training to Post-training via Single- and Multi-Teacher On-Policy Agentic Distillation

Multi-turn long-horizon planning is critical for foundation model agents, yet how to fundamentally improve it remains unclear. Existing models are trained on uncontrollable and opa...

Agents

Papers with Code paper 2d ago

Evidence Attribution in Visual Document Understanding without Coordinates or Region Labels

Chamaileon: Cross-Context Binder Design with Contextualized Modeling and Mixed Sampling

OmniVAE: An Audio-Video VAE with Cross-Modal Alignment for Joint Generation