Papers | AI Hub

Papers with Code paper Jul 16

Video = World + Event Stream

We present Wan-Streamer v0.3, which reframes our native-streaming interaction model under a single organizing view: a video is a world plus an event stream. The world is the persis...

Papers with Code paper Jul 16

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

A growing gap separates inference context lengths from RL post-training: inference systems are approaching million-token contexts, while post-training workloads often remain at 256...

AI Hardware

Papers with Code paper Jul 16

SUFLECA: Scaling Up Feature Learning for CAD-to-image Alignment

CAD-to-image alignment aims to estimate an object's 9D pose (rotation, translation, and anisotropic scale) from a single RGB image, enabling applications in robotics and augmented ...

Safety/Alignment

Papers with Code paper Jul 16

RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Skills are a useful abstraction for software agents, turning human and agent experience into reusable procedural knowledge. Yet existing skill libraries are mostly hand-written, te...

Multimodal

Papers with Code paper Jul 16

Hierarchical Denoising For Multi-Step Visual Reasoning

Video models are evolving into vision foundation models, yet they still lack human-like multi-step reasoning. Streaming autoregressive diffusion models are efficient but limited in...

Papers with Code paper Jul 16

xHC: Expanded Hyper-Connections

Hyper-Connections (HC) expand the residual stream of Transformers into N parallel streams, providing a form of memory scaling beyond model width and depth. Manifold-Constrained HC ...

Papers with Code paper Jul 16

VideoChat3: Fully Open Video MLLM for Efficient and Generalist Video Understanding

Recent advances in video understanding have spanned motion, long video, and streaming interaction, driving this field toward real-world applications. Despite this progress, current...

Papers with Code paper Jul 16

VIABench: A Comprehensive Video Benchmark Collected from Blind Individuals for Visual Impairment Assistance

Visually impaired individuals (VIIs) encounter significant daily challenges due to limited access to visual information. Although Multimodal Large Language Models (MLLMs) have achi...

Benchmark

Papers with Code paper Jul 16

Xiaomi-Robotics-1: Scaling Vision-Language-Action Models with over 100K Hours of Real-World Trajectories

We present Xiaomi-Robotics-1, a foundational vision-language-action (VLA) model capable of (1) following diverse language instructions to perform a wide range of mobile manipulatio...

Multimodal

Papers with Code paper Jul 16

On-Policy Delta Distillation

On-policy distillation is an alternative post-training method in reinforcement learning that alleviates the constraints imposed by reward models by providing token-level supervisio...

Papers with Code paper Jul 16

Multi-Turn On-Policy Distillation with Prefix Replay

We study on-policy distillation (OPD) for agentic tasks, where an LLM agent interacts with an environment over multiple turns and a student imitates a teacher over these multi-turn...

Papers with Code paper Jul 16

Beyond Entropy: Correctness-Aware Advantage Shaping via Contrastive Policy Optimization

Reinforcement learning with verifiable rewards (RLVR) commonly uses entropy for advantage shaping. However, entropy cannot distinguish useful uncertainty from detrimental confusion...

Papers with Code paper Jul 15

AgentCompass: A Unified Evaluation Infrastructure for Agent Capabilities

As Large Language Models (LLMs) evolve into autonomous agents, the need for unified evaluation infrastructure becomes critical. However, current evaluation pipelines remain highly ...

Papers with Code paper Jul 15

GigaWorld-Policy-0.5: A Faster and Stronger WAM Empowered by AutoResearch

World Action Models (WAMs) improve robot policy learning by jointly modeling actions and future visual observations, using future scene evolution as dense supervision for physicall...

Papers with Code paper Jul 15

KnowAct-GUIClaw: Know Deeply, Act Perfectly, Personal GUI Assistant with Self-Evolving Memory and Skill

OpenClaw has emerged as a leading agent framework for complex task automation, yet it faces insufficient cross-platform GUI interaction support and a well-built self-evolution mech...

Papers with Code paper Jul 15

From Pixels to States: Rethinking Interactive World Models as Game Engines

Building interactive worlds that respond coherently to player actions has long been a shared goal of computer graphics, games, and artificial intelligence. Recent video generative ...

Papers with Code paper Jul 15

OvisOCR2 Technical Report

We introduce OvisOCR2, a 0.8B document parsing model. OvisOCR2 is designed as an end-to-end parser: given a document page image, it generates a Markdown representation in natural r...

Papers with Code paper Jul 15

Smarter and Cheaper at Once: Byte-Exact KV-Cache Grafting Turns a Frozen Small Model into a Verified-Knowledge Flywheel

We report a way to make a frozen small language model both more capable and dramatically cheaper at once, without changing any weights. Verified knowledge is deposited once as a by...

Google

Papers with Code paper Jul 15

Multi-Head Latent Control: A Unified Interface for LLM Agent Decision Making

Large language models are increasingly deployed as agents, but reliable agentic behavior requires more than next-token prediction. At inference time, it is preferred that an agent ...

LLM

Papers with Code paper Jul 15

Cura 1T: Specialized Model for Agentic Healthcare

Healthcare spans high-stakes communication, expert reasoning, and workflow execution, yet specialized LLMs that cover these use cases together remain limited. A healthcare model mu...

Agents

Papers with Code paper Jul 15

Open-AoE: An Open Egocentric Manipulation Dataset and Toolchain for Embodied Learning

Egocentric videos of human manipulation provide scalable supervision for embodied intelligence, yet existing resources rarely combine low-cost continuous capture, manipulation-leve...

Robotics

Papers with Code paper Jul 15

Diagnosing and Calibrating Tool-Call Boundary Drift in Multi-Teacher On-Policy Distillation

Agentic language models must learn when to call tools, when to consume tool responses, and when to answer directly. This makes multi-teacher on-policy distillation a natural traini...

Papers with Code paper Jul 15

RxBrain: Embodied Cognition Foundation Model with Joint Language-Visual Reasoning and Imagination

Embodied cognition requires agents to connect high-level task reasoning with the physical states to be achieved. We introduce Hy-Embodied-RxBrain, an embodied cognition foundation ...

LLM

Papers with Code paper Jul 15

Hallo4D: Multi-Modal Hallucination Mitigation for Consistent Spatio-Temporal Generation

While recent advances in 3D generation have enabled impressive visual synthesis, existing methods often rely on 2D diffusion supervision without explicit mechanisms for geometric c...