Papers | AI Hub

Papers with Code paper 3d ago

OmniVAE: An Audio-Video VAE with Cross-Modal Alignment for Joint Generation

Recent generative models are moving beyond silent video or standalone audio synthesis toward the joint generation of synchronized audio and video. Despite this progress, jointly ge...

Safety/Alignment

Papers with Code paper 3d ago

DriveDNA: A Large-Scale Multimodal Naturalistic Driving Dataset and Benchmark for Driving Style Identification

Driving style captures stable, driver-specific patterns in how a vehicle is driven. In naturalistic data, however, this signal is hard to isolate because drivers are observed in di...

Multimodal Benchmark

Papers with Code paper 3d ago

Characterizing Warp Divergence from Pascal to Blackwell

Since Volta introduced Independent Thread Scheduling (ITS), NVIDIA GPUs have been widely assumed to handle warp divergence in a fixed manner. We test this assumption across Ampere,...

Papers with Code paper 3d ago

A Frozen 12B Beats Frontier Models on Verified Work: 100% Accuracy, 0 Tokens, Bit-Exact, Forever

Improving a language model today means retraining it: enormous compute, a new opaque model each cycle, non-deterministic output. We take the opposite path: the model stays frozen, ...

Papers with Code paper 3d ago

GNM Head: A Generative aNthropometric Model of the human head

Parametric models of the human head are essential tools traditionally used in computer vision and graphics for animation, rendering, and reconstruction. More recently, they serve a...

Papers with Code paper 3d ago

JarvisHub: An Open Harness for Canvas-Native Multimodal Creative Agents

Creative AI is moving from single-step asset generation toward long-horizon multimodal production. Although recent generative models can synthesize high-quality images, videos, aud...

Multimodal

Papers with Code paper 4d ago

Bitcoin Price Direction Prediction via Regime-Aware Multi-Modal Fusion of Social Sentiment and Technical Features

Bitcoin price prediction on sub-daily timescales is a hard open problem in computational finance. Bitcoin exhibits fat-tailed returns, non-stationary dynamics, and a price discover...

Papers with Code paper 4d ago

UltraViT: Latency-Optimized On-device Vision Encoder for Large Vision-Language Models

Large Vision-Language Models (LVLMs) remain bottlenecked by massive computational footprints, precluding their deployment on resource-constrained edge devices. While efforts to com...

Multimodal

Papers with Code paper 4d ago

IndicTalk: A Large-Scale Persona-Based Multilingual Conversational Corpus for Indic Languages

Large Language Models (LLMs) have transformed conversational AI, yet high-quality multilingual code-mixed dialogue resources remain scarce, particularly for Indic languages where s...

Papers with Code paper 4d ago

dRAE: Representation Autoencoder with Hyper-Spherical Codes

In this work, we aim to discretize the high-dimensional visual representations to bridge the gap with language models - a non-trivial challenge, as existing quantization methods su...

Papers with Code paper 5d ago

StateAct: Program State, before Pixels, for Long-Horizon Computer-Use Agents

Computer-use agents are usually improved by strengthening perception: better models for reading a screenshot and choosing where to click. Yet a screenshot is only a lossy rendering...

Papers with Code paper 5d ago

ID-V2V: Identity-Preserving Video Restylization

In visual storytelling, human performances are central to creative intent and narrative meaning. However, preserving human identity and performance while enabling flexible visual e...

Papers with Code paper 5d ago

LAMAR: An Open Language-Aware Multilingual Alignment Reranker

In multilingual retrieval augmented generation, a retriever can retrieve relevant documents written in multiple languages, which are subsequently reranked before answer generation....

Safety/Alignment

Papers with Code paper 5d ago

Scaling Native Multimodal Pre-Training From Scratch

Although large language models (LLMs) exhibit remarkable reasoning capabilities, their reliance on text-only pre-training restricts the perception of the multimodal physical world....

Multimodal

Papers with Code paper 5d ago

SceneActBench: Can Agents Act on the 3D Scenes They See?

Vision-language model (VLM) agents increasingly use tools to act on 3D scenes rather than only describe them. Existing 3D benchmarks score textual responses or single-object operat...

Papers with Code paper 5d ago

Skill Self-Play: Pushing the Frontier of LLM Capability with Co-Evolving Skills

LLM training is shifting from manual design and annotation to interaction-driven self-evolution. However, existing self-evolutionary methods face a fundamental dilemma between task...

LLM

Papers with Code paper 5d ago

Reasoning Denoiser: Denoising Reasoning Traces for Hallucination Detection in Large Reasoning Models

Large reasoning models (LRMs) generate long reasoning traces before producing final answers. While these traces may contain useful signals for hallucination detection, harnessing t...

Papers with Code paper 5d ago

IDEAgent: Agentic Quality-Diversity Search for Research Idea Generation

Large Language Models (LLMs) have significantly automated the process of scientific discovery over the past few years. However, existing systems share one core limitation: they gen...

Agents

Papers with Code paper 5d ago

Leveraging External Knowledge for Historical Document Restoration via Retrieval-Augmented Large Language Models

Historical documents act as invaluable knowledge archives but often suffer from illegibility due to physical deterioration and damage. While existing restoration methods based on m...

Papers with Code paper 5d ago

Spectral Prior for Reducing Exposure Bias in Diffusion Models

Diffusion models typically suffer from error accumulation during iterative sampling, commonly referred to as exposure bias. We reveal systematic frequency-dependent discrepancies b...

Papers with Code paper 6d ago

Self-Supervised Learning of Structured Dynamics from Videos

Understanding motion in video is a fundamental challenge for visual learning, as frame-to-frame change entangles two sources of dynamics: camera motion and object motion. This deco...

Papers with Code paper 6d ago

OpenForgeRL: Train Harness-native Agents in Any Environment

Modern AI agents rely on elaborate inference harnesses such as Claude Code, Codex, and OpenClaw to drive multi-turn reasoning, tool use, and access to external systems. While power...

Anthropic

Papers with Code paper 6d ago

Visual Contrastive Self-Distillation

On-policy self-distillation (OPSD) is promising as it removes the external teacher required by on-policy distillation (OPD), yet it still needs asymmetric information between teach...

Papers with Code paper 6d ago

Closing the Loop: Training-Free Revisit Consistency for Autoregressive Generative Rendering

Recent conditional video generation models have shown promising potential to transform 3D engine renderings, such as depth maps and untextured geometry, into photorealistic videos...