Papers | AI Hub

Papers with Code paper Jul 14

Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering

Despite the success of Vision-Language Models (VLMs), misleading charts remain a significant challenge due to their deceptive visual structures and distorted data representations. ...

Agents

Papers with Code paper Jul 14

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets...

Papers with Code paper Jul 14

Self-Improvements in Modern Agentic Systems: A Survey

Self-improving autonomous agents are moving from research prototypes to deployed systems. The primary goal is controllable evolution, or adaptation, from experience with minimal or...

Agents

Papers with Code paper Jul 14

VisCo: Leveraging Large Language Models as Intrinsic Encoders for Visual Token Compression

Vision-language models (VLMs) process large numbers of visual tokens, resulting in substantial inference latency and memory overhead. This has motivated extensive research on visua...

Papers with Code paper Jul 14

Color Pass-Through via Camera-Display Coupling

When a real-world scene is captured by a smartphone camera and viewed on its screen, the displayed image often differs noticeably from the original scene in color, brightness, and ...

Papers with Code paper Jul 14

ReflectWorld-MM: An Entity-Oriented Multimodal Memory System for Open-Ended Video Streams

Building assistants that can continually watch the world, remember what they see, and reason over their accumulated experience is a long-standing goal, and recently multimodal agen...

Multimodal

Papers with Code paper Jul 14

Rethinking the Evaluation of Harness Evolution for Agents

We revisit the evaluation of automatic harness evolution for LLM agents. Existing harness evolution methods use unit test cases to search for harness configurations and then report...

Papers with Code paper Jul 14

Function-Aware Fill-in-the-Middle as Mid-Training for Coding Agent Foundation Models

Coding agents must integrate external tool returns into ongoing reasoning - a capability that standard left-to-right pretraining on code exposes only in its forward direction. We o...

Papers with Code paper Jul 14

UniVR: Thinking in Visual Space for Unified Visual Reasoning

Learning broad world knowledge directly from raw visual data is a fundamental capability of intelligence. We introduce UniVR, the first investigation into simultaneously learning c...

Papers with Code paper Jul 14

Self in Space: Benchmarking Self-Awareness and Spatial Cognition in UAV Embodied Intelligence

Autonomous UAV systems increasingly rely on multimodal large language models (MLLMs) to operate in complex real-world environments. Such embodied scenarios require not only underst...

Papers with Code paper Jul 14

Edge-Aware Thermal Infrared UAV Swarm Tracking

Thermal infrared (TIR) imaging is essential for UAV swarm operations in visually degraded environments. However, tracking tiny UAVs remains challenging due to limited appearance cu...

Papers with Code paper Jul 14

PalmClaw: A Native On-Device Agent Framework for Mobile Phones

Large Language Model (LLM) agents have moved beyond generating responses to executing multi-step tasks by calling tools, observing the results, and iteratively deciding the next ac...

Papers with Code paper Jul 13

Metacognition in LLMs: Foundations, Progress, and Opportunities

Metacognition is a foundational component of intelligence critical to effective learning, problem solving, decision-making, communication, and more. In recent years, it has become ...

Papers with Code paper Jul 13

Evidence-Backed Video Question Answering

Current Video Large Language Models (Video LLMs) excel in question answering (QA) but largely operate as black boxes, providing textual answers without verifiable visual grounding....

Papers with Code paper Jul 13

Xiaomi-Robotics-U0: Unified Embodied Synthesis with World Foundation Model

Recent foundation image and video generation models offer strong generalization and controllability, but their direct application to embodied scenarios is limited by requirements f...

LLM

Papers with Code paper Jul 13

MonkeyOCRv2: A Visual-Text Foundation Model for Document AI

Mainstream visual encoders are pretrained on natural images and cannot be effectively applied to document images without document-oriented adaptation, as dense text and fine-graine...

LLM

Papers with Code paper Jul 13

Vinci2: Providing Proactive Assistance in Continuous Egocentric Videos

When should an intelligent assistant speak up without being asked? Continuous egocentric video offers rich, evolving context that enables a new form of assistance: one that is proa...

Papers with Code paper Jul 13

Multi-Agent LLMs Fail to Explore Each Other

Exploration is essential for reliable autonomy in multi-agent systems, yet it remains unclear whether large language model (LLM) agents can explore effectively when interacting wit...

Papers with Code paper Jul 13

AsySplat: Efficient Asymmetric 3D Gaussian Splatting for Long-Sequence Scene Modeling

Recent generalizable 3D Gaussian Splatting models have advanced long-sequence novel view synthesis (NVS), but at the cost of substantial redundant computation. We identify that the...

Papers with Code paper Jul 13

MetaView: Monocular Novel View Synthesis with Scale-Aware Implicit Geometry Priors

Current visual generation models are capable of producing high-quality content, yet they lack a coherent perception of the spatial structure. Existing generative novel view synthes...

Papers with Code paper Jul 13

A Vocabulary for Multi-Agent Automated Research Systems

We introduce a vocabulary for automated research systems built from one or more agents to make their design choices easier to describe and compare. The vocabulary specifies 1) who ...

Papers with Code paper Jul 13

Latent-Identity Tuning in Text-to-Image Personalization Models

Generating and editing a person's face demands high precision, as even minor modifications can significantly alter a subject's perceived identity. Current personalization and editi...

Image Generation

Papers with Code paper Jul 13

See like a Robot: Robot-Centric Pointmaps for Vision-Language-Action Models

Vision-language-action (VLA) models predict robot actions from visual observations and language instructions. These actions are defined in the robot's own 3D coordinate frame, yet ...

Multimodal, Robotics

Papers with Code paper Jul 13

MAGIC: Transition-Aware Generation of Navigable Multi-Scene Game Worlds with Large Language Models

Multi-scene navigation (clearing an objective in one bounded space and then crossing a portal into the next) is a defining feature of contemporary 3D games, but authoring it is lab...