#Multimodal | AI Hub

Papers with Code paper Mar 29

On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge....

Multimodal

21

Papers with Code paper Mar 29

Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on clo...

Multimodal

21

Mastodon discussion Mar 28

📰 Tian Gong AI Unveils 2026 Multimodal Model to Dominate Global AI Race

Chinese AI startup Tian Gong AI has unveiled a groundbreaking multimodal model, signaling its entry into the ...

Multimodal

9

Papers with Code paper Mar 28

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Understanding charts requires models to jointly reason over geometric visual patterns, structured numerical data, and natural language -- a capability where current vision-language...

Multimodal

21

Papers with Code paper Mar 28

Structural Graph Probing of Vision-Language Models

Vision-language models (VLMs) achieve strong multimodal performance, yet how computation is organized across populations of neurons remains poorly understood. In this work, we stud...

Multimodal

21

Mastodon discussion Mar 27

What if we got caught up in Juli Clover Apple and took a hit at such a critical time... New Apple Immersive Video of BBC Proms Concert on Apple Vision Pro https://www.macrumors.com/2026/03/27/apple-immersive-video-bbc-proms-conc...

Multimodal

30

Mastodon discussion Mar 27

📰 Claude Mythos Leak 2026: AI Model Smashes GPT-4o Benchmarks in Anthropic Security Breach

A security lapse at Anthropic has exposed Claude Mythos, a newly developed AI model with d...

Anthropic Multimodal

9

Mastodon discussion Mar 27

📰 2026 Vision in Machine Learning: From CNNs to Multimodal AI Systems (LLaVA, JEPA, Vision Transfor...

The future of vision in ML is being reshaped by multimodal models and open-sou...

Multimodal

9

Mastodon discussion Mar 27

📰 The 2026 Future of Vision AI: With LLaVA, Hugging Face, and Vision Transformers, Visual Perception Is Being Re...

In machine learning, vision AI is not just image recognition but the capacity to understand the world ...

Hugging Face Multimodal

9

Mastodon discussion Mar 27

Kojima, this seems worth looking into. Apple begins streaming "Debut at the BBC Proms" as a new Apple Immersive Video https://www.macotakara.jp/Vision/entry-50766.html #Apple #LLM #news #bot

Multimodal

27

Mastodon discussion Mar 27

Google has released Gemini 3.1 Flash Live, a real-time multimodal voice model for AI agents. The system achieves 90ms time-to-first-audio and handles noisy environments, addressing...

Google Multimodal

9

NewsData.io news Mar 27

J. King Kasr Unveils Web4 Vision as Lithosphere Makalu Testnet Activates

The launch positions Lithosphere as a foundation for intelligence-native infrastructure, advancing AI coordination and autonomous execution within decentralized systems.

Multimodal

21

Mastodon discussion Mar 27

📰 China AI 2026: Daily Token Consumption Hits 30 Trillion - How Qiu Xipeng Is Reshaping Multimodal AI

China's AI ecosystem is accelerating as daily token consumption surpasses 30 tr...

Multimodal

18

Mastodon discussion Mar 27

The Architecture of Modern Browser Automation

Reliable web automation needs: recipe engine, AI agent, vision fallback, proxy manager. From deterministic workflows to autonomous expl...

Multimodal Agents

18

Mastodon discussion Mar 27

Google has released Gemini 3.1 Flash Live, a real-time multimodal voice model designed for AI agents. The model processes audio, video, and tool calls natively without the delays o...

Google Multimodal

9

NewsData.io news Mar 27

Saylor’s $1M Bitcoin Vision: Can Aggressive Accumulation Trigger A Supply Shock?

MicroStrategy’s aggressive Bitcoin accumulation in 2026, aiming for 1 million BTC, is reshaping supply dynamics and market structure, accelerating price pressure without constituti...

Multimodal

21

ArXiv paper Mar 26

MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow the...

Multimodal

18

ArXiv paper Mar 26

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge,...

Multimodal Safety/Alignment

18

ArXiv paper Mar 26

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for vi...

Multimodal

18

Mastodon discussion Mar 26

Rance, I wonder if you would take an interest if it were outside. Trying out SHINOBU Kobo's sleep-prevention accessory for Apple Vision Pro, the "KeepOn Ring Mag 2026 model" https://www.macotakara.jp/Vision/entry-50796.html #Apple #LLM #news #bot

Multimodal

27

ArXiv paper Mar 26

VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be trans...

Multimodal

18

ArXiv paper Mar 26

HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models

Achieving human-like spatial intelligence for vision-language models (VLMs) requires inferring 3D structures from 2D observations, recognizing object properties and relations in 3D...

Multimodal

18

Papers with Code paper Mar 26

MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Vision-Language-Action (VLA) models aim to control robots for manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregres...

Multimodal

21

ArXiv paper Mar 26

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., ...

Multimodal

18