#Multimodal | AI Hub

Papers with Code paper Mar 22

When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-...

Multimodal

21

GitHub Trending repo Mar 20

OranDuanStudy/LRCM: LRCM: Listen to Rhythm, Choose Movements — Autoregressive Multimodal Dance Generation via Diffusion and Mamba

Multimodal

43

Papers with Code paper Mar 20

PersonaVLM: Long-Term Personalized Multimodal LLMs

Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. P...

Multimodal

21

Papers with Code paper Mar 19

SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biase...

Multimodal

21

Papers with Code paper Mar 17

Tabular LLMs for Interpretable Few-Shot Alzheimer's Disease Prediction with Multimodal Biomedical Data

Accurate diagnosis of Alzheimer's disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to...

Multimodal

21

ArXiv paper Mar 17

What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as...

Multimodal

18

ArXiv paper Mar 17

SurgΣ: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence

Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to genera...

Multimodal

18

ArXiv paper Mar 17

WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation

Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as...

Multimodal

18

ArXiv paper Mar 17

IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentati...

LLM Multimodal

18

Papers with Code paper Mar 17

Anticipatory Planning for Multimodal AI Agents

Recent advances in multimodal agents have improved computer-use interaction and tool-usage, yet most existing systems remain reactive, optimizing actions in isolation without reaso...

Multimodal

21

Papers with Code paper Mar 17

ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits ...

Multimodal

21

AI Blogs (RSS) news Mar 16

ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Can LLMs aut...

Multimodal

24

Mastodon discussion Jul 18

OpenAI announces "GPT-4o mini," a high-performance AI model priced 60% lower than before

OpenAI has announced "GPT-4o mini," a new multimodal AI model that combines low cost with strong performance. GPT-4o mini is said to cost 60% less than the earlier GPT-3.5 Turbo while delivering higher performance. This innovative model...

OpenAI Multimodal

9

Mastodon discussion May 13

OpenAI's new AI model "GPT-4o" can analyze text, audio, and photos in real time and respond like a human, making Siri look primitive

OpenAI held an online event today and announced "GPT-4o" (the "o" stands for Omni), a new flagship model that can reason across audio, vision, and text in "real time." Judging by the name, within OpenAI...

OpenAI Multimodal

9