/// AI HUB

#Multimodal

916 articles tagged with Multimodal

GNews news May 28

India well-positioned to become global leader in multilingual, multimodal AI solutions: Google Cloud India MD

India is well-positioned to become a global leader in Artificial Intelligence (AI) innovation, particularly in multilingual and multimodal AI solutions with worldwide relevance, ac...

Google Multimodal
18
GitHub Trending repo May 28

modelstudioai/cli: Official Model Studio CLI (Alibaba Cloud Bailian CLI) built for AI Agent frameworks, exposing models, search, multimodal, and workflow capabilities as structured tool calls.

Multimodal Agents
64
Papers with Code paper May 28

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stal...

Multimodal
21
Papers with Code paper May 28

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may over...

Multimodal
21
Papers with Code paper May 28

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multim...

Multimodal
21
Papers with Code paper May 28

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into ...

Multimodal
21
Papers with Code paper May 28

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generali...

Multimodal Robotics
21
Papers with Code paper May 28

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large lan...

Multimodal
21
Papers with Code paper May 28

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on st...

Multimodal
21
Papers with Code paper May 28

VLM3: Vision Language Models Are Native 3D Learners

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D u...

Multimodal
21
Papers with Code paper May 28

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irr...

Multimodal
21
Papers with Code paper May 28

Linearizing Vision Transformer with Test-Time Training

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains p...

Multimodal
21
Papers with Code paper May 28

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression add...

Multimodal
21
Papers with Code paper May 28

Multimodal Music Recommendation System using LLMs

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semantic or acoustic content. Prior work has exp...

Multimodal
21
NewsData.io news May 27

Sharp Vision 2026 Warning: Market Investigation Exposes Red Flags in the Counterfeit Eye Health Supplement Market

New York City, NY, May 27, 2026 (GLOBE NEWSWIRE) -- Sharp Vision and the Counterfeit Supplement Market Overview The Sharp Vision 2026 warning has emerged as a significant market is...

Multimodal
21
Dev.to tutorial May 27

Chinese AI Models Are 40x Cheaper Than GPT-4o — Here's the Proof

Honestly, when I first saw the numbers I didn't believe them. DeepSeek V4 Flash at $0.25/M output vs...

Multimodal
12
YouTube video May 27

Sam Altman’s AI Vision Sparks Major Backlash Online | PakCan News

Sam Altman says the future is one where intelligence ...

Multimodal
45
Dev.to tutorial May 27

Multimodal AI for Cybersecurity Operations: Practical Use Cases, Local Deployment, and Hard Lessons

A practical security-operations view of multimodal AI for SOC, incident response, phishing triage, cloud reviews, WAF analysis, vulnerability management, audit evidence, and local ...

Multimodal
12
Mastodon discussion May 27

🚀 Fastest-growing AI projects today: 1. The integration of vision and voice components into language models becoming incre...

🚀 Fastest-growing AI projects today: 1. The integration of vision and voice components into language models becoming increasing... 2. The jingyaogong/minimind-o repository stands out ...

Multimodal API
18
Papers with Code paper May 27

From Pixels to Words -- Towards Native One-Vision Models at Scale

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragmen...

Multimodal
21
Papers with Code paper May 27

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. ...

Multimodal Agents
21
Papers with Code paper May 27

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In t...

Multimodal
21
Papers with Code paper May 27

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schem...

Multimodal
21
Mastodon discussion May 26

🚀 Fastest-growing AI projects today: 1. bytedance/Lance a lightweight native unified multimodal model designed for image a...

🚀 Fastest-growing AI projects today: 1. bytedance/Lance a lightweight native unified multimodal model designed for image and vi... 2. With a growth score of 67.30, it appears to be gr...

Multimodal
18