#Multimodal | AI Hub

Papers with Code paper Apr 1

Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explore...

Multimodal Agents

21

Papers with Code paper Apr 1

All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilitie...

Multimodal

21

NewsData.io news Apr 1

IIT Mandi Celebrates 17th Foundation Day, Reinforcing Its Vision of Nation-Building with Innovation & Research

New Delhi: The Indian Institute of Technology (IIT) Mandi, one of India’s leading IITs, celebrated its 17th Foundation Day by bringing together eminent leaders from industry, acade...

Multimodal

21

Papers with Code paper Apr 1

LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference...

Multimodal

21

Mastodon discussion Apr 1

SpatialPoint, a framework from Visincept, Tsinghua University and IDEA, integrates depth data as a core input for vision-language models, enabling robots to generate precise 3D coo...

Multimodal

24

Mastodon discussion Mar 31

Alibaba Keeps Qwen3.5-Omni Closed, Breaks Open-Source Streak

https://winbuzzer.com/2026/03/31/alibaba-qwen35-omni-closed-source-multimodal-ai-xcxwbn/ #AI #AudioAI #Alibaba #Qwen35Omn...

Multimodal

18

Mastodon discussion Mar 31

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

https://huggingface.co/blog/ibm-granite/granite-4-vision (AI-generated automated post: headline + link) #AI #生成AI #LLM #AIGenerated

Hugging Face Multimodal

24

Mastodon discussion Mar 31

I'm organizing a conference again this year! The OpenCV-SID Conference on Computer Vision & AI is this May 4th in Los Angeles. I have speakers from Disney, Bonsai Robotics, Code19 ...

Multimodal

24

AI Blogs (RSS) news Mar 31

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Multimodal

24

GitHub Trending repo Mar 31

Parth-Badgujar/diffusion-immunization: Official implementation of paper "One Noise to Fool Them All: Universal Adversarial Defenses Against Image Editing", accepted at The 5th Workshop of Adversarial Machine Learning on Computer Vision: Foundation Models + X, IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) 2025

Multimodal

35

Mastodon discussion Mar 31

📰 Prescription Smart Glasses 2026: Meta’s Ray-Ban AI Eyewear for Vision Correction

Meta’s new prescription smart glasses combine AI-powered features with stylish design, making them...

Multimodal

18

Mastodon discussion Mar 31

📰 Qwen 3.5 Omni (2026): Alibaba's AI Sees, Hears, and Codes in Real Time

Qwen 3.5 Omni, Alibaba’s latest multimodal AI, can interpret visual and audio inputs to explain papers and w...

Multimodal

9

Dev.to tutorial Mar 31

Building a Real-Time Security Camera System with Local Vision LLMs

I replaced my Lorex NVR's motion detection — which alerted me 40 times a day about swaying trees and...

Multimodal

12

Mastodon discussion Mar 31

Alibaba's Qwen team unveiled Qwen3.5 Omni, a native multimodal model processing text, audio and video in a single pipeline. The flagship Plus model beats Gemini 3.1 Pro in audio un...

Google Multimodal

9

Papers with Code paper Mar 31

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, the...

Multimodal

21

Mastodon discussion Mar 31

📰 Qwen3.5-Omni: Alibaba’s 2026 Multimodal AI Breakthrough in Vision, Audio & Coding

Alibaba's Qwen3.5-Omni represents a leap in multimodal AI, simultaneously processing visual, audi...

Google Multimodal

9

Mastodon discussion Mar 31

📰 The Revolutionary AI of 2026: See, Hear, and Code with Qwen3.5-Omni | Alibaba Multimodal AI

Qwen3.5-Omni, Alibaba's next-generation multimodal AI, like the human brain, simultaneously...

Multimodal

9

Mastodon discussion Mar 31

Alibaba's Qwen team has released Qwen3.5-Omni, a fully multimodal AI model supporting text, image, audio and video inputs and outputs. The model achieves 215 state-of-the-art resul...

Multimodal

27

GitHub Trending repo Mar 31

jd-opensource/JoyAI-Image: JoyAI-Image is the unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing.

LLM Image Generation Multimodal

46

GNews news Mar 31

Trump shares 'North Korean-style' vision for 'gold vomit' presidential library

The AI-generated video showed a giant gold-covered portico, a lobby with US military aircraft on display and an auditorium with a giant gold statue of Trum...

Multimodal

18

Mastodon discussion Mar 31

📰 Qwen3.5-Omni Beats Gemini-3.1 Pro in 2026 Multimodal AI Benchmark, Costs 90% Lower

Qwen3.5-Omni, Alibaba's latest multimodal AI model, delivers superior performance across benchm...

Google Multimodal Benchmark

9

Mastodon discussion Mar 31

📰 Qwen3.5 Omni 2026: The Native Multimodal AI That Outperforms Gemini

Qwen3.5 Omni, Alibaba’s latest AI model, sets a new standard in native multimodal intelligence by seamlessly in...

Google Multimodal

9

NewsData.io news Mar 31

Machine Vision Market Outlook 2026-2030: Automation Trends And Industrial Applications

(MENAFN - EIN Presswire) EINPresswire/ -- "The machine vision market is dominated by a mix of global industrial automation solution providers and specialized imaging technology com...

Multimodal

21

NewsData.io news Mar 31

Shandong Extreme Vision Technology Co., Ltd.: From Fragmentation to Scale: Extreme Vision Bridges the B2B AI Chasm with Platform + Ecosystem

HONG KONG, Mar 31, 2026 - (ACN Newswire) - Bringing artificial intelligence from the laboratory to a broad spectrum of industries, particularly in the B2B market, demands that AI com...

Multimodal

21