#Multimodal | AI Hub

Papers with Code paper Mar 26

MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation

Vision-Language-Action (VLA) models aim to control robots for manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregres...

Multimodal

21

ArXiv paper Mar 26

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., ...

Multimodal

18

ArXiv paper Mar 26

LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models re...

Multimodal

18

ArXiv paper Mar 26

PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for imag...

Multimodal

18

ArXiv paper Mar 26

Multimodal Dataset Distillation via Phased Teacher Models

Multimodal dataset distillation aims to construct compact synthetic datasets that enable efficient compression and knowledge transfer from large-scale image-text data. However, exi...

Multimodal

18

Dev.to tutorial Mar 26

A Serverless Blueprint for Multimodal Video Search on AWS

Originally published on Build With AWS. This design was inspired...

Multimodal

12

Mastodon discussion Mar 26

Browser Automation in 2026: Beyond Simple Scripts

The next generation of web automation uses AI vision models to understand page layouts, proxy rotation for reliability, and multi-s...

Multimodal

18

NewsData.io news Mar 26

Shaping Dreams: A California High School Student’s Vision for Global Cybersecurity

As digital threats continue to grow across the globe, a student-led initiative that started in California is taking steps to instill in young people the knowledge and tools to navi...

Multimodal

21

Papers with Code paper Mar 26

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancem...

LLM Multimodal

21

NewsData.io news Mar 26

STMicroelectronics and Leopard Imaging accelerate robotics vision with NVIDIA Jetson-ready multi-sensor module

Robotics Vision Modules: STMicroelectronics and Leopard Imaging have introduced an all-in-one multimodal vision module for humanoid and other advanced robotics systems. Combining S...

NVIDIA Multimodal

21

Mastodon discussion Mar 26

Why is Alibaba investing 8 trillion yen in AI? The CEO describes an "extraordinary offensive"

https://digiday.jp/modern-retail/alibabas-vision-for-the-agentic-era-comes-into-focus-as-it-targets-100b-in-ai-and-cloud-revenue-over-5-years/#di...

Multimodal Agents

9

Papers with Code paper Mar 26

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied pr...

Multimodal

21

Mastodon discussion Mar 26

“Dense image captioning can be used for a variety of tasks, such as training vision-language and text-to-image models. When applied to user-facing features, it can improve image se...

Image Generation Multimodal

27

Mastodon discussion Mar 26

📰 Vision-Guided Web AI Agents: How Multimodal Reasoning Is Transforming Web Automation in 2026

Vision-guided web AI agents are revolutionizing automated web interaction by using mul...

Multimodal Agents

18

Mastodon discussion Mar 25

📰 Computer Vision Fish Monitoring: How Volunteers Detect 120+ Species with 92% Accuracy in 2026

A new deep learning system developed by MIT Sea Grant and Woodwell Climate Research C...

Multimodal

9

NewsData.io news Mar 25

LigoLab Recaps LigoVerse 2026, Outlines Vision for AI-Powered Laboratory Operations and the Shift to Systems of Action

LigoLab recaps LigoVerse, highlighting AI-powered “systems of action” that automate workflows, reduce bottlenecks, and...

Multimodal

21

Mastodon discussion Mar 25

📰 Top 3 Amazon Bedrock Multimodal Models for Video Insights in 2026

Amazon Bedrock’s multimodal foundation models are revolutionizing video understanding by enabling scalable, cost-...

Multimodal

9

GitHub Trending repo Mar 25

metiu1/pullai-python_libraries

PullAI is an open-source Python library and application designed to make local Artificial Intelligence simple, powerful, and truly multimodal. Unlike existing solutions, PullAI goes beyond text (LLMs), managing the entire generative AI ecosystem directly on your hardware.

Multimodal Open Source AI Hardware

35

GitHub Trending repo Mar 25

metiu1/Vortelio-python_libraries

PullAI is an open-source Python library and application designed to make local Artificial Intelligence simple, powerful, and truly multimodal. Unlike existing solutions, PullAI goes beyond text (LLMs), managing the entire generative AI ecosystem directly on your hardware.

Multimodal Open Source AI Hardware

35

ArXiv paper Mar 25

TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cl...

Multimodal

18

ArXiv paper Mar 25

Vision-Language Models vs Human: Perceptual Image Quality Assessment

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. W...

Multimodal

18

NewsData.io news Mar 25

A people-first vision for the future of work in the age of AI

Authors discuss how to reimagine work in the age of AI to reverse its degradation and protect the role of people in the workplace.

Multimodal

21

ArXiv paper Mar 25

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In...

Multimodal

18

Mastodon discussion Mar 25

What are people exploring in AI right now?

At #ArcofAI, sessions dive into AI-enabled apps, multimodal systems, AI-powered workflows, and responsible AI. Take a look at some of the t...

Multimodal

27