#Multimodal | AI Hub

Papers with Code paper Jul 4

Attending to Multimodal Generation One Token at a Time

Multimodal large language models (MLLMs) generate responses autoregressively, integrating visual and linguistic information in an evolving context. Prior work on interpretability h...

Multimodal

21

Mastodon discussion Jul 3

Can GPT-4o give a robot feelings? :) Coralie Deplanne and Anne-Charlotte Passanisi recorded 80+ emotions (motion + sound)...

Can GPT-4o give a robot feelings? :) Coralie Deplanne and Anne-Charlotte Passanisi recorded 80+ emotions (motion + sound) by teleoperating Reachy Mini, true artists. Then we plugged...

Multimodal Robotics

9

GitHub Trending repo Jul 3

ridhima183/AIML--learning-journey: Google AI Essentials covered AI fundamentals, machine learning basics, productivity tools, prompting techniques, and responsible AI use across 5 modules. Google Prompting Essentials built on this with a structured 5-step prompting framework — task, context, references, evaluate, iterate — plus advanced techniques like prompt chaining and multimodal

Google AI Essentials covered AI fundamentals, machine learning basics, productivity tools, prompting techniques, and responsible AI use across 5 modules. Google Prompting Essential...

Google Multimodal

41

GitHub Trending repo Jul 3

SiriNandinii/Dog-Breed-Classification: tensorflow keras deep-learning computer-vision image-classification transfer-learning xception cnn machine-learning python kaggle image-recognition

Multimodal

41

GNews news Jul 2

PM Modi urges Japanese firms to double presence in India, unveils vision for strategic tech alliance

Speaking at the India-Japan Joint Economic Forum, Modi said India and Japan had agreed to strengthen cooperation in strategic sectors, with fresh agreements signed on economic secu...

Multimodal

18

Mastodon discussion Jul 2

NVIDIA releases Llama-3.1-Nemotron-70B-Instruct, surpassing GPT-4o and Claude 3.5 Sonnet on benchmarks | Sky Tech Blog https://www.y...

NVIDIA releases Llama-3.1-Nemotron-70B-Instruct, surpassing GPT-4o and Claude 3.5 Sonnet on benchmarks | Sky Tech Blog https://www.yayafa.com/2834651/ #AgenticAi #AI #ArtificialGeneralIntellig...

Anthropic Meta NVIDIA

9

Papers with Code paper Jul 2

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot a...

Multimodal Benchmark

21

Mastodon discussion Jul 1

🎆 Independence Day Sale: 50% off ALL courses at OpenCV University, today only. Learn Computer Vision, Deep Learning, PyT...

🎆 Independence Day Sale: 50% off ALL courses at OpenCV University, today only. Learn Computer Vision, Deep Learning, PyTorch, TensorFlow, and Generative AI from the team behind Ope...

Multimodal

9

Mastodon discussion Jul 1

What happens if you attach a multimodal LLM to a small open-source robot? Runs on Reachy Mini, or just in the simulator w...

What happens if you attach a multimodal LLM to a small open-source robot? Runs on Reachy Mini, or just in the simulator without the robot. You install it straight from the Reachy Mi...

LLM Multimodal Open Source

9

NewsData.io news Jul 1

Advanced brain interfacing technology for both touch and vision restoration

Patients with untreatable conditions such as sight loss or loss of motor function could be closer to a viable technology for restoring their lost sense within a faster time frame.

Multimodal

21

GitHub Trending repo Jul 1

cheerupzhu/PPM_VLA: This work presents an efficient physics-based Vision-Language-Action (VLA) approach that integrates Vision-Language Models (VLMs) with diffusion models to generate trajectory predictions with enhanced physical realism.

This work presents an efficient physics-based Vision-Language-Action (VLA) approach that integrates Vision-Language Models (VLMs) with diffusion models to generate trajectory predi...

Multimodal

35

GitHub Trending repo Jul 1

AminKhorramii/shade: Browser-native WGSL shader editor with multimodal ML — MediaPipe, depth, diffusion, compute particles, all client-side.

Multimodal

38

Dev.to tutorial Jul 1

I Built a Free API That Detects Phishing Sites Using AI Vision - And It Catches Prompt Injection Too

Most phishing detection APIs check URL reputation databases. The problem? Brand new phishing sites...

Multimodal API

12

Papers with Code paper Jul 1

Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuanc...

Multimodal

21

Papers with Code paper Jul 1

MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering

As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety. While unimodal ...

Multimodal

21

Mastodon discussion Jun 30

Dogfirmations with Doofie and Dingus: Disbelief Doofie trusts his abilities, even when uncertainty clouds his vision. #D...

Dogfirmations with Doofie and Dingus: Disbelief Doofie trusts his abilities, even when uncertainty clouds his vision. #Disbelief #Technology #AI #Animals #Photography #Dogs #Nature...

Multimodal

30

NewsData.io news Jun 30

Restoring lost senses: One technology for both artificial vision and touch

Patients with untreatable conditions such as sight loss or loss of motor function could be closer to a viable technology for restoring their lost sense within a faster time frame. ...

Multimodal

21

GNews news Jun 30

Sudhanshu Vats Assumes Office as President of the Bombay Chamber, Unveils Vision for an AI

Mumbai (Maharashtra) [India], June 30: The Bombay Chamber of Commerce & Industry, India's oldest chamber of commerce, ushered in a new chapter at its 190th Annual General Meeting, ...

Multimodal

18

GitHub Trending repo Jun 30

VISION-SJTU/Lapis: [ECCV 2026] Efficient and High-Quality Depth Estimation via Pixel-Space Diffusion with Linear Attention

Multimodal

35

Papers with Code paper Jun 30

MuSViT: A Foundation Vision Model for Sheet Music Representation

Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding ...

Multimodal

21

Papers with Code paper Jun 30

Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-l...

Multimodal

21

Papers with Code paper Jun 30

Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning

Recent multimodal large language models have shown great promise in clinical image reasoning, but existing post-training pipelines remain predominantly outcome-centric, relying on ...

Multimodal

21

Papers with Code paper Jun 30

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Recent work in this paradigm u...

Multimodal

21

GNews news Jun 29

South Korea vision for deeper partnership: Ambassador Lee

South Korea has reaffirmed its commitment to expanding ties with India across strategic sectors, including shipbuilding, artificial intelligence, critical minerals and industrial c...

Multimodal

18