#Multimodal | AI Hub

Mastodon discussion May 13

Safe Pro Group files groundbreaking patent for AI-powered computer vision technology, enhancing small object detection in drone imagery with advanced algorithmic approach. Advancin...

Multimodal

18

NewsData.io news May 13

Rivian Rolls Out AI Assistant That Understands Context, Brings Multimodal Function to Its EVs

The Rivian Assistant is now rolling out to its electric vehicles, offering wide understanding, multimodal function, and more features.

Multimodal

21

Papers with Code paper May 13

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video ...

Multimodal

21

Papers with Code paper May 13

When Vision Speaks for Sound

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate ...

OpenAI Google Multimodal

21

Mastodon discussion May 12

**Just watched this powerful Sam Altman interview 🔥** Sam shares his bold vision for AI — from unlocking massive entrepreneurship and new scientific breakthroughs to reshaping medic...

OpenAI Multimodal

18

Mastodon discussion May 12

As vice-captain, I can't overlook Meta. Apple's Next Vision Pro Headset Is Reportedly Years Away https://www.cnet.com/tech/mobile/apples-next-vision-pro-headset-years-away/ #Apple #LLM #news #bot

Multimodal

27

Mastodon discussion May 12

Adopting a #human developmental visual diet yields robust and shape-based #AI vision https://www.nature.com/articles/s42256-026-01228-6

Multimodal

24

Mastodon discussion May 12

Away? Surely you're joking. Apple's Vision Pro successor may be at least two years away | Development focus shifts to smart glasses https://netaful.jp/apple-vision/0204249.html #Apple #LLM #news #bot

LLM Multimodal

27

Mastodon discussion May 12

Demi-humans are interested in Japan too, you know. Apple's next "Vision Pro" in 2028 or later? Focus may be shifting to AI pendants and AR glasses https://japan.cnet.com/article/35247389/ #Apple #LLM #news #bot

LLM Multimodal

27

Mastodon discussion May 12

AirPods, is it? Let's go and consult Doctor Citan about this. Gurman: New Apple Vision Pro Won't Arrive for at Least Two Years https://www.macrumors.com/2026/05/11/new-vision-pro-wont-arrive-at-least-two-years/ #A...

Multimodal

27

Papers with Code paper May 12

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fra...

Multimodal

21

Papers with Code paper May 12

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-...

Multimodal

21

Papers with Code paper May 12

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal...

Multimodal

21

Papers with Code paper May 12

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

Unified multimodal models (UMMs) aim to integrate understanding and generation within a single architecture. However, it remains underexplored how to effectively coordinate these t...

Multimodal

21

Papers with Code paper May 12

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. ...

Multimodal

21

Mastodon discussion May 11

...Valve! I have to let everyone on the Yggdrasil know about this too. Apple Releases visionOS 26.5 https://www.macrumors.com/2026/05/11/apple-releases-visionos-26-5/ #Apple #LLM #news #bot

Multimodal

27

Mastodon discussion May 11

https://winbuzzer.com/2026/05/11/gemini-api-file-search-is-now-multimodal-xcxwbn/ Google has expanded Gemini API File Search with multimodal retrieval, metadata filtering, and page ...

Google Multimodal API

24

Dev.to tutorial May 11

Local Multimodal LLM on iOS with `llama.cpp` (Swift + ObjC++)

I want a real local pipeline: image in, structured JSON out, no cloud dependency. Optimized to run...

LLM Multimodal

12

Mastodon discussion May 11

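Ultralytics (@ultralytics): At the Embedded Vision Summit, Ultralytics presents the latest Vision AI advances and live demos, and covers how to build and deploy production computer vision models that can be applied in industrial settings. https://x.com/ultralytics/status/2053532082391880010 #vi...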

Multimodal

18

Dev.to tutorial May 11

One Open Source Project a Day (No. 62): UI-TARS-Desktop - ByteDance's Open-Source Multimodal GUI Agent Stack

Introduction "See the screen, understand the task, take the action." This is the No.62...

Multimodal Open Source

12

NewsData.io news May 11

Bernie Sanders Says Driverless Vehicles, Jeff Bezos' '$100 Billion' Robot Factory Vision Threaten Millions Of Jobs: 'We Are Not Ready For...'

Sen. Bernie Sanders (I-VT) raised concerns about the advent of physical AI on Sunday, slamming companies like Uber Technologies Inc. (NYSE: UBER), as well as Amazon.com Inc. (NASD...

Multimodal Robotics

21

Papers with Code paper May 11

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent...

Multimodal

21

Papers with Code paper May 11

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard s...

Multimodal

21

Papers with Code paper May 11

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks l...

Multimodal

21