Google unveils DiffusionGemma, delivering up to 4x faster inference on dedicated GPUs

Google has released DiffusionGemma, an experimental open model that uses text diffusion to speed up text generation. Instead of producing output one token at a time, as conventional large language models do, DiffusionGemma generates entire blocks of text at once. The result is a significant speedup: the model delivers up to four times faster output on dedicated GPUs, reaching 1,000 tokens per second on an NVIDIA H100 and 700 tokens per second on a GeForce RTX 5090. Building on the Gemma 4 architecture and Gemini Diffusion research, DiffusionGemma introduces a novel diffusion head to maximize generation speed. Its 26-billion-parameter Mixture of Experts (MoE) design activates only 3.8 billion parameters during inference, which makes it feasible to run on high-end consumer GPUs with just 18 GB of video memory when quantized. The model generates 256 tokens in parallel with bidirectional attention, so every token can interact with all others, which is especially useful for ...
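To put the memory figure in context, here is a rough back-of-the-envelope sketch of how much space the weights alone would take at different precisions. The parameter counts and the 18 GB figure come from the summary above; the bit widths (16, 8, 4) are assumptions for illustration, since the summary does not say which quantization scheme is used, and the helper names are hypothetical. The estimate also assumes all experts stay resident in memory and ignores activations, caches, and runtime overhead.

```python
# Back-of-the-envelope weight-memory estimate for a 26B-parameter MoE model.
# Values marked "assumed" are illustrative and do not come from the article.

TOTAL_PARAMS = 26e9     # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9   # parameters activated during inference (from the article)
VRAM_BUDGET_GB = 18     # quoted video memory when quantized (from the article)


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return the size of the weights alone in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters that are active during inference."""
    return active / total


if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, not stated in the article
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
        print(f"{bits:>2}-bit weights: {gb:.1f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active share of parameters: {share:.1%}")
```

Under these assumptions, only the 4-bit case lands below the quoted 18 GB, which is consistent with the claim that quantization is what brings the model within reach of consumer GPUs. The active share (roughly 15 percent of the weights) affects compute per token rather than the amount of memory needed to hold the model.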

