DiffusionGemma: How Google's New Open LLM Hits 1,000 Tokens/sec and Changes Inference Economics

DiffusionGemma generates text up to 4x faster than autoregressive LLMs, hits 1,000+ tokens/sec on a single H100, and runs on a consumer RTX 4090. Here is what changed, what the trade-offs are, and how to deploy it today.

Read Original

Related