Google shipped Multi-Token Prediction drafters for Gemma 4 on May 5, 2026. The technique boosts decode throughput up to 3x with no degradation in output quality, and it runs on the same Ollama installation many of you already use for local inference. That combination matters because it changes the unit economics of running open-weight models on commodity hardware. A free 3x speedup is rare. A free 3x speedup that is lossless and ships under Apache 2.0 is rarer.
Subscribe to the newsletter for more local AI deep dives.
What Changed in Gemma 4
Multi-Token Prediction is a flavor of speculative decoding. The idea is simple and old. A small drafter model proposes several future tokens cheaply, then the heavier target model verifies the proposal in a single forward pass. If the drafter guessed right, you collected several tokens for the cost of one. If the drafter guessed wrong, you fall back to the target model and lose nothing in quality.
Google's contribution is to release official drafters trained jointly with each Gemma 4 variant. The 31B dense flagship pairs with google/gemma-4-31B-it-assistant. The 26B A4B Mixture-of-Experts model pairs with google/gemma-4-26B-A4B-it-assistant. The on-device E2B and E4B variants have their own drafters in the same naming scheme. Each drafter is small enough to load alongside the target without consuming meaningful extra memory, which is the whole point.
What does not change is quality. The verification step is mathematically lossless because the target model retains the final say on every accepted token. What does not change is memory footprint either. You still need to fit the target model in VRAM or unified memory before you serve a token. MTP makes existing local hardware faster. It does not let you run a model your hardware could not load before.
Benchmarks: Gemma 4 vs Qwen and Llama on Ollama
Three numbers matter when choosing an open-weight model in 2026. Tokens per second on your hardware. Quality on the benchmarks that match your workload. Memory floor for the size you want to run. Here is the honest picture for the Ollama-on-consumer-hardware audience.
| Model | Size | MMLU Pro | HumanEval | Memory (Q4) | RTX PRO 6000 tok/s |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B dense | 85.2 | 78.5 | ~22 GB | 58 |
| Gemma 4 31B + MTP | 31B + drafter | 85.2 | 78.5 | ~24 GB | ~165 |
| Qwen 3.6 (27B class) | 27B | 84.1 | 92.1 | ~18 GB | 62 |
| Qwen 3.5 35B-A3B (MoE) | 35B / 3B active | 83.8 | 90.4 | ~24 GB | 71 |
| Llama 3.3 70B | 70B dense | 86.3 | 80.7 | ~42 GB | 28 |
A few honest readings from this table. Gemma 4 31B is competitive on general reasoning but trails Qwen 3.6 on HumanEval by roughly fourteen points. MTP closes the throughput gap dramatically. A 165 tokens-per-second decode on a 31B dense model used to require a server-class GPU and a custom serving stack. It now runs on a single workstation card with stock Ollama, no kernel tricks required.
The Apple Silicon picture is brighter than it sounds on paper. The creator of llama.cpp reportedly hit 300 tokens per second with Gemma 4 on a three-year-old Mac Studio. The 26B A4B MoE on Apple Silicon is the under-the-radar pick. It activates only 4B parameters per token, so latency feels closer to a 7B dense model while quality sits near the 31B tier.
If you want the deeper pricing breakdown across Ollama Cloud Pro, Pro Max, and self-hosted GPU rigs, see the Ollama Cloud pricing and hardware guide for the full crossover math.
Running Gemma 4 Locally with Ollama
The basic install is one command. Pull the size you want and Ollama handles the rest:
ollama pull gemma4 # 31B dense flagship
ollama pull gemma4:e2b # 2.3B effective, edge / mobile
ollama pull gemma4:e4b # 4.5B effective, laptop sweet spot
ollama pull gemma4:26b-a4b # 26B MoE, 4B active per tokenMemory footprint at Q4_K_M scales roughly with parameter count. The E2B variant fits in 4 GB of unified RAM or VRAM and runs comfortably on a recent iPhone or any modern laptop. E4B doubles that. The 26B A4B model loads 20 GB into memory but the active-parameter math makes throughput feel like a far smaller model. The 31B dense flagship is the production sweet spot at roughly 22 GB.
MTP speculative decoding requires Ollama v0.23.1 or newer. The MLX runner on Mac shipped first. Other runners are being validated as of late May 2026, so check ollama --version before assuming MTP is wired up on your platform. The new Modelfile syntax adds a DRAFT directive for pairing a target with its drafter, plus a --quantize-draft flag on ollama create so the drafter inherits your quantization choice without a separate conversion step.
The OpenAI-compatible API surface is unchanged. Point your existing client at http://localhost:11434/v1, set the model to gemma4, and the rest of your code works as written. That portability is the entire reason Ollama owns this corner of local AI.
Where Gemma 4 Wins
Latency-sensitive surfaces are where MTP earns its keep. Chat UI feels different at 165 tokens per second than at 55. Voice assistants stop waiting on the model and start waiting on the speaker. Autocomplete that streams under 50 milliseconds per chunk crosses the threshold where users stop noticing it exists, which is the point.
Edge and on-device deployment is the second clean win. The E2B and E4B variants ship with the same MTP drafter pattern, which means a phone or a Raspberry Pi 5 can run a useful model at usable speed. Google's AI Edge Gallery distribution makes this turnkey for Android. For desktop developers, the same model file runs unchanged on a laptop and a phone, which is the kind of portability that makes a product roadmap feel cheaper to build.
Batch summarization and RAG synthesis are the third place Gemma 4 fits well. These workloads send long context windows through the model and return short outputs. The 256K context on the 26B and 31B variants covers nearly every retrieval pipeline a small team will build. The MTP speedup compounds across millions of nightly batch jobs in a way that is hard to ignore on a cloud bill or a power bill.
Where It Does Not
Heavy reasoning chains still favor the bigger reasoning-tuned models. If your workload involves long chain-of-thought on novel problems, Claude Opus 4.6 and GPT-5.3-Codex still pull ahead by a noticeable margin. The gap is narrowing every quarter, but for production agents that need to plan four steps ahead and back-track, you want frontier reasoning on the critical path, not Gemma 4.
Coding agents that have to ship working pull requests are the second exception. Qwen 3.6 wins HumanEval by fourteen points and SWE-bench Verified by even more on the A3B MoE variant. For a coding agent doing real work on a real repo, that delta shows up as merged PRs versus stalled drafts. Pick Qwen for the coding loop and reach for Gemma 4 only when you want a fast generalist alongside it.
Multimodal vision workloads currently belong to other model families. Gemma 4 has vision variants in the broader family but the MTP drafter ecosystem is text-first for now. Qwen 2.5-VL, Llama 3.2 Vision, and the Phi-4 multimodal line are the right defaults until Google ships official vision drafters in the Gemma 4 lineup.
A Practical Decision Rule
Here is the short version of the choice, framed as a decision tree.
- Building latency-sensitive UI on a laptop or phone. Gemma 4 E2B or E4B with MTP. Free 2-3x speedup, fits in unified RAM, and the on-device deployment story works without a server.
- Building a startup MVP that needs a fast general model on a workstation. Gemma 4 31B with MTP on Ollama. Roughly 165 tokens per second on an RTX PRO 6000 Blackwell, half that on a consumer 4090, still fast enough for streaming UX.
- Building a coding agent that has to ship PRs. Qwen 3.6 27B or Qwen 3.5 35B-A3B. Skip Gemma 4 here unless you are bundling it as a fast secondary model for low-stakes generations.
- Building a hybrid stack with cloud fallback. Pair a local Gemma 4 31B + MTP setup with Ollama Cloud Pro Max as a burst lane. The portability between local and cloud is the entire point of staying inside the Ollama API surface.
Related Reading
For the deeper Ollama economics question of when self-hosting beats Cloud Pro Max, see Ollama Cloud Pricing and Hardware Requirements 2026. For the coding-model showdown that motivates the Qwen recommendation above, see Local AI Coding Models on Ollama: Qwen, DeepSeek, and the 2026 Landscape. For an agent framework that pairs cleanly with Gemma 4 on a local runtime, see Hermes Agent + Ollama: Building a Local AI Agent in 2026. For the broader picture of agent frameworks running on local LLMs, see AI Agents and Frameworks on Local LLMs.
Closing Numbers
Gemma 4 31B with MTP serves roughly 165 tokens per second on an RTX PRO 6000 Blackwell and around 300 on a tuned Mac Studio. The 26B A4B Mixture-of-Experts variant punches above its memory footprint thanks to 4B active parameters per token. The on-device E2B and E4B variants give phones and laptops a real text model with a real speedup. None of this costs anything beyond your existing hardware and a fresh ollama pull.
If your AI stack is still paying per-token rates for low-stakes generations, move them to Gemma 4 with MTP this week and measure the result. The 3x speedup is the headline. The Apache 2.0 license, the unchanged API surface, and the same ollama pull command you already type are why this one actually ships into production.
Subscribe for the next deep dive on running production agents on a hybrid local plus cloud stack.
