What is Multi-Token Prediction in Gemma 4?

Multi-Token Prediction (MTP) is a speculative-decoding technique Google shipped for Gemma 4 on May 5, 2026. A lightweight drafter model proposes several future tokens in the time the heavy target model takes to produce one, then the target verifies all of them in parallel. Google reports up to 3x decode throughput with no loss in output quality because the verification step is lossless. The win depends on hardware and prompt shape, so expect 1.8x to 3x in practice.

How do I run Gemma 4 on Ollama?

Pull the model with `ollama pull gemma4` for the default 31B dense flagship, or pull the on-device variants with `ollama pull gemma4:e2b` and `ollama pull gemma4:e4b`. MTP speculative decoding landed in Ollama v0.23.1 on the MLX runner for Macs first, with the broader runner support being validated. Pair a target model with a drafter using the new `DRAFT` Modelfile directive, then call the OpenAI-compatible endpoint as you would for any other Ollama model.

How does Gemma 4 compare to Qwen and Llama for local use?

Gemma 4 31B scores 85.2 on MMLU Pro and is competitive on general reasoning. Qwen 3.6 still leads on coding benchmarks (92.1 HumanEval vs 78.5 for Gemma 4 31B, and 73.4 SWE-bench Verified for the 35B-A3B variant). Pick Gemma 4 for latency-sensitive chat and edge deployment where MTP buys you free throughput. Pick Qwen 3.5 or 3.6 when raw code-generation quality is the only thing that matters.

What hardware do I need for Gemma 4 locally?

The E2B model fits in 4 GB of unified RAM or VRAM and runs on any modern phone or laptop. The E4B variant needs 8 GB. The 26B A4B Mixture-of-Experts model needs 20 GB to load but only activates 4B parameters per token, which makes it surprisingly fast on Apple Silicon Studio class hardware. The 31B dense flagship needs roughly 22 GB at Q4_K_M and runs around 58 tokens per second on an RTX PRO 6000 Blackwell, less on consumer cards.

Should I use Gemma 4 or a 32B-class model for production?

If your workload is high-volume cheap inference (classification, RAG synthesis, autocomplete, voice), Gemma 4 with MTP is the right default. The 3x decode speed compounds across every request and lowers the unit cost of every token you serve. For heavy reasoning, complex coding agents, and anything that benefits from frontier capability, keep a 32B-plus model or a cloud API in the loop. The hybrid pattern from the existing Ollama pricing breakdown still applies.

Gemma 4 on Ollama 2026 — 3x Faster MTP & Benchmarks

Google shipped Multi-Token Prediction drafters for Gemma 4 on May 5, 2026. The technique boosts decode throughput up to 3x with no degradation in output quality, and it runs on the same Ollama installation many of you already use for local inference. That combination matters because it changes the unit economics of running open-weight models on commodity hardware. A free 3x speedup is rare. A free 3x speedup that is lossless and ships under Apache 2.0 is rarer.

Subscribe to the newsletter for more local AI deep dives.

What Changed in Gemma 4

Multi-Token Prediction is a flavor of speculative decoding. The idea is simple and old. A small drafter model proposes several future tokens cheaply, then the heavier target model verifies the proposal in a single forward pass. If the drafter guessed right, you collected several tokens for the cost of one. If the drafter guessed wrong, you fall back to the target model and lose nothing in quality.

Google's contribution is to release official drafters trained jointly with each Gemma 4 variant. The 31B dense flagship pairs with google/gemma-4-31B-it-assistant. The 26B A4B Mixture-of-Experts model pairs with google/gemma-4-26B-A4B-it-assistant. The on-device E2B and E4B variants have their own drafters in the same naming scheme. Each drafter is small enough to load alongside the target without consuming meaningful extra memory, which is the whole point.

What does not change is quality. The verification step is mathematically lossless because the target model retains the final say on every accepted token. What does not change is memory footprint either. You still need to fit the target model in VRAM or unified memory before you serve a token. MTP makes existing local hardware faster. It does not let you run a model your hardware could not load before.

Benchmarks: Gemma 4 vs Qwen and Llama on Ollama

Three numbers matter when choosing an open-weight model in 2026. Tokens per second on your hardware. Quality on the benchmarks that match your workload. Memory floor for the size you want to run. Here is the honest picture for the Ollama-on-consumer-hardware audience.

Model	Size	MMLU Pro	HumanEval	Memory (Q4)	RTX PRO 6000 tok/s
Gemma 4 31B	31B dense	85.2	78.5	~22 GB	58
Gemma 4 31B + MTP	31B + drafter	85.2	78.5	~24 GB	~165
Qwen 3.6 (27B class)	27B	84.1	92.1	~18 GB	62
Qwen 3.5 35B-A3B (MoE)	35B / 3B active	83.8	90.4	~24 GB	71
Llama 3.3 70B	70B dense	86.3	80.7	~42 GB	28

A few honest readings from this table. Gemma 4 31B is competitive on general reasoning but trails Qwen 3.6 on HumanEval by roughly fourteen points. MTP closes the throughput gap dramatically. A 165 tokens-per-second decode on a 31B dense model used to require a server-class GPU and a custom serving stack. It now runs on a single workstation card with stock Ollama, no kernel tricks required.

The Apple Silicon picture is brighter than it sounds on paper. The creator of llama.cpp reportedly hit 300 tokens per second with Gemma 4 on a three-year-old Mac Studio. The 26B A4B MoE on Apple Silicon is the under-the-radar pick. It activates only 4B parameters per token, so latency feels closer to a 7B dense model while quality sits near the 31B tier.

If you want the deeper pricing breakdown across Ollama Cloud Pro, Pro Max, and self-hosted GPU rigs, see the Ollama Cloud pricing and hardware guide for the full crossover math.

Running Gemma 4 Locally with Ollama

The basic install is one command. Pull the size you want and Ollama handles the rest:

bash

ollama pull gemma4           # 31B dense flagship
ollama pull gemma4:e2b       # 2.3B effective, edge / mobile
ollama pull gemma4:e4b       # 4.5B effective, laptop sweet spot
ollama pull gemma4:26b-a4b   # 26B MoE, 4B active per token

Memory footprint at Q4_K_M scales roughly with parameter count. The E2B variant fits in 4 GB of unified RAM or VRAM and runs comfortably on a recent iPhone or any modern laptop. E4B doubles that. The 26B A4B model loads 20 GB into memory but the active-parameter math makes throughput feel like a far smaller model. The 31B dense flagship is the production sweet spot at roughly 22 GB.

MTP speculative decoding requires Ollama v0.23.1 or newer. The MLX runner on Mac shipped first. Other runners are being validated as of late May 2026, so check ollama --version before assuming MTP is wired up on your platform. The new Modelfile syntax adds a DRAFT directive for pairing a target with its drafter, plus a --quantize-draft flag on ollama create so the drafter inherits your quantization choice without a separate conversion step.

The OpenAI-compatible API surface is unchanged. Point your existing client at http://localhost:11434/v1, set the model to gemma4, and the rest of your code works as written. That portability is the entire reason Ollama owns this corner of local AI.

Where Gemma 4 Wins

Latency-sensitive surfaces are where MTP earns its keep. Chat UI feels different at 165 tokens per second than at 55. Voice assistants stop waiting on the model and start waiting on the speaker. Autocomplete that streams under 50 milliseconds per chunk crosses the threshold where users stop noticing it exists, which is the point.

Edge and on-device deployment is the second clean win. The E2B and E4B variants ship with the same MTP drafter pattern, which means a phone or a Raspberry Pi 5 can run a useful model at usable speed. Google's AI Edge Gallery distribution makes this turnkey for Android. For desktop developers, the same model file runs unchanged on a laptop and a phone, which is the kind of portability that makes a product roadmap feel cheaper to build.

Batch summarization and RAG synthesis are the third place Gemma 4 fits well. These workloads send long context windows through the model and return short outputs. The 256K context on the 26B and 31B variants covers nearly every retrieval pipeline a small team will build. The MTP speedup compounds across millions of nightly batch jobs in a way that is hard to ignore on a cloud bill or a power bill.

Where It Does Not

Heavy reasoning chains still favor the bigger reasoning-tuned models. If your workload involves long chain-of-thought on novel problems, Claude Opus 4.6 and GPT-5.3-Codex still pull ahead by a noticeable margin. The gap is narrowing every quarter, but for production agents that need to plan four steps ahead and back-track, you want frontier reasoning on the critical path, not Gemma 4.

Coding agents that have to ship working pull requests are the second exception. Qwen 3.6 wins HumanEval by fourteen points and SWE-bench Verified by even more on the A3B MoE variant. For a coding agent doing real work on a real repo, that delta shows up as merged PRs versus stalled drafts. Pick Qwen for the coding loop and reach for Gemma 4 only when you want a fast generalist alongside it.

Multimodal vision workloads currently belong to other model families. Gemma 4 has vision variants in the broader family but the MTP drafter ecosystem is text-first for now. Qwen 2.5-VL, Llama 3.2 Vision, and the Phi-4 multimodal line are the right defaults until Google ships official vision drafters in the Gemma 4 lineup.

A Practical Decision Rule

Here is the short version of the choice, framed as a decision tree.

Building latency-sensitive UI on a laptop or phone. Gemma 4 E2B or E4B with MTP. Free 2-3x speedup, fits in unified RAM, and the on-device deployment story works without a server.
Building a startup MVP that needs a fast general model on a workstation. Gemma 4 31B with MTP on Ollama. Roughly 165 tokens per second on an RTX PRO 6000 Blackwell, half that on a consumer 4090, still fast enough for streaming UX.
Building a coding agent that has to ship PRs. Qwen 3.6 27B or Qwen 3.5 35B-A3B. Skip Gemma 4 here unless you are bundling it as a fast secondary model for low-stakes generations.
Building a hybrid stack with cloud fallback. Pair a local Gemma 4 31B + MTP setup with Ollama Cloud Pro Max as a burst lane. The portability between local and cloud is the entire point of staying inside the Ollama API surface.

Closing Numbers

Gemma 4 31B with MTP serves roughly 165 tokens per second on an RTX PRO 6000 Blackwell and around 300 on a tuned Mac Studio. The 26B A4B Mixture-of-Experts variant punches above its memory footprint thanks to 4B active parameters per token. The on-device E2B and E4B variants give phones and laptops a real text model with a real speedup. None of this costs anything beyond your existing hardware and a fresh ollama pull.

If your AI stack is still paying per-token rates for low-stakes generations, move them to Gemma 4 with MTP this week and measure the result. The 3x speedup is the headline. The Apache 2.0 license, the unchanged API surface, and the same ollama pull command you already type are why this one actually ships into production.

Subscribe for the next deep dive on running production agents on a hybrid local plus cloud stack.

Gemma 4 on Ollama: Multi-Token Prediction, Benchmarks, and Local Setup (May 2026)

What Changed in Gemma 4

Benchmarks: Gemma 4 vs Qwen and Llama on Ollama

Running Gemma 4 Locally with Ollama

Where Gemma 4 Wins

Where It Does Not

A Practical Decision Rule

Related Reading

Closing Numbers

Quantitative Market Reports

About Pooya Golchian

Newsletter

What Changed in Gemma 4

Benchmarks: Gemma 4 vs Qwen and Llama on Ollama

Running Gemma 4 Locally with Ollama

Where Gemma 4 Wins

Where It Does Not

A Practical Decision Rule

Related Reading

Closing Numbers

Quantitative Market Reports

About Pooya Golchian

Get practical AI and engineering playbooks

Newsletter