Ollama v0.31.1 Boosts Local AI Performance on Apple Silicon
Ollama v0.31.1 makes Gemma 4 about 90% faster on Apple Silicon via multi-token prediction, advancing local AI performance and privacy.
Local AI Performance Takes a Leap on Apple Silicon
For anyone who runs models on their own machine, the newest Ollama update is worth a close look. Released on June 30, 2026, Ollama v0.31.1 delivers a striking gain in local AI performance: running Gemma 4 on Apple Silicon Macs is now roughly 90% faster on average. That figure was measured on a real coding-agent benchmark rather than a synthetic microtest, which makes it especially meaningful for the developers and researchers who lean on local models day to day.
The headline improvement comes from a technique called multi-token prediction, or MTP. To appreciate why it matters, it helps to understand the bottleneck it addresses.
How Multi-Token Prediction Speeds Things Up
Traditional language-model inference is autoregressive: the model produces one token, feeds it back in, and produces the next, over and over. Each step waits on the one before it, so throughput is limited by how quickly the hardware can complete that serial loop. On a laptop, that serial dependency is often the dominant cost.
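To make the serial dependency concrete, here is a minimal sketch of an autoregressive loop. The `toy_model` function is a hypothetical stand-in for a real model, and the call counter is only there to show that generating n tokens takes n sequential model calls.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[int]], int],
             prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        # Step k cannot start until step k-1 has produced its token,
        # because the new token becomes part of the next input.
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]

def toy_model(context: List[int]) -> int:
    # Hypothetical stand-in for a real model: a deterministic rule over the context.
    return (sum(context) * 3 + len(context)) % 10

calls = 0

def counting_model(context: List[int]) -> int:
    # Wraps the toy model so we can count how many sequential calls were made.
    global calls
    calls += 1
    return toy_model(context)

output = generate(counting_model, [1, 2, 3], 5)
print(len(output), "tokens from", calls, "sequential model calls")
```

With a real model, each of those calls is a full forward pass, which is why the length of the serial chain matters so much for throughput.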
Multi-token prediction changes the pattern. Instead of committing to a single token per step, MTP drafts several candidate tokens at once and then verifies them together. When the drafts hold up, the model advances several positions in the time it previously took to move one; when a draft fails verification, it is discarded and decoding continues from the last verified token. Ollama's implementation adds a refinement: it auto-tunes the drafting depth at runtime, adjusting how many tokens to speculate based on what the workload supports. MTP ships enabled by default, so users benefit immediately without changing any settings.
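The draft-then-verify idea can be illustrated with a toy sketch. This is not Ollama's implementation: `target_next`, `draft_next`, and the depth-tuning rule are all hypothetical, chosen only to show the control flow. The sketch also assumes that verifying a whole batch of drafts costs about one pass of the full model, which is the premise behind the speedup.

```python
from typing import List, Tuple

VOCAB = 7

def target_next(context: List[int]) -> int:
    # Hypothetical stand-in for the full model (deterministic, greedy).
    return (sum(context) * 31 + len(context)) % VOCAB

def draft_next(context: List[int]) -> int:
    # Hypothetical cheap drafter: deliberately wrong at every 8th position.
    guess = target_next(context)
    return (guess + 1) % VOCAB if len(context) % 8 == 0 else guess

def decode_plain(prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one full-model step per token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]

def decode_drafted(prompt: List[int], n_new: int,
                   max_depth: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    depth, passes = 1, 0
    while len(out) - len(prompt) < n_new:
        # 1) Draft `depth` candidate tokens with the cheap drafter.
        ctx, drafts = list(out), []
        for _ in range(depth):
            drafts.append(draft_next(ctx))
            ctx.append(drafts[-1])
        # 2) Verify the drafts against the target. Counted as ONE pass here
        #    (an assumption; this toy calls the target once per position).
        passes += 1
        accepted = 0
        for guess in drafts:
            truth = target_next(out)
            out.append(truth)        # always commit the target's own token
            if guess != truth:
                break                # discard the remaining drafts
            accepted += 1
        # 3) Toy depth tuning: speculate further after a clean run,
        #    fall back after a rejection.
        if accepted == len(drafts):
            depth = min(max_depth, depth + 1)
        else:
            depth = max(1, accepted)
    return out[len(prompt):len(prompt) + n_new], passes

plain = decode_plain([1, 2, 3], 20)
fast, passes = decode_drafted([1, 2, 3], 20)
print("identical output:", fast == plain, "| verification passes:", passes)
```

Because every committed token comes from the target itself, the drafted decoder's output matches the token-by-token baseline by construction; the drafts only determine how many tokens each verification pass can confirm.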
Just as important is what MTP does not change. The technique preserves exact model output, meaning the text you get is identical to what the model would have produced token by token. There is no quality tradeoff here, no quantization surprise, no drift. You get the same answers, sooner.
Better Engines and a Broader Model Library
The release extends beyond MTP. Ollama refined its MLX engine with a new small-batch matrix-multiplication kernel, a low-level optimization that helps the common case of single-user, interactive sessions where batches are small. The team also updated the underlying llama.cpp engine, keeping the broader compatibility layer current.
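A rough back-of-the-envelope estimate shows why small batches are a distinct case worth optimizing. The sketch below is a general estimate, not a description of Ollama's kernel; the 4096 x 4096 layer shape and 2-byte elements are assumptions picked for illustration.

```python
def arithmetic_intensity(batch: int, k: int, n: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for Y = X @ W, with X (batch x k) and W (k x n)."""
    flops = 2 * batch * k * n                      # one multiply + one add per term
    bytes_moved = bytes_per_elem * (batch * k      # read X
                                    + k * n        # read W
                                    + batch * n)   # write Y
    return flops / bytes_moved

# Hypothetical layer shape; the batch sizes are illustrative.
for b in (1, 4, 64):
    print(f"batch={b:>2}: {arithmetic_intensity(b, 4096, 4096):.2f} FLOPs/byte")
```

At batch size 1 the work per byte of weights read is close to 1 FLOP, so the operation is dominated by memory traffic rather than arithmetic. That is the regime an interactive, single-user session lives in, and a kernel tuned for large batches is not necessarily a good fit for it.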
On top of the speed work, v0.31.1 widens what you can run locally, adding support for more than fifteen new models. Among them are Kimi-K2.6, GLM-5.1, several DeepSeek variants, and additions to the Qwen family. That growing catalog reinforces the core appeal of self-hosted AI: capable models running on hardware you control.
Why Running Models Locally Still Matters
The practical payoff extends past raw speed. Local inference keeps your prompts and data on your own device, which is a meaningful advantage for privacy-sensitive work, offline environments, and anyone who simply prefers not to route every query through a remote service. Historically, the tradeoff was performance: local models felt slower than their cloud-hosted counterparts. Gains like this one narrow that gap considerably.
For a developer running a coding agent on a MacBook, a 90% average speedup can be the difference between a tool that feels sluggish and one that feels genuinely responsive. And because the improvement arrives automatically, the barrier to enjoying it is essentially zero. That combination, faster results with no loss of fidelity and no configuration burden, is exactly the kind of substantive progress that makes self-hosted AI more practical for everyday use.
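To put the number in wall-clock terms, the short calculation below assumes "90% faster" means throughput rises to 1.9 times the baseline, which the sources do not define precisely. The one-minute baseline task is hypothetical.

```python
def time_after_speedup(baseline_seconds: float, percent_faster: float) -> float:
    """Wall-clock time if throughput rises by `percent_faster` percent."""
    return baseline_seconds / (1 + percent_faster / 100)

baseline = 60.0  # hypothetical: a generation task that used to take one minute
new_time = time_after_speedup(baseline, 90)
saved = 100 * (1 - new_time / baseline)
print(f"{new_time:.1f} s instead of {baseline:.0f} s ({saved:.0f}% less waiting)")
```

Under that reading, a 90% throughput gain cuts waiting time by a little under half, not by 90%.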
Sources: Ollama GitHub release notes, June 30, 2026; ExplainX, June 2026; Ollama announcement, June 2026.
