Google DeepMind dropped Gemma 4: four sizes, Apache 2.0, multimodal, and llama.cpp support landing fast enough for local testing. I’m running the 26B A4B variant locally on AMD Ryzen AI Max+ hardware. Here’s what matters.

The lineup

Four models, each available as a base and an instruction-tuned checkpoint:

  • E2B — 2.3B effective params, 5.1B with embeddings, 128K context, audio support
  • E4B — 4.5B effective params, 8B with embeddings, 128K context, audio support
  • 31B — 31B dense, 256K context
  • 26B A4B — MoE, 4B active / 26B total, 256K context

The one I care about is the 26B A4B. It scores 1441 on LMArena, just 11 points below the 31B dense model (1452), with only 4B active parameters. That’s the headline.

Why 4B active matters

In a Mixture-of-Experts model, all the weights are loaded into memory, but only a fraction of them are activated per token. The 26B A4B takes about 16GB of memory (at UD-Q4_K_M quantization), but the compute cost per token is roughly that of a 4B dense model.
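To make "activate a fraction per token" concrete, here is a minimal sketch of generic top-k expert routing. It is not Gemma 4's actual router: the expert count, the value of k, and the function names top_k_route and moe_forward are all illustrative assumptions.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits (ties keep index order).
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the chosen logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the routed experts on x and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy setup (hypothetical numbers): 8 experts, 2 active per token.
# Each "expert" here is just a scalar multiply standing in for an FFN.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
routes = top_k_route(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
```

All eight experts have to sit in memory because the router can pick any of them on the next token, but only two of them do any work for this one.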

For local inference, that translates to:

  • Memory: you need 16GB headroom — fine on my hardware with 128GB unified RAM
  • Speed: inference is memory-bandwidth-bound, and only ~2.5GB of weights are “hot” per token (4B active parameters at roughly 5 bits each)

On my AMD Ryzen AI Max+, I’d estimate 70-100 tok/s for the 26B A4B once the rebuild is done. Compare that to devstral (14B active from a 24B MoE, runs at ~45 tok/s on the same hardware) — Gemma 4’s active slice is smaller, so it should be noticeably faster.
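The arithmetic behind those numbers can be sketched in a few lines. Two values here are my assumptions, not measurements: roughly 5 bits per weight for a Q4_K_M-style quant, and 256 GB/s of theoretical peak memory bandwidth (check the spec for your own hardware). The helper names are made up for this example.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def tok_per_s_ceiling(bandwidth_gb_s, hot_gb_per_token):
    """Upper bound if each token must stream the hot weights from memory once."""
    return bandwidth_gb_s / hot_gb_per_token

BITS_PER_WEIGHT = 5.0    # assumption: rough average for a Q4_K_M-style quant
BANDWIDTH_GB_S = 256.0   # assumption: theoretical peak, not a measured figure

total_gb = weight_gb(26, BITS_PER_WEIGHT)   # all experts stay resident
hot_gb = weight_gb(4, BITS_PER_WEIGHT)      # active slice read per token
ceiling = tok_per_s_ceiling(BANDWIDTH_GB_S, hot_gb)

print(f"resident ~{total_gb:.1f} GB, hot ~{hot_gb:.1f} GB, "
      f"ceiling ~{ceiling:.0f} tok/s")
```

This is a ceiling, not a prediction. KV cache reads, attention compute, and routing overhead all push real throughput below it, which is why the estimate above is a range rather than a single number.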

The live setup now has the model wired through llama.cpp/llamactl, so the important question is no longer whether it launches — it’s how useful that 4B-active compute profile feels in practice.

Architecture changes worth knowing

Gemma 4 introduced a few things that aren’t just buzzwords:

Per-Layer Embeddings (PLE) — a second embedding table that feeds a small residual signal into every decoder layer. The idea is to give the model more nuanced token representations at each depth. The small variants (E2B, E4B) use this to punch above their weight class.

Shared KV Cache — the last N transformer layers reuse key-value states from earlier layers, eliminating redundant KV projections. This cuts memory and speeds up inference, especially for long contexts.
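A rough way to see the memory effect is to count only the layers that store their own keys and values. Every shape value below is a hypothetical placeholder, not Gemma 4's real configuration, and the sketch ignores that sliding-window layers cap their own cache at the window size.

```python
def kv_cache_gb(layers_with_own_kv, kv_heads, head_dim, context_tokens,
                bytes_per_value=2):
    """KV cache size in GB: keys plus values for each layer that stores its own."""
    per_token_bytes = 2 * layers_with_own_kv * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9

# Hypothetical shapes for illustration only.
LAYERS, SHARED_LAYERS = 30, 10          # last 10 layers reuse earlier KV states
KV_HEADS, HEAD_DIM, CONTEXT = 8, 128, 131072

baseline = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
shared = kv_cache_gb(LAYERS - SHARED_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
saving = 1 - shared / baseline

print(f"baseline {baseline:.1f} GB, with sharing {shared:.1f} GB, "
      f"saving {saving:.0%}")
```

The saving scales linearly with context length, so it matters most at exactly the long contexts where the KV cache starts to rival the weights in size.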

Alternating attention — sliding-window local attention (512 or 1024 token windows) alternates with global full-context attention. Classic efficiency trick, well-implemented here.
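As a small illustration of the pattern, the sketch below marks which key positions a query can see in a local layer versus a global one. The ratio of local to global layers (every 6th layer global) is a hypothetical choice for the example, and the function names are mine.

```python
def can_attend(query_pos, key_pos, window=None):
    """Causal attention; if window is set, only the latest `window` tokens are visible."""
    if key_pos > query_pos:
        return False  # never look at future tokens
    return window is None or query_pos - key_pos < window

def layer_windows(num_layers, window, global_every):
    """Per-layer window size; None marks a global (full-context) layer."""
    return [None if (i + 1) % global_every == 0 else window
            for i in range(num_layers)]

# Hypothetical layout: 12 layers, every 6th is global, the rest use 512 tokens.
windows = layer_windows(num_layers=12, window=512, global_every=6)

# How many keys the last token of a 4096-token prompt can see in each layer type.
visible_local = sum(can_attend(4095, k, 512) for k in range(4096))
visible_global = sum(can_attend(4095, k, None) for k in range(4096))
```

Local layers do a fixed amount of work per token regardless of prompt length, while the occasional global layer keeps long-range information reachable.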

Variable aspect ratio vision — the image encoder can tokenize to different budgets (70, 140, 280, 560, or 1120 tokens), letting you trade off speed vs. quality for vision tasks. This is more practical than a fixed-size patch approach.

The multimodal angle

All Gemma 4 models support image + text input. The E2B and E4B also support audio. The 26B A4B handles images and video (as frame sequences) but not audio.

For my use case, running a local AI assistant, image support in the 26B A4B is interesting. The 16GB GGUF I already have is text-only; to unlock vision you also need to download a separate mmproj-BF16.gguf vision encoder (~1-2GB). I haven’t pulled it yet.

The blocker: llama.cpp

The 26B A4B GGUF I downloaded today fails to load on our current llama.cpp build (b8559) with unknown model architecture: 'gemma4'. Support landed in master (commits b069b10ab and 5208e2d5b) within the last 24 hours.

We’re 101 commits behind. A rebuild is overdue anyway — there’s also Llama 4 Scout support, Vulkan MoE GEMV optimizations, and reasoning_format=none for gpt-oss models in that backlog.

Once the rebuild is done, I’ll run the actual benchmarks and update this post.

First impressions without benchmarks

Even without numbers, a few things stand out:

  1. The score/cost tradeoff is real. 1441 LMArena at 4B active params is better than anything we’ve seen at this efficiency tier. qwen35-35b (our current workhorse) scores around 1350 and runs with 3.5B active params — similar active compute, but Gemma 4’s absolute quality is higher.

  2. Apache 2.0. Google keeps shipping fully open licenses. This matters for derivative work, fine-tuning, and commercial use.

  3. Day-0 ecosystem. llama.cpp, MLX, transformers, WebGPU — all shipped same day. Google coordinated this well.

  4. 256K context. We’re not context-limited anymore for most tasks.

The open question is tool-calling reliability. That’s what actually determines whether a model is useful for agentic workflows. I’ll test that specifically once we’re running.
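For the tool-calling test, a minimal harness could look like the sketch below. It assumes the model is served behind an OpenAI-compatible /v1/chat/completions endpoint (llama-server exposes one) and that replies carry tool calls in the usual tool_calls field. The base URL, the test cases, and the names ask, tool_call_ok, and pass_rate are hypothetical, not part of my current setup.

```python
import json
import urllib.request

def ask(base_url, messages, tools, timeout=120):
    """POST one chat request and return the assistant message from the reply."""
    body = json.dumps({"messages": messages, "tools": tools}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]

def tool_call_ok(message, expected_name, expected_args):
    """True if the first tool call has the right name and exactly the expected arguments."""
    calls = message.get("tool_calls") or []
    if not calls:
        return False  # the model answered in prose instead of calling a tool
    fn = calls[0].get("function", {})
    raw = fn.get("arguments", "")
    if isinstance(raw, dict):
        args = raw  # some servers return already-parsed arguments
    else:
        try:
            args = json.loads(raw)
        except (TypeError, ValueError):
            return False  # malformed JSON counts as a failure
    return fn.get("name") == expected_name and args == expected_args

def pass_rate(outcomes):
    """Fraction of True outcomes; 0.0 for an empty run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Usage sketch (needs a running server; URL and cases are placeholders):
# outcomes = [tool_call_ok(ask("http://localhost:8080", msgs, tools), name, args)
#             for msgs, tools, name, args in cases]
# print(f"pass rate: {pass_rate(outcomes):.0%}")
```

Exact-match scoring on arguments is deliberately strict. For agentic workflows, a call with the right function and a subtly wrong argument is still a failure.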


Draft. Benchmarks pending llama.cpp rebuild. Will update with real tok/s numbers and tool-calling pass rate.