Meta dropped Llama 4 this week. Two models: Scout (17B active / 109B total, 16 experts) and Maverick (17B active / 400B total, 128 experts). Both are natively multimodal MoE models; Scout ships with a claimed 10 million token context window (Maverick's is 1M).

I wanted to know one thing: can I run it locally, today, on AMD hardware?

The answer is yes — and faster than I expected.

What Is Llama 4 Scout?

Quick specs for the people who skip intros:

  • Architecture: Mixture-of-Experts, 16 experts, 17B active per token, 109B total
  • Context: 10M tokens (yes, really)
  • Modalities: Text + vision (text-only in GGUF for now)
  • License: Llama 4 Community License (more permissive than Llama 3, commercial use allowed up to 700M MAU)
  • Benchmark claims: Beats Gemma 3 27B and Mistral Small 3.1 on most reported tasks; competitive with GPT-4o mini

The key number is 17B active parameters. That’s what actually runs during inference. The 109B figure is the full model including all experts — you only fire a fraction of them per token. Same trick as DeepSeek, Qwen3.5, our local Kimi-Linear. MoE is the dominant architecture for running large-but-fast models locally right now.
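The routing idea is easy to sketch. The following is a toy illustration, not Llama 4's actual router: expert count mirrors Scout's reported config (16 routed experts, top-1 routing plus one always-on shared expert, per Meta's description), but dimensions and weights are made up.

```python
import numpy as np

# Toy MoE forward pass (NOT Llama 4's real code). Scout-like assumptions:
# 16 routed experts, top-1 routing, one shared expert that always runs.
# Only a small fraction of the total weights fire per token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 1

router = rng.standard_normal((d_model, n_experts))          # router weights
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02
shared = rng.standard_normal((d_model, d_model)) * 0.02     # shared expert

def moe_forward(x):
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                    # top-k expert ids
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                                    # softmax over top-k
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))
    return routed + x @ shared, chosen

x = rng.standard_normal(d_model)
y, picked = moe_forward(x)
print(picked)  # only 1 of the 16 routed experts ran for this token
```

The point of the sketch: compute scales with the 1 chosen expert plus the shared expert, while memory scales with all 16 — exactly the 17B-active / 109B-total split.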

The GGUF Situation

Unsloth had GGUFs live within a day of the release. No Q4_K_M — they’re using their Dynamic v2.0 format:

  Quantization   Size     Accuracy
  UD-IQ1_S       33.8 GB  Ok
  UD-IQ1_M       35.4 GB  Fair
  UD-IQ2_XXS     38.6 GB  Better
  UD-Q2_K_XL     42.2 GB  Suggested
  UD-Q3_K_XL     52.9 GB  Great
  UD-Q4_K_XL     65.6 GB  Best

Text-only for now. Vision support in GGUF is coming but not yet shipped.
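A quick sanity check on what those file sizes mean in bits per weight, dividing by the 109B total parameter count from the spec list above (rough arithmetic; GB vs GiB ambiguity aside):

```python
# Effective bits per weight = file size / total params. Note that MoE
# quantization covers ALL 109B expert weights, not just the 17B active ones.
TOTAL_PARAMS = 109e9
sizes_gb = {
    "UD-IQ1_S": 33.8, "UD-IQ1_M": 35.4, "UD-IQ2_XXS": 38.6,
    "UD-Q2_K_XL": 42.2, "UD-Q3_K_XL": 52.9, "UD-Q4_K_XL": 65.6,
}
for name, gb in sizes_gb.items():
    bits = gb * 1e9 * 8 / TOTAL_PARAMS
    print(f"{name}: ~{bits:.2f} bits/weight")
```

The "1-bit" quants land around 2.5 effective bits per weight, presumably because the Dynamic format keeps sensitive tensors at higher precision; Q4_K_XL works out to roughly 4.8 bits.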

Does llama.cpp Support It?

This was my first question. A new architecture means a new model type in llama.cpp — it can’t just magically load something it doesn’t understand.

I checked our build. LLM_ARCH_LLAMA4 is already in the source: Scout (17B/16E) and Maverick (17B/128E) are both defined, the chat template is wired in, the MoE router logic is there. This architecture landed in llama.cpp very quickly — within days of the model release.

So: no rebuild required if you’re on a recent build.
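If you'd rather check your own checkout than trust a blog post, a quick scan for the arch enum works. The helper name `supports_arch` and the `llama-arch.*` file pattern are my assumptions about llama.cpp's current source layout, so adjust the glob if the tree moves:

```python
# Hedged sketch: check whether a llama.cpp checkout defines a given
# architecture enum by scanning its source files. File layout assumed.
from pathlib import Path

def supports_arch(repo: Path, arch_enum: str = "LLM_ARCH_LLAMA4") -> bool:
    for src in repo.rglob("llama-arch.*"):     # e.g. src/llama-arch.cpp/.h
        if arch_enum in src.read_text(errors="ignore"):
            return True
    return False

# Usage: supports_arch(Path("~/src/llama.cpp").expanduser())
```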

Hardware Fit (AMD Strix Halo)

AMD Ryzen AI Max+ 395 with 128GB unified memory. The GPU and CPU share the same pool — no PCIe transfers, no separate VRAM limit. Everything sits in one flat address space.

At 65.6GB for Q4_K_XL, Scout loads comfortably with ~60GB left for KV cache and activations. Even Q3_K_XL at 52.9GB gives you a generous buffer for larger context windows. Maverick at a comparable quant would be far larger (400B total parameters vs 109B); I haven't verified a file size, but it's likely 230GB+, which doesn't fit on a single 128GB machine.
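Back-of-envelope on the fit, including what the KV cache eats at various context lengths. The layer, head, and head-dim numbers below are placeholder assumptions, not verified Scout config values:

```python
# Unified-memory budget sketch for Scout Q4_K_XL on a 128 GB machine.
# Dimensions marked "assumed" are illustrative, NOT verified Scout config.
TOTAL_GB   = 128
MODEL_GB   = 65.6        # UD-Q4_K_XL file size
N_LAYERS   = 48          # assumed
N_KV_HEADS = 8           # assumed (GQA)
HEAD_DIM   = 128         # assumed
BYTES_FP16 = 2

def kv_cache_gb(n_tokens: int) -> float:
    # 2x for K and V, per layer, per KV head, per head dimension
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16 * n_tokens / 1e9

for ctx in (32_768, 131_072, 1_000_000):
    kv = kv_cache_gb(ctx)
    print(f"{ctx:>9,} tokens: KV ~{kv:6.1f} GB, "
          f"headroom ~{TOTAL_GB - MODEL_GB - kv:6.1f} GB")
```

Under these assumptions an FP16 cache stops fitting well before 1M tokens, which is why very long contexts typically lean on KV-cache quantization (llama.cpp's `--cache-type-k` / `--cache-type-v` flags).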

Scout was clearly designed to be the consumer-friendly variant.

What I Haven’t Done Yet

I’m writing this before benchmarking. I don’t have numbers yet — this post is about the access question, not performance. What I know from the architecture:

  • 17B active parameters should be noticeably faster than 35B-active models like Mistral Small 4
  • The 10M context window is architecturally native, not a fine-tune trick: Llama 4 interleaves local chunked-attention layers (with RoPE) and periodic global attention layers with no positional embedding, a scheme Meta calls iRoPE
  • Tool calling support: unknown, needs testing. This matters a lot for my workflows.
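A rough picture of how an iRoPE-style hybrid attention schedule looks. The 1-in-4 global ratio and the layer count here are illustrative assumptions, not confirmed Llama 4 config:

```python
# Sketch of an iRoPE-style layer schedule: most layers are local chunked
# attention with RoPE; every GLOBAL_EVERY-th layer is global attention
# with no positional encoding (NoPE). Ratio/count are assumptions.
N_LAYERS, GLOBAL_EVERY = 12, 4

schedule = [
    "global-NoPE" if (i + 1) % GLOBAL_EVERY == 0 else "local-RoPE"
    for i in range(N_LAYERS)
]
print(schedule)
```

Only the global layers attend across the full context; the local layers see a fixed-size chunk, which is what keeps attention cost from exploding as the window stretches toward 10M tokens.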

The actual speed test, tool-calling evaluation, and quality benchmarks are coming. But first I want to spend some time with it.

The Bigger Picture

This release is meaningful for a few reasons.

First, Meta is shipping competitive frontier-class models as open weights again. Llama 3 was good. The Llama 3.x line was incremental. Llama 4 Scout, based on early reports, is genuinely in the conversation with the best proprietary models at instruction-following tasks. The community benchmarks are early, but they look promising.

Second, MoE has clearly won for local inference. Every model I’m running locally is MoE now: Qwen3.5-35B-A3B (3B active), Kimi-Linear-48B (3B active), Llama 4 Scout (17B active). The days of running dense 13B models as your “capable local model” feel like a long time ago.

Third — and this is the hardware angle — the unified memory architecture of Strix Halo turns out to be exactly right for this moment. When models ship as 65GB GGUFs, you want that to be a non-event. On discrete GPU hardware with 24GB of VRAM, you're offloading layers to system RAM and paying bandwidth penalties on every forward pass. Here it just loads.

Next

I’ll run actual benchmarks once I’ve downloaded Q4_K_XL and let it warm up. Speed (tok/s), tool-calling reliability, coding quality — the same battery I ran on Mistral Small 4 and gpt-oss.

If the numbers hold up, this might be the new default model for tasks that need more than 35B-quality reasoning.

Stay tuned.