The Ryzen AI Max+ 395 (Strix Halo) has 128GB of unified memory that the GPU can access at full bandwidth. That’s a completely different proposition from any previous consumer hardware. I wanted to know: what does that actually mean for local LLM inference speed?

So I ran all the models I had through a standardised benchmark. Here’s what the numbers look like.

The Setup

  • Hardware: AMD Ryzen AI Max+ 395, 128GB LPDDR5x-8000 unified memory
  • Inference: llama.cpp (b8123) with Vulkan GPU backend
  • Router mode: All models loaded on-demand, LRU cache of 3
  • Test task: Summarise a ~4,400-word document (7,624 tokens after prompt formatting)
  • Metrics captured: Prefill speed (reading the prompt), generation speed (writing the output), wall time
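The router's on-demand loading with an LRU cache of 3 behaves like the eviction policy sketched below. This is an illustration of the caching logic only; the `ModelRouter` class and `handle:` strings are hypothetical stand-ins, not the router's real API:

```python
from collections import OrderedDict

class ModelRouter:
    """Keeps at most `capacity` models resident; evicts the least recently used."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.loaded = OrderedDict()  # model name -> handle, oldest first

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)  # evict the LRU model
        self.loaded[name] = f"handle:{name}"  # stand-in for a real model load
        return self.loaded[name]

router = ModelRouter(capacity=3)
for name in ["lfm2", "gpt-oss-20b", "devstral", "lfm2", "glm-4.7-flash"]:
    router.get(name)
print(list(router.loaded))  # ['devstral', 'lfm2', 'glm-4.7-flash']
```

The practical consequence shows up in the wall times: the first request to an evicted model pays the full load cost again.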

I tracked two speeds because they tell different stories. Prefill is parallelisable: the GPU processes all the input tokens at once. Generation is sequential: one token at a time. For interactive use, generation speed is what you feel most. For long-context RAG or document analysis, prefill speed matters too.
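The two speeds combine into wall time roughly as prompt_tokens/prefill_rate + output_tokens/generation_rate. A quick sanity check against the gpt-oss-20b run, assuming a hypothetical ~1,000-token summary (output length wasn't recorded here):

```python
def est_wall_time(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Rough wall time: read the prompt in parallel, then emit tokens one by one."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

# Prefill/generation rates are the measured gpt-oss-20b figures;
# the ~1,000-token output length is an assumption, not a measurement.
t = est_wall_time(7624, 1000, prefill_tps=1040, gen_tps=70)
print(f"{t:.1f}s")  # 21.6s
```

That lands close to the measured 24s wall time once model-loading overhead is added on top.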

Results

Model                        Params   Active   Prefill tok/s   Gen tok/s   Wall time
LFM2-24B-A2B Q4_K_M          24B      2B       1190            109         20s
gpt-oss-20b Q8_0             20B      3.6B     1040            70          24s
Nemotron-3-Nano-30B Q4_K_M   30B      ~4B      798             61          41s
Qwen3-Coder-30B-A3B Q4_K_M   30B      3B       779             66          31s
Qwen3.5-35B-A3B-UD Q4_K_M    35B      3B       691             49          50s
Qwen3.5-35B-A3B Q4_K_M       35B      3B       667             39          38s
GLM-4.7-Flash Q5_K_M         30B      ~4B      484             48          53s
gpt-oss-120b Q4_K_M          120B     ~5B      404             52          74s
Devstral-Small-2 Q4_K_M      24B      24B      281             14          70s

Wall time includes model loading if not already cached. Repeated runs on a warm model are faster.

The Surprise Winner: LFM2-24B

LFM2-24B from Liquid AI hit 109 tok/s generation and 1190 tok/s prefill. That’s not a typo. It processed 7,624 input tokens in under 7 seconds and finished the whole task in 20 seconds.

For context: that’s the fastest result in the table by a wide margin, ahead of models with larger total and active parameter counts alike. The LFM2 architecture (Liquid Foundation Model) isn’t a standard transformer; Liquid AI builds it as a hybrid design that grew out of their liquid neural network research. The MoE variant here has only 2B active parameters per forward pass despite 24B total weights.

The practical result: it feels instant. You get responses before you’ve finished reading the prompt.

Why the 120B Model Isn’t as Slow as You’d Expect

GPT-OSS 120B generates at 52 tok/s despite its 120B total parameters. That’s nearly 4x faster than Devstral, a dense model a fifth of its size.

The reason is MoE (Mixture of Experts). GPT-OSS 120B routes each token through a small subset of experts, activating roughly 5B parameters per token; the rest of the weights don’t participate in that forward pass. Memory bandwidth determines generation speed more than compute, so fewer active parameters means faster generation, regardless of total model size.

Devstral pays the price for being a dense model. Every one of its 24B parameters participates in every token. Dense transformers are inherently bandwidth-limited at generation time.
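A rough roofline shows why. If every active weight must be read once per generated token, bandwidth divided by active bytes is a hard ceiling on generation speed. The 256 GB/s figure (LPDDR5x-8000 on a 256-bit bus) and ~4.5 bits per Q4_K_M weight are my assumptions, not benchmark output:

```python
BANDWIDTH_GBPS = 256        # assumed: LPDDR5x-8000 x 256-bit bus peak
BYTES_PER_WEIGHT = 4.5 / 8  # assumed: Q4_K_M averages ~4.5 bits/weight

def gen_ceiling(active_params_b):
    """Upper bound on tok/s if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_WEIGHT
    return BANDWIDTH_GBPS * 1e9 / bytes_per_token

print(f"dense 24B ceiling: {gen_ceiling(24):.0f} tok/s")  # 19 tok/s
```

Devstral’s measured 14 tok/s sits under that ~19 tok/s ceiling, which is exactly what a bandwidth-bound dense model looks like.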

The Prefill Story

Prefill speed scales roughly with how much work each architecture does per input token. Unlike generation, the prompt is processed in parallel, so weight reads are amortised across many tokens and compute matters as much as bandwidth. The numbers here:

  • LFM2: 1190 tok/s — the lightest per-token workload of the group
  • GPT-OSS 20B: 1040 tok/s — similarly efficient
  • Nemotron/Qwen3-Coder: 750-800 tok/s — slightly more parameters per layer
  • Qwen3.5-35B variants: 650-690 tok/s — heavier
  • GPT-OSS 120B: 404 tok/s — even activating only ~5B per token, routing across 120B of weights has overhead
  • Devstral: 275-283 tok/s — dense, no skipping

For a RAG pipeline reading a 50-page document before answering a question, LFM2 finishes the read phase 4x faster than Devstral.
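To put that 4x in wall-clock terms, assume a hypothetical ~500 tokens per page, so a 50-page document is about 25,000 tokens of prefill:

```python
doc_tokens = 50 * 500  # assumed ~500 tokens per page, 50 pages

# Measured prefill rates from the results table above.
for model, prefill_tps in [("LFM2", 1190), ("Devstral", 281)]:
    print(f"{model}: {doc_tokens / prefill_tps:.0f}s to read the document")
# LFM2: 21s vs Devstral: 89s, roughly a 4.2x gap on the read phase alone
```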

Summary Quality

Speed is one dimension. Coherence is another. Here’s a quick impression from the summaries each model produced (all summarising the same ~4,400-word document):

Excellent: gpt-oss-120b (comprehensive, well-structured, caught every key point), gpt-oss-20b (equally coherent, just more concise)

Very good: Nemotron-30B (clear organisation, good detail level), Devstral (structured headers, accurate coverage)

Good: Qwen3-Coder-30B (solid but slightly brief), Qwen3.5-35B (dense, technical accuracy)

Needs tuning: LFM2-24B (slightly confused author identity — minor hallucination), GLM-4.7-Flash (output format issue with my prompt — speed data is valid, quality needs retesting)

So the fastest model doesn’t write the prettiest summaries. GPT-OSS 20B remains the sweet spot: 70 tok/s generation, 82% on capability benchmarks, excellent prose quality. LFM2 wins when you need volume — bulk processing, rapid search, or real-time workflows where latency is the constraint.

What Failed

A few models errored out:

  • Qwen3.5-122B (70GB+ model): connection aborted — likely hitting a timeout during the loading phase rather than OOM, since 128GB should fit it. Worth testing in isolation with a dedicated llama-server instance.
  • Kimi-Linear-48B: similar connection abort — SSM/attention hybrid architecture may behave differently under the router
  • TeichAI reasoning distills (8B/14B): model name mismatch in my router config — user error, not the model’s fault. Will fix and retest.

Architecture Takeaways

From this data, a few things are clear:

  1. MoE wins on inference hardware with large memory. When you have 128GB to play with, the question isn’t “will it fit” but “how many active parameters per forward pass.” Smaller active parameter counts = faster generation.

  2. Generation is memory bandwidth; prefill leans on compute. Generation speed scales with how many bytes of weights each token has to touch, while prefill amortises those reads across the whole prompt. Either way, unified memory at LPDDR5x-8000 bandwidth is fast enough that even 120B models feel usable.

  3. Architecture beats parameter count. LFM2-24B is faster than GPT-OSS-120B on generation and prefill. The model family matters more than the headline number.

  4. Dense models are bandwidth-limited at every token. Devstral’s 13-14 tok/s isn’t a weakness of the model — it’s the inherent cost of dense transformer generation. For coding tasks where output quality matters more than speed, it’s still a valid choice.
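One way to sanity-check the bandwidth argument is to invert the table: active bytes per token times measured generation speed gives the effective bandwidth each model actually achieved. The ~0.56 bytes/weight for Q4_K_M and the ~5B active figure for gpt-oss-120b (OpenAI's published count) are assumptions on my part:

```python
BYTES_PER_WEIGHT = 4.5 / 8  # assumed average for Q4_K_M quantisation

rows = [  # (model, active params in billions, measured gen tok/s)
    ("LFM2-24B-A2B", 2.0, 109),
    ("gpt-oss-120b", 5.0, 52),
    ("Devstral-Small-2", 24.0, 14),
]
for name, active_b, tps in rows:
    gbps = active_b * 1e9 * BYTES_PER_WEIGHT * tps / 1e9
    print(f"{name}: ~{gbps:.0f} GB/s effective")  # ~123, ~146, ~189
```

All three land in the same rough band, well under the theoretical peak, which is the signature of a bandwidth-bound regime rather than a compute-bound one.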

What I’d Test Next

  • Qwen3-Coder-Next (80B-A3B): 3B active parameters in an 80B MoE. If the pattern holds, this should hit 80-100 tok/s generation — Qwen’s answer to LFM2 for coding.
  • Qwen3.5-122B in isolation: load it fresh with a standalone server, no router overhead, to see if the 74GB model actually runs cleanly on 128GB.
  • TeichAI reasoning distills: fix the model names, rerun. 14B at Q8 = 14GB, should be fast.
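The 80-100 tok/s guess for an A3B model follows from simple bandwidth arithmetic. The ~150 GB/s effective-bandwidth figure and ~4.5 bits per Q4_K_M weight are assumptions, not measurements:

```python
EFFECTIVE_GBPS = 150        # assumed effective bandwidth during generation
BYTES_PER_WEIGHT = 4.5 / 8  # assumed Q4_K_M average

active_bytes = 3e9 * BYTES_PER_WEIGHT  # 3B active parameters per token
print(f"~{EFFECTIVE_GBPS * 1e9 / active_bytes:.0f} tok/s")  # ~89 tok/s
```

If an 80B-A3B model keeps the same 3B active footprint, generation speed should stay near the 30B-A3B results regardless of total size; only loading time and memory footprint grow.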

The raw data and model cards are in my notes. Numbers will drift slightly between runs depending on what else is loaded — treat these as ballpark figures, not silicon truth.

If you’re running similar hardware and getting different numbers, I’d be curious what you’re seeing.