Which Local LLM Is Fastest on Ryzen AI Max+ 395? I Benchmarked 10 of Them
The Ryzen AI Max+ 395 (Strix Halo) has 128GB of unified memory that the GPU can access at full bandwidth. That’s a completely different proposition from any previous consumer hardware. I wanted to know: what does that actually mean for local LLM inference speed?
So I ran all the models I had through a standardised benchmark. Here’s what the numbers look like.
The Setup
- Hardware: AMD Ryzen AI Max+ 395, 128GB LPDDR5x-8000 unified memory
- Inference: llama.cpp (b8123) with Vulkan GPU backend
- Router mode: All models loaded on-demand, LRU cache of 3
- Test task: Summarise a ~4,400-word document (7,624 tokens after prompt formatting)
- Metrics captured: Prefill speed (reading the prompt), generation speed (writing the output), wall time
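The router's on-demand loading with an LRU cache of 3 can be sketched roughly like this. This is a simplified model of the behaviour, not the actual router code, and the string handle is a stand-in for whatever spawns a real llama.cpp instance:

```python
from collections import OrderedDict

class ModelRouter:
    """Keep at most `capacity` models resident; evict the least-recently-used."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.loaded = OrderedDict()  # model name -> loaded handle

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return self.loaded[name]
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)  # drop the LRU model
        self.loaded[name] = f"<handle:{name}>"  # stand-in for a real load
        return self.loaded[name]

router = ModelRouter()
for m in ["lfm2-24b", "gpt-oss-20b", "nemotron-30b", "qwen3-coder-30b"]:
    router.get(m)
# After four loads with capacity 3, "lfm2-24b" has been evicted.
```

This is also why wall times below include a load penalty on cold models: the fourth distinct model request always pays for a fresh load.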
I tracked two different speeds because they tell different stories. Prefill is parallelisable — the GPU reads your input tokens simultaneously. Generation is sequential — one token at a time. For interactive use you feel generation speed most. For long-context RAG or document analysis, prefill speed matters too.
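Concretely, both metrics fall out of two timestamps per run: time to first output token (the prefill phase) and total time. A minimal sketch, with illustrative numbers rather than logged ones:

```python
def throughput(prompt_tokens, output_tokens, t_first_token, t_done):
    """Split one timed run into prefill and generation speeds.

    t_first_token: seconds from request start to the first output token,
                   i.e. the time spent reading the prompt (prefill).
    t_done:        total seconds until the last output token arrived.
    """
    prefill_tps = prompt_tokens / t_first_token
    gen_tps = output_tokens / (t_done - t_first_token)
    return prefill_tps, gen_tps

# Example: the 7,624-token prompt, a 700-token summary,
# 6.4s to first token, 12.8s total.
prefill, gen = throughput(7624, 700, 6.4, 12.8)
# prefill ≈ 1191 tok/s, gen ≈ 109 tok/s — LFM2-class numbers.
```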
Results
| Model | Params | Active | Prefill tok/s | Gen tok/s | Wall time |
|---|---|---|---|---|---|
| LFM2-24B-A2B Q4_K_M | 24B | 2B | 1190 | 109 | 20s |
| gpt-oss-20b Q8_0 | 20B | 3.6B | 1040 | 70 | 24s |
| Nemotron-3-Nano-30B Q4_K_M | 30B | ~4B | 798 | 61 | 41s |
| Qwen3-Coder-30B-A3B Q4_K_M | 30B | 3B | 779 | 66 | 31s |
| Qwen3.5-35B-A3B-UD Q4_K_M | 35B | 3B | 691 | 49 | 50s |
| Qwen3.5-35B-A3B Q4_K_M | 35B | 3B | 667 | 39 | 38s |
| GLM-4.7-Flash Q5_K_M | 30B | ~4B | 484 | 48 | 53s |
| gpt-oss-120b Q4_K_M | 120B | ~12B | 404 | 52 | 74s |
| Devstral-Small-2 Q4_K_M | 24B | 24B | 281 | 14 | 70s |
Wall time includes model loading if not already cached. Repeated runs on a warm model are faster.
The Surprise Winner: LFM2-24B
LFM2-24B from Liquid AI hit 109 tok/s generation and 1190 tok/s prefill. That’s not a typo. It processed 7,624 input tokens in under 7 seconds and finished the whole task in 20 seconds.
For context: it outpaced every other model in the table, including ones activating roughly twice as many parameters per token. The LFM2 architecture (Liquid Foundation Model) isn’t a standard transformer: it grew out of Liquid AI’s liquid-neural-network research and leans on convolution-style blocks alongside a small amount of attention. The MoE variant here activates only 2B parameters per forward pass despite 24B total weights.
The practical result: it feels instant. You get responses before you’ve finished reading the prompt.
Why the 120B Model Isn’t as Slow as You’d Expect
GPT-OSS 120B generates at 52 tok/s despite its 120B total parameters. That’s faster than the dense 24B Devstral.
The reason is MoE (Mixture of Experts). GPT-OSS 120B activates roughly 12B parameters per token — the rest of the weights don’t participate in the computation. Memory bandwidth determines generation speed more than compute, so fewer active parameters = faster generation, regardless of total model size.
Devstral pays the price for being a dense model. Every one of its 24B parameters participates in every token. Dense transformers are inherently bandwidth-limited at generation time.
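A back-of-envelope roofline makes the dense-vs-MoE gap concrete. Two assumptions of mine, not measurements: ~256 GB/s effective bandwidth for LPDDR5x-8000 on a 256-bit bus, and ~0.6 bytes per weight at Q4_K_M. If every token has to stream all active weights from memory once, that caps generation speed:

```python
def decode_tps_ceiling(active_params_b, bw_gbps=256, bytes_per_weight=0.6):
    """Rough upper bound on tok/s if each token must read all active
    weights from memory once (ignores KV cache, activations, compute)."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bw_gbps * 1e9 / bytes_per_token

lfm2_cap = decode_tps_ceiling(2)       # 2B active  -> ~213 tok/s ceiling
devstral_cap = decode_tps_ceiling(24)  # 24B dense  -> ~18 tok/s ceiling
```

The observed numbers (109 and 14 tok/s) sit below these ceilings, as they should; the point is the ratio. Cutting active parameters 12x raises the bandwidth ceiling 12x, and no amount of compute buys that back for a dense model.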
The Prefill Story
Prefill leans more on compute than generation does — the GPU pushes all input tokens through the weights in parallel — but it still falls off as each layer adds more per-token work. The numbers here:
- LFM2: 1190 tok/s — lightest per-token work of the group
- GPT-OSS 20B: 1040 tok/s — similarly efficient
- Nemotron/Qwen3-Coder: 750-800 tok/s — slightly more parameters per layer
- Qwen3.5-35B variants: 650-690 tok/s — heavier
- GPT-OSS 120B: 404 tok/s — even activating only ~12B, there’s a lot of model to route through
- Devstral: ~281 tok/s — dense, no skipping
For a RAG pipeline reading a 50-page document before answering a question, LFM2 finishes the read phase 4x faster than Devstral.
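Rough arithmetic behind that claim, assuming ~500 words per page and ~1.3 tokens per word (both assumptions on my part, not measured):

```python
pages, words_per_page, tokens_per_word = 50, 500, 1.3
prompt_tokens = pages * words_per_page * tokens_per_word  # 32,500 tokens

# Prefill speeds from the results table above.
for name, prefill_tps in [("LFM2-24B", 1190), ("Devstral-24B", 281)]:
    print(f"{name}: {prompt_tokens / prefill_tps:.0f}s to read the document")
# LFM2 ≈ 27s, Devstral ≈ 116s — about a 4.2x gap.
```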
Summary Quality
Speed is one dimension. Coherence is another. Here’s a quick impression from the summaries each model produced (all summarising the same ~4,400-word document):
Excellent: gpt-oss-120b (comprehensive, well-structured, caught every key point), gpt-oss-20b (equally coherent, just more concise)
Very good: Nemotron-30B (clear organisation, good detail level), Devstral (structured headers, accurate coverage)
Good: Qwen3-Coder-30B (solid but slightly brief), Qwen3.5-35B (dense, technical accuracy)
Needs tuning: LFM2-24B (slightly confused author identity — minor hallucination), GLM-4.7-Flash (output format issue with my prompt — speed data is valid, quality needs retesting)
So the fastest model doesn’t write the prettiest summaries. GPT-OSS 20B remains the sweet spot: 70 tok/s generation, 82% on capability benchmarks, excellent prose quality. LFM2 wins when you need volume — bulk processing, rapid search, or real-time workflows where latency is the constraint.
What Failed
A few models errored out:
- Qwen3.5-122B (70GB+ model): connection aborted — likely hitting a timeout during the loading phase rather than OOM, since 128GB should fit it. Worth testing in isolation with a dedicated llama-server instance.
- Kimi-Linear-48B: similar connection abort — SSM/attention hybrid architecture may behave differently under the router
- TeichAI reasoning distills (8B/14B): model name mismatch in my router config — user error, not the model’s fault. Will fix and retest.
Architecture Takeaways
From this data, a few things are clear:
- MoE wins on inference hardware with large memory. When you have 128GB to play with, the question isn’t “will it fit” but “how many active parameters per forward pass.” Smaller active parameter counts = faster generation.
- Generation is memory bandwidth. Prefill adds compute to the picture, but both metrics scale with how efficiently the architecture touches its weights. Unified memory at LPDDR5x-8000 speeds is fast enough that even 120B models feel usable.
- Architecture beats parameter count. LFM2-24B is faster than GPT-OSS-120B on generation and prefill. The model family matters more than the headline number.
- Dense models are bandwidth-limited at every token. Devstral’s ~14 tok/s isn’t a weakness of the model — it’s the inherent cost of dense transformer generation. For coding tasks where output quality matters more than speed, it’s still a valid choice.
What I’d Test Next
- Qwen3-Coder-Next (80B-A3B): 3B active parameters in an 80B MoE. If the pattern holds, this should hit 80-100 tok/s generation — Qwen’s answer to LFM2 for coding.
- Qwen3.5-122B in isolation: load it fresh with a standalone server, no router overhead, to see if the 74GB model actually runs cleanly on 128GB.
- TeichAI reasoning distills: fix the model names, rerun. 14B at Q8 = 14GB, should be fast.
The raw data and model cards are in my notes. Numbers will drift slightly between runs depending on what else is loaded — treat these as ballpark figures, not silicon truth.
If you’re running similar hardware and getting different numbers, I’d be curious what you’re seeing.