Mistral Small 4: A 119B Model That Fits in 70GB and Actually Runs Fast
Mistral released Small 4 on March 16th. 119 billion parameters, Mixture-of-Experts architecture, 128 experts with 4 active per token (6.5B active parameters), multimodal, 256K context, Apache 2.0. I downloaded the Unsloth UD-Q4_K_XL GGUF (three shards, ~69GB total) and ran it locally.
Here’s what I found.
The Hardware
AMD Ryzen AI Max+ 395 (Strix Halo), 128GB LPDDR5x-8000 unified memory. The GPU shares the same memory pool as the CPU — no PCIe bandwidth bottleneck, no separate VRAM limit. The entire 69GB model loads into memory and the GPU can address all of it directly. This architecture is genuinely unusual and explains why local inference on this hardware punches above its weight.
What Makes Mistral Small 4 Different
The MoE design is the key thing. 119 billion total parameters, but only about 6.5 billion are active during any single inference pass. The model routes each token through 4 of 128 experts. The implication: you get the knowledge capacity of a 119B model at roughly the compute cost of a 7-8B dense model.
The other notable feature is unified capability. Mistral is pitching Small 4 as a single model replacing their Instruct, Reasoning, and Devstral (code) variants. Reasoning can be toggled per-request. The context window is 256K — more than you’d practically use locally, but it’s there.
Weights are quantized from FP8 originals. The Unsloth UD-Q4_K_XL variant is the practical choice for consumer hardware: 3 GGUF shards, ~69GB on disk, fits comfortably within 128GB unified memory.
Benchmark Results
I ran this on llama.cpp (build b8416, fresh rebuild — Mistral Small 4 required architecture support not in earlier builds). All tests use the model’s native embedded chat template with 32K context, 8 threads, flash attention enabled.
Speed (generation):
| Test | Tokens generated | Time | Speed |
|---|---|---|---|
| Long text (transformer explainer) | 600 | 18.6s | 32.3 tok/s |
| Math reasoning | 200 | 7.4s | 27.2 tok/s |
| Coding (Python stats function) | 100 | 4.0s | 24.9 tok/s |
Average across tests: ~28 tok/s generation speed. For context, Qwen3.5-35B-A3B (our previous workhorse, 20GB) runs at around 35 tok/s on the same hardware. Mistral Small 4 is somewhat slower — expected given the model is 3.5x larger on disk — but still very usable for interactive work.
Quality spots:
Math (train speed problem):
“To determine the total distance traveled and the average speed for the whole journey, let’s break it down step by step. [Correctly computed: 150mi + 120mi = 270mi total, average speed = 67.5 mph]”
Correct, clear, step-by-step. No fluff.
Coding:
import statistics
from typing import List, Tuple
def compute_stats(numbers: List[float]) -> Tuple[float, float, float, float]:
"""Compute mean, median, mode, and standard deviation for a list of numbers."""
return (
statistics.mean(numbers),
statistics.median(numbers),
statistics.mode(numbers),
statistics.stdev(numbers) if len(numbers) > 1 else 0.0
)
Concise, correct, uses stdlib as requested. Handles the edge case (single-element list for std_dev). This is good code.
Tool calling:
Asked “What’s the weather in Berlin?” with a get_weather tool available. Result: get_weather({"location": "Berlin"}) — clean tool invocation, correct parameter, no extra garbage. Finish reason: tool_calls.
Multimodal: The model ships with a visual projection layer (mmproj-F16.gguf, ~1.4GB). I haven’t tested image understanding yet — that’s a follow-up post.
Caveat: Chat Template
One gotcha I hit: my inference server’s global config was overriding the model’s embedded chat template with a Qwen3.5 jinja file. Mistral Small 4 uses a completely different format (tekken tokenizer, not ChatML), so the <|im_start|> tokens were bleeding through as literal text in responses. Fix is to let the model use its own embedded template — either remove the global override or add a per-instance exception. Worth checking if you’re using a model manager that applies a default template to all instances.
How It Compares
| Model | Active Params | Size on Disk | Speed (tok/s) | Notes |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | ~3B | 20GB | ~35 | Current workhorse |
| Mistral Small 4 | ~6.5B | 69GB | ~28 | 3.5x more storage, slightly slower |
| Qwen3.5-122B-A10B | ~10B | 70GB | ~22 | Quality-first option |
Mistral Small 4 sits between the 35B MoE and the 122B MoE in terms of disk/speed. Quality-wise, it feels closer to the larger model. The reasoning toggle (per-request, no separate model needed) is genuinely convenient.
The Quality Caveat
The LocalLLaMA subreddit is skeptical. The sentiment is essentially: 6.5B active parameters should beat a dense 7B model, but it’s not beating Qwen3.5-122B-A10B which activates 10B parameters per token. This is a fair observation — MoE models trade total parameters for active compute efficiency, and the quality ceiling is roughly set by active parameter count, not total count.
My own testing aligns with this. The outputs are good — not hallucinating, structurally correct, coherent — but not dramatically better than the 35B model I was running before. Where it shines is the reasoning toggle (no separate model to download) and multimodal capability in a single weight file.
Should You Download It?
If you have 128GB unified memory (Strix Halo, M-series Mac) and want a single model that handles general conversation, code, reasoning, and tool calling — yes. The 69GB size is a real storage cost, but you get reasoning and vision in one model.
If you’re on 64GB or a system with separate CPU/GPU memory, the size becomes a problem. Qwen3.5-35B-A3B at 20GB remains the better pick for constrained systems — faster, smaller, still excellent quality.
The Apache 2.0 license is worth emphasizing. Use it commercially. Fine-tune it. Distribute it. This is increasingly rare for models at this capability level.
What’s Next
Full capability benchmark (30 tests across 9 categories) is running. Multimodal tests with the vision projector are next. I’ll update this post or write a follow-up once I have those numbers.
Tested on AMD Ryzen AI Max+ 395, 128GB LPDDR5x-8000 unified memory, Vulkan GPU. llama.cpp b8416. Mistral-Small-4-119B-2603-UD-Q4_K_XL (Unsloth), 3 shards, 69GB total.