Bigger Isn't Better: How a 9GB Model Beat 120B Parameters
Everyone in the AI space assumes bigger means better. More parameters, more capability, end of story. I just ran empirical benchmarks that show this is dangerously wrong, at least on consumer hardware.
The Setup
17 local LLMs running on a single machine (AMD Ryzen AI Max+ 395, 128GB unified memory, Vulkan GPU via llama-server). No cloud. No API calls. Everything local.
I built a 13-dimension test suite: reasoning, coding, math, creative writing, summarization, instruction following, tool use, multilingual, long context, structured output, safety, factual accuracy, and task decomposition. Three difficulty levels per dimension: easy, medium, and hard. 39 tests total per model.
The hard tests weren’t softballs. Fact-intersection summarization (identify ONLY claims present in both passages). Conditional multi-step tool calling with branching logic. German error detection requiring cultural knowledge. Bank transaction math across 15 entries with 5 derived questions. The kind of tasks that separate real capability from pattern matching.
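To make the scoring concrete, here is a minimal sketch of how a weighted pass/fail score could be computed across dimensions and difficulty levels. The weights, helper names, and example results are illustrative assumptions, not the suite's actual scoring code.

```python
# Hypothetical scoring scheme: each dimension has an easy, medium, and hard
# test; harder tests are worth more, and the final score is normalized to 0..1.
# Weights and structure are assumptions for illustration only.

WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def score_model(results):
    """results: dict mapping (dimension, level) -> bool (pass/fail)."""
    earned = sum(WEIGHTS[level] for (_, level), passed in results.items() if passed)
    possible = sum(WEIGHTS[level] for (_, level) in results)
    return earned / possible

# Example: a model that passes everything except the hard reasoning test.
dims = ["reasoning", "coding", "math"]
results = {(d, lvl): True for d in dims for lvl in WEIGHTS}
results[("reasoning", "hard")] = False
print(round(score_model(results), 3))  # 0.833 (15 of 18 weighted points)
```

With difficulty weighting like this, a single hard-test failure costs three times as much as an easy one, which is why genuinely hard tests reshuffle a leaderboard so dramatically.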
The Results That Broke My Brain
| Rank | Model | Score | Speed | Size |
|---|---|---|---|---|
| 1 | cogito:14b | 0.892 | 52 tok/s | 9 GB |
| 2 | cogito:8b | 0.846 | 58 tok/s | 5 GB |
| 3 | phi4-mini | 0.808 | 78 tok/s | 2.5 GB |
| 4 | cogito:32b | 0.804 | 40 tok/s | 19 GB |
| 5 | gemma3:4b | 0.785 | 88 tok/s | 3.3 GB |
| … | … | … | … | … |
| 9 | gpt-oss:120b | 0.758 | 87 tok/s | 65 GB |
| … | … | … | … | … |
| 17 | qwen3-coder:30b | 0.200 | 35 tok/s | 18 GB |
Read that again. A 9GB model scored higher than a 120B parameter model that uses 65GB of memory. And the supposed “coding specialist” at 30B came dead last.
The V1 Lie
Here’s what makes this interesting. I ran the test suite twice. The first version (V1) had easier hard tests, with 75%+ pass rates on most dimensions. Under V1, qwen3-coder:30b was #1 at 0.904. gpt-oss:120b tied for #3.
When I rewrote the hard tests to be genuinely challenging, the rankings collapsed:
| Model | V1 Score | V2 Score | Change |
|---|---|---|---|
| qwen3-coder:30b | 0.904 | 0.200 | -0.704 |
| cogito:70b | 0.562 | 0.296 | -0.266 |
| gpt-oss:120b | 0.877 | 0.758 | -0.119 |
| cogito:14b | 0.858 | 0.892 | +0.034 |
| phi4-mini | 0.831 | 0.808 | -0.023 |
The models that looked brilliant on easy tests fell apart on hard ones. The models that were genuinely capable barely moved.
This is a warning about every public benchmark you’ve ever read. If the test isn’t hard enough to differentiate, the leaderboard is noise.
Why Bigger Models Fail on Consumer Hardware
There’s a practical reason dense 70B+ models perform poorly here: on consumer hardware with unified memory, large dense models run significantly slower than MoE (Mixture of Experts) models of similar disk size. cogito:70b runs at 5 tok/s. gpt-oss:120b, being MoE, manages 87 tok/s despite being “larger”, because it only activates a fraction of its parameters per token.
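The arithmetic behind that gap is worth making explicit. Token generation on unified memory is roughly memory-bandwidth-bound: every generated token requires reading all active weights once, so the speed ceiling is bandwidth divided by active-weight bytes. A back-of-envelope sketch, where the 256 GB/s bandwidth figure and the active-weight sizes are assumptions for illustration:

```python
# Decode-speed ceiling on unified memory: tok/s ~ bandwidth / active weights.
# The bandwidth and weight sizes below are illustrative assumptions, not
# measured values from this article's hardware.

BANDWIDTH_GBPS = 256  # assumed LPDDR5x bandwidth for a Strix Halo-class APU

def max_tok_per_s(active_weight_gb):
    """Upper bound on tokens/second if every token reads all active weights."""
    return BANDWIDTH_GBPS / active_weight_gb

# Dense 70B at ~4-bit quantization reads roughly 40 GB per token:
print(round(max_tok_per_s(40), 1))  # 6.4 -- consistent with the measured 5 tok/s

# An MoE model activating only a few billion params reads ~3 GB per token:
print(round(max_tok_per_s(3), 1))   # 85.3 -- consistent with the measured 87 tok/s
```

The measured numbers from the benchmark (5 tok/s dense 70B, 87 tok/s MoE 120B) sit right at these ceilings, which is what you'd expect from a bandwidth-bound workload.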
But speed alone doesn’t explain the quality gap. cogito:14b at 52 tok/s with 9GB outscores gpt-oss:120b at 87 tok/s with 65GB. The smaller model is simply more capable per activated parameter on these tasks.
The MoE Paradox
MoE models dominate the speed charts because they’re architecturally designed for efficiency: each token is routed to a few specialist subnetworks rather than processed by every parameter. On paper, this should give you “big model quality at small model speed.”
In practice? gpt-oss:120b (MoE, 65GB) scored 0.758. gemma3:4b (dense, 3.3GB) scored 0.785. The 3.3GB dense model beat the 65GB MoE model.
MoE gives you speed, not quality. At least not at this scale, on these tasks, with current architectures.
The Sweet Spot
Based on 17 models and 39 tests, the sweet spot for local inference on consumer hardware is clear:
- Daily driver: cogito:8b (5GB, 58 tok/s, 0.846), extraordinary value
- When quality matters: cogito:14b (9GB, 52 tok/s, 0.892), best overall
- Speed critical: phi4-mini (2.5GB, 78 tok/s, 0.808), impressive at this size
Everything above 19GB on consumer hardware gives diminishing or negative returns. Dense 70B is actively worse than dense 14B. Save your memory for running multiple smaller models concurrently.
The One Thing Nobody Can Do
Reasoning. Zero models passed the hard reasoning test. Not the 1.5B model, not the 120B model. The test? A seating arrangement puzzle with 5 constraints. Every model either hallucinated solutions or gave up. Multi-step constraint satisfaction remains the frontier that local models haven’t crossed.
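To make “multi-step constraint satisfaction” concrete, here is a hypothetical five-person seating puzzle in the same spirit (not the suite’s actual test), with a brute-force check showing that the five constraints pin down exactly one arrangement — the kind of answer a model must reason its way to rather than pattern-match:

```python
# Hypothetical seating puzzle: five people in a row, five constraints.
# Brute force over all 120 permutations shows the solution is unique.
from itertools import permutations

PEOPLE = ["Ana", "Ben", "Cal", "Dee", "Eli"]

def valid(seats):
    pos = {p: i for i, p in enumerate(seats)}
    return (
        pos["Ana"] == 0                        # 1. Ana sits at the left end
        and pos["Eli"] == 4                    # 2. Eli sits at the right end
        and pos["Cal"] - pos["Ben"] == 1       # 3. Ben is immediately left of Cal
        and pos["Dee"] > pos["Ben"]            # 4. Dee is somewhere right of Ben
        and abs(pos["Dee"] - pos["Ana"]) > 1   # 5. Dee is not next to Ana
    )

solutions = [s for s in permutations(PEOPLE) if valid(s)]
print(solutions)  # [('Ana', 'Ben', 'Cal', 'Dee', 'Eli')]
```

Each constraint is trivial in isolation; the failure mode is holding all five in mind at once while eliminating candidates, which is exactly where every local model in this benchmark broke down.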
This is where cloud models like Claude still dominate. But for everything else (coding, creative writing, summarization, tool use, factual recall), a well-chosen 8-14B model running locally gives you 85%+ of cloud quality with no network latency and no per-token cost.
Methodology Note
All benchmarks run on llama-server with Vulkan GPU acceleration. Models loaded from Ollama GGUF blobs. Each test has automated pass/fail checks (regex, JSON validation, keyword presence, structural requirements). No subjective ratings. Full results published in the test repository.
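An automated grader in the style described might look like this minimal sketch. The check format and function names are assumptions for illustration, not the article’s actual harness:

```python
# Hypothetical pass/fail grader combining the check types the methodology
# mentions: regex match, keyword presence, and JSON validation.
import json
import re

def grade(output, checks):
    """Return True only if the model output satisfies every check."""
    for kind, arg in checks:
        if kind == "json":
            try:
                json.loads(output)
            except ValueError:
                return False
        elif kind == "regex" and not re.search(arg, output):
            return False
        elif kind == "keyword" and arg.lower() not in output.lower():
            return False
    return True

checks = [("json", None), ("keyword", "paris"), ("regex", r'"capital"\s*:')]
print(grade('{"capital": "Paris"}', checks))   # True
print(grade('The capital is Paris.', checks))  # False (fails JSON validation)
```

Binary checks like these trade nuance for reproducibility: a model either satisfies every structural requirement or it doesn’t, which removes the rater subjectivity that plagues many public leaderboards.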
The test suite, results, and methodology are available for anyone who wants to reproduce or extend them. Stop trusting leaderboards. Run your own tests.
Testing infrastructure: AMD Ryzen AI Max+ 395 (Strix Halo), 128GB unified LPDDR5x, Vulkan GPU, llama-server. 17 models, 39 tests, 13 dimensions.