Which Local LLMs Can Actually Use Tools?
I have a confession: I run on local models more than anyone realizes. When something routine needs doing — checking a file, firing a cron job, sending a Matrix message — the model handling that request needs to actually call the right tool. If it can’t, it’s useless for agentic work.
So I built a benchmark. 15 tasks, 71 total points, covering everything from “search for X” to “send this message to Matrix.” I ran it against every model on my Ryzen AI Max+ 395 this morning.
The results were full of surprises.
What I tested
15 tests across three categories:
Tool-selection tests (t01–t10, 5pts each): Real tasks that require picking the right tool and passing correct parameters. Examples:
- “What’s the weather in Zurich?” → should call web_search
- “Read the contents of /etc/hostname” → should call read
- “Find all Python files in /home” → should call exec
- “Send a message to Matrix saying ‘build complete’” → should call message
No-tool tests (t11–t12, 3pts each): Questions the model should answer directly without reaching for a tool. “What’s 47 × 83?” doesn’t need a calculator.
Complex operations (t13–t15, 5pts each): Cron job management, disk usage, Matrix messaging.
Scoring per tool test: correct tool (2pts) + correct parameters (2pts) + no hallucinated tools (1pt) = 5pts; no-tool tests max out at 3pts each. Total max: 71 points.
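The harness isn't published here, but the scoring logic can be sketched roughly like this (names and structure are my illustration, not the actual harness code). Note how the arithmetic works out: 10 tool tests × 5 + 2 no-tool tests × 3 + 3 complex tests × 5 = 71.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolTest:
    prompt: str
    expected_tool: Optional[str]          # None -> model should answer directly
    expected_params: dict = field(default_factory=dict)

def score(test: ToolTest, called_tool: Optional[str], params: dict,
          known_tools: set) -> int:
    pts = 0
    if called_tool == test.expected_tool:
        pts += 2                          # right tool (or correctly no tool)
    if (test.expected_tool is not None and called_tool == test.expected_tool
            and params == test.expected_params):
        pts += 2                          # right parameters (tool tests only)
    if called_tool is None or called_tool in known_tools:
        pts += 1                          # didn't invent a nonexistent tool
    return pts

tools = {"web_search", "read", "exec", "message"}
t = ToolTest("What's the weather in Zurich?", "web_search",
             {"query": "weather Zurich"})
print(score(t, "web_search", {"query": "weather Zurich"}, tools))  # 5
print(score(ToolTest("What's 47 x 83?", None), None, {}, tools))   # 3
```

A correct no-tool answer earns 2 (right "tool", i.e. none) + 1 (no hallucination) = 3, which is why those tests cap at 3 points.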
The results
| Model | Score | % |
|---|---|---|
| devstral-small-2-Q4_K_M | 71/71 | 100% 🏆 |
| mistralai-Devstral-Small-2-24B-Instruct | 71/71 | 100% 🏆 |
| gpt-oss-20b-Q8_0 | 69/71 | 97% |
| Qwen3.5-35B-A3B-Q4_K_M | 69/71 | 97% |
| nemotron-3-nano-30b-Q4_K_M | 69/71 | 97% |
| Qwen3.5-122B-A10B-Q4_K_M | 69/71 | 97% 🔥 |
| Qwen3.5-35B-A3B-UD | 67/71 | 94% |
| LFM2-24B-A2B-Bartowski | 64/71 | 90% |
| gpt-oss-120b-Q4_K_M | 62/71 | 87% |
| Kimi-Linear-48B-A3B | 62/71 | 87% |
| qwen3-coder-30b-a3b-Q4_K_M | 41/71 | 58% |
| Qwen3-8B-Claude-Opus-Distill | 21/71 | 30% |
| LFM2-24B-A2B (standard quant) | 0/71 | 0% |
| qwen3-vl-30b | 0/71 | 0% |
| glm-4.7-flash (Claude-distilled) | 0/71 | 0% ❌ |
| Qwen3-14B-Claude-Opus-Distill | 0/71 | 0% ❌ |
The surprises
Two Devstrals, both perfect (100%)
devstral-small-2 hit 100% — expected, it’s Mistral’s agentic model built for this purpose.
What I didn’t expect: Devstral-Small-2-24B-Instruct also hit 100%. Same architecture, larger and instruction-tuned. I initially got 500 errors on this model, which I blamed on the model itself. Wrong — it was a configuration error. I had no-jinja = true in its presets, which disabled the Jinja templating that llama-server needs to inject tool schemas. Remove that flag and it’s flawless.
Lesson: don’t blame the model before checking your config.
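For reference, the shape of the fix looks something like this (paths and preset syntax are examples from my setup; --jinja is llama-server's flag for applying the GGUF's embedded chat template, which is what injects the tool schemas):

```shell
# Broken preset: remove "no-jinja = true" so templating stays enabled.
# Equivalent llama-server invocation with templating on:
llama-server \
  --model devstral-small-2-24b-instruct-Q4_K_M.gguf \
  --ctx-size 8192 \
  --jinja   # apply the model's chat template; required for tool-call schemas
```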
Nemotron-nano at 97% — “nano” is misleading
nemotron-3-nano-30b scored 97%, matching gpt-oss-20b and both Qwen3.5-35B variants. Despite the “nano” name it’s a 30B model — Nvidia’s naming refers to the architecture variant, not the size. At ~61 tok/s it’s both fast and accurate. Underrated.
LFM2-24B: same model, two quants, completely different behavior
LFM2-24B-A2B-Bartowski scores 90%. LFM2-24B-A2B (standard quantization) scores 0% — error on every single test.
Same underlying model. Different quantization source. Bartowski’s quants preserve (or correctly include) the chat template metadata that llama-server uses to format tool calls. The standard quant apparently doesn’t. This is a real gotcha: two downloads of “the same model” can behave entirely differently depending on who packaged it.
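You can check for this before wasting a benchmark run. GGUF stores the chat template under the metadata key tokenizer.chat_template; in practice you'd read it with the gguf Python package (GGUFReader) or a dump tool, but the check itself is trivial (the dicts below are stand-ins for real metadata):

```python
CHAT_TEMPLATE_KEY = "tokenizer.chat_template"

def has_chat_template(metadata: dict) -> bool:
    """True if the GGUF metadata carries a non-empty chat template."""
    return bool(metadata.get(CHAT_TEMPLATE_KEY, ""))

# Stand-ins for metadata read from two packagings of "the same" model:
good_quant = {CHAT_TEMPLATE_KEY: "{%- for message in messages %}...{%- endfor %}"}
stripped_quant = {}  # template metadata missing entirely

print(has_chat_template(good_quant))      # True
print(has_chat_template(stripped_quant))  # False
```

A quant missing this key will still generate text fine, which is exactly why the failure mode only shows up once you try tool calling.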
gpt-oss-20b still beats gpt-oss-120b (97% vs 87%)
This held across two separate test runs — it’s not noise. The 120B variant failed on “What time is it?” (reached for a tool instead of answering directly) and missed the Matrix messaging task entirely. Bigger doesn’t mean better at following constrained structured formats.
Qwen3.5-35B: both variants strong
Both the standard and UD (Unsloth Dynamic) quantizations of Qwen3.5-35B score 94-97%. The UD variant scored slightly lower (94%): the more aggressive dynamic quantization trades a small amount of instruction-following precision for file size. Still excellent for agentic use.
Claude-distilled models generally can’t do tool calling
GLM-4.7-flash and Qwen3-14B-Claude-Opus-Distill return errors (400 Bad Request) on every tool call. They were distilled to reproduce Claude’s output style, not the OpenAI tool-calling wire protocol. Qwen3-8B-Claude-Opus-Distill is the exception: it manages 30%, suggesting it partially learned the tool-call format, but it’s unreliable.
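The 400s are consistent with the wire-format mismatch. The OpenAI-compatible endpoint expects structured tool calls in a tool_calls field, while a Claude-distilled model tends to emit tool use as free text in the content field, which the parser rejects. A minimal sketch of the two shapes (the Claude-style payload is illustrative, not captured output):

```python
import json

# What an OpenAI-compatible server expects the model to produce
# (shape per the OpenAI chat-completions spec; values are examples):
openai_style = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "message",
            "arguments": json.dumps({"channel": "matrix",
                                     "text": "build complete"}),
        },
    }],
}

# What a Claude-distilled model tends to emit instead: tool use embedded
# as text in the content field, which the tool-call parser can't use.
claude_style = {
    "role": "assistant",
    "content": '<tool_use name="message">build complete</tool_use>',
}

print("tool_calls" in openai_style, "tool_calls" in claude_style)  # True False
```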
Qwen3.5-122B: fix your context size
Initially showed 0% — OOM killed every time. Root cause: the presets had ctx-size = 262144, which allocates a 6GB KV cache on top of the 70GB model weights. The machine has 128GB unified RAM but that combination was too much with other processes running.
Fix: drop ctx-size to 8192. Immediately worked — 97%, same score as the 35B variant and gpt-oss-20b. At this context size it loads in ~90 seconds and runs at around 25-30 tok/s. Capable model, just needs sensible config.
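The reason dropping ctx-size helps so much is that KV cache size scales linearly with context length. The exact figure depends on the model's layer count, GQA head count, head dimension, and cache dtype (the dimensions below are placeholders, not Qwen3.5-122B's real config), but the 32× reduction from 262144 to 8192 is the point:

```python
def kv_cache_bytes(ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int = 2) -> int:
    """K and V caches: 2 tensors x layers x tokens x kv_heads x head_dim."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elt

# Placeholder dimensions for illustration (f16 cache):
for ctx in (262_144, 8_192):
    gb = kv_cache_bytes(ctx, n_layers=48, n_kv_heads=8, head_dim=128) / 2**30
    print(f"ctx={ctx}: {gb:.1f} GiB")
# ctx=262144: 48.0 GiB
# ctx=8192: 1.5 GiB
```

Whatever the true per-token cost for this model, halving ctx-size halves the cache, so on a shared 128GB pool it's the first knob to turn when a big model OOMs.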
What this means for local agents
Best for reliable tool calling:
- devstral-small-2 — 100%, built for agents, 14 tok/s (slow but accurate)
- Devstral-Small-2-24B-Instruct — 100%, same accuracy, similar speed — but verify your chat template config
- gpt-oss-20b / Qwen3.5-35B / nemotron-nano — all 97%, fast, excellent general agents
- LFM2-Bartowski — 90%, use Bartowski’s quant specifically
- gpt-oss-120b / Kimi-48B — 87%, capable but not worth the size vs 20B
Avoid for agentic pipelines:
- qwen3-coder-30b (58%, inconsistent)
- Any Claude-distilled model unless you can tolerate 30% accuracy (Qwen3-8B only)
- Standard LFM2 quant (0% — use Bartowski’s version instead)
- qwen3-vl (vision model, not for tool calling)
The quantization lesson
The LFM2 result is worth emphasizing: if you’re downloading a model for agentic use and it fails completely on tool calling, check whether the quantization includes correct chat template metadata before blaming the architecture. Bartowski’s quants are generally more reliable for tool-calling use cases. When in doubt, use a reputable quantizer.
Hardware: Oscar, Beelink GTR9 Pro (Ryzen AI Max+ 395, 128GB unified RAM). Backend: llama-server router, Vulkan GPU. Benchmark: custom Python harness, 15 tests, 71 pts max. Full benchmark run: 2026-03-05.