Local Agents Are Not Toys

[Image: a small glowing brass engine sits on a workbench beside a pressure gauge in the healthy range, dwarfed by a row of larger dormant machines in the background, steam rising. Title text: Local Agents Are Not Toys. Site identifier: MATGRETEN.DEV.]

Local LLMs have crossed from toy to useful for me — but only inside the right box.

That is the finding. Not that an M5 Max replaces Claude or Amp or the other cloud models. It does not. But for bounded agent work, especially with OpenCode pointed at Ollama over Tailscale, it is good enough that I now think about where it fits instead of whether it works at all.

The benchmark I have is simple and a little artificial: five small OpenCode tasks (counting, JSON output, a bash command, a YAML edit, and tool-call-shaped JSON) with about 30 output tokens each. These are warm numbers. Cold start is still real. The first inference after loading a model can feel roughly seven times slower, which matters if your workflow keeps swapping models.
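One way to keep warm and cold runs apart is to read the timing fields Ollama returns with a non-streaming `/api/generate` response: `eval_count`, `eval_duration`, and `load_duration`, with durations in nanoseconds. The sketch below is only an illustration of that idea. It is not necessarily how the numbers in this post were gathered, and the helper names are made up for the example.

```python
import json
import urllib.request
from typing import Optional

NANOSECONDS_PER_SECOND = 1_000_000_000


def decode_tokens_per_second(response: dict) -> Optional[float]:
    """Output tokens per second from an Ollama /api/generate response body."""
    tokens = response.get("eval_count")
    duration_ns = response.get("eval_duration")
    if not tokens or not duration_ns:
        return None  # Missing or zero values: no rate can be computed.
    return tokens * NANOSECONDS_PER_SECOND / duration_ns


def load_seconds(response: dict) -> float:
    """Seconds spent loading the model; expected to be near zero when warm."""
    return response.get("load_duration", 0) / NANOSECONDS_PER_SECOND


def generate(endpoint: str, model: str, prompt: str) -> dict:
    """Run one non-streaming generation against an Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{endpoint}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())
```

Calling `generate` twice in a row with the same model and comparing `load_seconds` between the two responses gives a rough view of how much of the first call was model loading rather than decoding.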

Warm, though, the numbers surprised me.

Model	Warm tok/s	Pass Rate	Notes
gemma4:e2b-mlx-bf16	~69	100%	Fastest small model I tested
qwen3.5:4b-mlx-bf16	~49	100%	Small, usable, has thinking mode
qwen3.5:32b-mlx-bf16	~63	~80%	Fast for the size, but not reliable enough
gemma4:e4b-mlx-bf16	~35	90%	Slower than e2b without enough upside
qwen3.6:35b-a3b-coding-nvfp4	~106	100%	The current standout
qwen3.6:35b-a3b-nvfp4	~103	100%	About the same speed, less coding-specific
gemma4:26b-mlx-bf16	~50	100%	Dense, slower, still worth keeping around

The weird result is that the 35B Qwen 3.6 MoE models beat the smaller models on warm runs. That feels wrong until you remember the shape of the model: it is a mixture-of-experts design, so it is not activating all 35B parameters for each token (the a3b in the tag points to roughly 3B active). On this machine, with 128GB unified memory and a lot of memory bandwidth, the active-path cost matters more than the headline parameter count.

So yes, the 35B model can be faster than the 4B model. That still feels like cheating.
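A back-of-envelope way to see why: if decoding is limited by memory bandwidth, the ceiling on tokens per second is roughly bandwidth divided by the bytes of weights read per token. The sketch below makes several loud assumptions: that "a3b" means about 3B active parameters, that nvfp4 costs about half a byte per parameter (ignoring scale factors), that bf16 costs two bytes, and that the bandwidth figure is a placeholder rather than a spec for this machine. It also ignores the KV cache, expert routing, and compute limits.

```python
def decode_ceiling_tok_s(
    active_params_billions: float,
    bytes_per_param: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Placeholder bandwidth, NOT a measured or published figure for the M5 Max.
ASSUMED_BANDWIDTH_GB_S = 500.0

# Dense 4B model in bf16: all 4B parameters at 2 bytes each.
dense_4b_bf16 = decode_ceiling_tok_s(4.0, 2.0, ASSUMED_BANDWIDTH_GB_S)

# MoE model with ~3B active parameters at ~0.5 bytes each (assumed).
moe_3b_active_4bit = decode_ceiling_tok_s(3.0, 0.5, ASSUMED_BANDWIDTH_GB_S)

print(f"dense 4B bf16 ceiling:     {dense_4b_bf16:.1f} tok/s")
print(f"MoE ~3B active 4-bit:      {moe_3b_active_4bit:.1f} tok/s")
```

Whatever the real bandwidth is, the ratio between the two ceilings does not depend on it: under these assumptions the MoE model reads far fewer bytes per token than the dense 4B model in bf16, which is consistent with the direction of the table even if the absolute numbers are off.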

There is a catch, and it is a real one: MLX is not the best fit for every agent workload. The IndyDevDan thread had a useful comment from Craig Opie about sticking with GGUF because he wants several tool calls in flight at the same time. That tradeoff makes sense. If your agent spends its life fanning out parallel tools, raw single-stream token speed is not the whole game.

My use case is narrower. Hermes and ADW do not need the local model to be the whole brain. They need a cheap local worker that can handle bounded steps: summarize this, classify that, propose the next move, maybe do a small edit. For that, qwen3.6:35b-a3b-coding-nvfp4 looks like the right candidate.

And I should be clear about the current shape of my ADW runs. I am still mostly ideating from inside Claude Code with Sonnet 4.6, then letting Claude Code steer the ADW run. That is closer to the problem I wrote about in The Theory-Building Problem than a fully autonomous loop. Claude is still helping me form the plan and keep the theory of the change in my head.

The local qwen test is a different question. I have seen it do dumb things — at one point Hermes overthought a pronoun and started repeating “her” back into the schedule task instead of making progress. That is not great. But for an agentic pipeline that starts a new thread for each step and has adversarial reviews, feedback loops, post-story reviews, and patch attempts, I am not seeing the same existential crisis in the outputs.

I checked the swamp side of this too, because that is where I would like this to become more than a one-off benchmark. In agentic-tooling I have a cli-agent model (@mgreten/cli-agent) that records provider, model, duration, token counts, cost, tags, and a preview of the output for each invocation. That is exactly the right place for real ADW/Hermes numbers to live. It is the same reason visibility into the pipeline has become such a theme for me.

After checking the data, I extended cli-agent itself to record outputTokensPerSecond whenever the provider gives us output token counts. That is a small thing, but it matters. I do not want to keep guessing from vibes and stopwatch math. I want the pipeline to tell me, over time, whether local qwen is actually earning its spot.
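The calculation itself is small. Here is a rough sketch of the idea, not the actual cli-agent code: the `Invocation` shape and the snake_case field names are hypothetical, and only the `outputTokensPerSecond` concept comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Invocation:
    """Hypothetical subset of one recorded agent invocation."""
    provider: str
    model: str
    duration_ms: float
    output_tokens: Optional[int] = None  # None when the provider reports nothing


def output_tokens_per_second(inv: Invocation) -> Optional[float]:
    """Output tokens divided by wall-clock seconds, or None if not computable."""
    if inv.output_tokens is None or inv.duration_ms <= 0:
        return None
    return inv.output_tokens / (inv.duration_ms / 1000.0)
```

One caveat with this shape: if the duration is wall-clock time for the whole invocation, it includes prompt processing and tool overhead, so the result will read lower than a decode-only benchmark number for the same model.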

But here is the annoying honest part: the saved cli-agent data I found locally does not yet have the qwen/ollama run history I wanted. The model is there. The schema is there. The invocations are there. The local-first profile is there too:

allowed_providers:
  - opencode
denied_providers:
  - claude
  - codex
  - amp
local_models:
  ollama:
    endpoint: "http://localhost:11434"
    required_models:
      - "qwen3.6:35b-a3b-coding-nvfp4"
    max_context_tokens: 131072

What is missing is the actual run history for qwen through that path. So the table above is still benchmark data, not ADW production data. That distinction matters. The difference now is that the next run should leave better evidence behind.
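A profile like the one above could be checked before a run with something like the following sketch. The profile is mirrored as a Python dict to avoid a YAML dependency. The rule that a deny entry beats an allow entry is an assumption for illustration, not documented behavior, and the helper names are hypothetical. `GET /api/tags` is Ollama's endpoint for listing locally available models.

```python
import json
import urllib.request

PROFILE = {
    "allowed_providers": ["opencode"],
    "denied_providers": ["claude", "codex", "amp"],
    "local_models": {
        "ollama": {
            "endpoint": "http://localhost:11434",
            "required_models": ["qwen3.6:35b-a3b-coding-nvfp4"],
            "max_context_tokens": 131072,
        }
    },
}


def provider_allowed(profile: dict, provider: str) -> bool:
    """Assumed semantics: deny wins, and anything not listed is rejected."""
    if provider in profile.get("denied_providers", []):
        return False
    return provider in profile.get("allowed_providers", [])


def missing_models(profile: dict, installed: list) -> list:
    """Required Ollama models that are not in the installed list."""
    required = profile["local_models"]["ollama"]["required_models"]
    return [name for name in required if name not in installed]


def installed_models(endpoint: str) -> list:
    """Names of models the Ollama server at `endpoint` reports as available."""
    with urllib.request.urlopen(f"{endpoint}/api/tags") as reply:
        data = json.loads(reply.read())
    return [model["name"] for model in data.get("models", [])]
```

A preflight step could call `missing_models(PROFILE, installed_models(endpoint))` and refuse to start when the list is not empty, which would at least rule out a silent fallback to a different model.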

One more honest caveat: the token-per-second numbers are impressive, but the work does not always feel that fast. I suspect that is partly because the tasks I am handing it are bigger than I realize, partly because real agent work means lots of tool calls, and partly because models and caches may be moving in and out of memory more than the clean benchmark suggests. Hermes tool calls in particular feel slow right now. That may be a Docker container memory or CPU issue on Roccinante. I need to look into it.

The next things I want to look into are less glamorous than model leaderboards: whether Hermes memory operations are accidentally using the big 35B model, whether memory sync is happening too often, and how much of the latency is just tool-call round trips over Tailscale plus prefill on prompts that can run up to 128K tokens.
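To make that last question concrete, a turn can be modeled as a sum over steps of round-trip time, prefill time, and decode time. This is a deliberately crude sketch: it ignores prompt caching and model swaps, and every number in the example is a placeholder, not a measurement from this setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One model call inside an agent turn."""
    prompt_tokens: int
    output_tokens: int


def step_seconds(step: Step, prefill_tok_s: float, decode_tok_s: float,
                 round_trip_s: float) -> float:
    """Latency of one step: network round trip + prefill + decode."""
    prefill = step.prompt_tokens / prefill_tok_s
    decode = step.output_tokens / decode_tok_s
    return round_trip_s + prefill + decode


def turn_seconds(steps: List[Step], prefill_tok_s: float, decode_tok_s: float,
                 round_trip_s: float) -> float:
    """Total latency of a turn made of sequential steps."""
    return sum(step_seconds(s, prefill_tok_s, decode_tok_s, round_trip_s)
               for s in steps)


def effective_tok_s(steps: List[Step], prefill_tok_s: float, decode_tok_s: float,
                    round_trip_s: float) -> float:
    """Output tokens per wall-clock second across the whole turn."""
    total_output = sum(s.output_tokens for s in steps)
    return total_output / turn_seconds(steps, prefill_tok_s, decode_tok_s,
                                       round_trip_s)


# Placeholder values for illustration only.
example_steps = [Step(prompt_tokens=8000, output_tokens=60),
                 Step(prompt_tokens=9000, output_tokens=40)]
print(effective_tok_s(example_steps, prefill_tok_s=1000.0,
                      decode_tok_s=100.0, round_trip_s=0.2))
```

The point of the model is only that short outputs behind long prompts make the effective rate fall well below the decode rate, so the fix may be smaller prompts or fewer calls rather than a faster model.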

The next measurement is obvious: keep running the Moment Savor/Hermes/ADW pipeline on Roccinante with the LLM served from this M5 Max over Tailscale, then let cli-agent record the real durations and token counts. If qwen can stay in the ~100 tok/s neighborhood while doing real pipeline work, it becomes very interesting. If it falls over once context and tool overhead enter the picture, that is useful too. This is also why I keep coming back to Swamp: the useful part is not the clever model call, it is the typed trail of what happened.
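Once real runs exist, the check against the "~100 tok/s neighborhood" could be a simple aggregation over recorded invocations. The record shape below is hypothetical apart from the `outputTokensPerSecond` name, and the median is a design choice here so that one cold start does not dominate the result.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


def median_tok_s_by_model(records: List[dict]) -> Dict[str, float]:
    """Median outputTokensPerSecond per model, skipping records without a rate."""
    rates_by_model = defaultdict(list)
    for record in records:
        rate = record.get("outputTokensPerSecond")
        if rate is not None:
            rates_by_model[record["model"]].append(rate)
    return {model: median(rates) for model, rates in rates_by_model.items()}
```

Feeding it a list of dicts with `model` and `outputTokensPerSecond` keys returns one number per model, which is enough to see whether pipeline runs land near the benchmark figures or well below them.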

My current bet is that the fit is something like this:

local qwen for bounded classify/summarize/plan steps
cloud models for long-context repo reasoning and high-stakes implementation
treat the 128K context window as room to breathe, not permission to shovel in the whole world
do not pretend local is free just because the API bill is zero

The context point is not a throwaway line. I had old notes in my head about an 8K sweet spot, but that is not my current setup anymore.

The primary qwen model is baked with num_ctx 131072. Ollama's server default is 32K, but this model is pinned at 128K by the Modelfile, and OpenCode's own model config has limit.context: 131072 so its accounting matches the real window. I have also verified behavior well above 32K. So the current constraint is less “can it fit?” and more “does the answer stay good once I hand it that much context?”
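A mismatch between the model's pinned window and the client's accounting is easy to introduce and hard to notice, so a small consistency check can help. The sketch below assumes the `parameters` field returned by Ollama's `/api/show` is plain text with one whitespace-separated `name value` pair per line, and the function names are hypothetical.

```python
from typing import Optional


def num_ctx_from_parameters(parameters: str) -> Optional[int]:
    """Extract num_ctx from an Ollama-style parameters text block, if present."""
    for line in parameters.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "num_ctx":
            return int(parts[1])
    return None  # No num_ctx line: the server default would apply.


def context_limits_agree(parameters: str, client_limit: int) -> bool:
    """True when the pinned num_ctx matches the client's configured limit."""
    return num_ctx_from_parameters(parameters) == client_limit
```

Here `client_limit` stands for the value set as `limit.context` in the OpenCode model config. If the two disagree, the client may either truncate earlier than needed or send more than the model will actually keep.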

This changes how I think about local agent work. It is not “run the whole repo through a local model.” It is “keep the local model on a short leash and give it jobs with hard edges.” That is less exciting, but it is much more useful — and it matches what I have been seeing in the homelab too, where agents work best when the task is bounded and reviewable.
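The "short leash" idea can be written down as a routing rule. This is a minimal sketch of the split described above, not the ADW or Hermes implementation: the task kinds, the `Task` shape, and the context budget are all assumptions picked for illustration.

```python
from dataclasses import dataclass

# Task kinds treated as bounded enough for the local model (assumed list).
LOCAL_TASK_KINDS = {"classify", "summarize", "plan"}

# Hypothetical budget, set well below the 128K window on purpose.
LOCAL_CONTEXT_BUDGET = 32_000


@dataclass
class Task:
    kind: str
    context_tokens: int
    high_stakes: bool = False


def route(task: Task) -> str:
    """Return "local" for bounded, low-stakes work and "cloud" otherwise."""
    if task.high_stakes:
        return "cloud"
    if task.kind not in LOCAL_TASK_KINDS:
        return "cloud"
    if task.context_tokens > LOCAL_CONTEXT_BUDGET:
        return "cloud"
    return "local"
```

Keeping the budget below the model's real window is the design choice that matters: the window says what fits, while the budget is a guess at where answers stay good, and that guess should move as run data comes in.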

For now, qwen3.6:35b-a3b-coding-nvfp4 is already running on the M5 Max through Ollama and OpenCode. The general qwen3.6:35b-a3b-nvfp4 is installed too. Gemma is still around for some summarization tasks, but I am mostly using qwen locally right now.

And yes — plug the laptop in. The M5 Max is efficient. It is not magic.

The punchline is not that local won. The punchline is that local has become specific enough to be useful. I now have a concrete place to put it in the pipeline, a swamp model that can record whether it actually helps, and a model that is fast enough to make the experiment worth running.

Update: after looking into the Hermes latency more, the practical ceiling may be closer to 40 tokens per second for the local 35B model in real tool-heavy turns. That changes the feel of the thing. Forty tokens per second spread across several tool calls is still useful, but it is not going to feel like a cloud API when every step needs another round trip. The most impactful next option may be using a faster, lighter local model for Hermes itself, accepting some quality loss in exchange for better tool-call latency.

That is a pretty big change from “local LLMs are neat.”