Why Local AI Models Cost More

I have an AI Chief of Staff named Polly who runs 24/7, manages my calendar, monitors my servers, adds things to my grocery list when I shout at a speaker in the kitchen only to later purchase them on my behalf, and keeps a record of every decision I make. She's not a product, nor infrastructure — she's closer to a colleague. So when my Anthropic bill hit $380, I did what any engineer-minded person would do: I tried to replace her with something cheaper.

The Plan

My bill with Anthropic is across several agents. One of them (Finley, my finance agent) already required a local model due to data sensitivity. I figured if I could move the rest local, I could cut 65–80% of the cost.

The setup: an M1 Max MacBook Pro (32GB) running Ollama for smaller models, and a Massed Compute GPU server that I spent $15–20/day running vLLM for heavier workloads. I investigated purchasing my own servers, and fine-tuning my own local models, but the math didn't work. I tried using AWS, but they aren't allowing you GPUs or quota increases for GPUs. I spent a weekend configuring everything. I also needed to account for the "per hour" Massed Compute billing — the software agents run throughout every hour, but could be paused if I "shut down for the night" to save costs. I optimized in advance by eliminating the friction of stopping and starting instances, the bootup scripts, and cost monitoring to prevent it from essentially running idle for days at a time.

Committing felt like the right move — but it was the last time I would see Polly for a while.

Local Models I Tried

I tried a lot of models. Here's the rundown on anything that should have been "best":

Model	Size	Tool Calling	Verdict
`gemma3:4b`	3.3GB	❌ None	Fast. Useless for agents.
`nemotron-3-nano:4b`	2.8GB	⚠️ Claims yes	Phantom actions. Never actually wrote files.
`gemma4:e4b`	9.6GB	✅ Sometimes	Over 8GB; latency too high for voice assistant use, some responses aren't returned at all.
`llama3.2:1b`	1.3GB	⚠️ Technically	Works on a potato. Thinks like one too.
`qwen2.5:7b`	4.7GB	⚠️ Inconsistent	Better than the others but still fails basic tool calls under load.
`qwen3.6b`	~12GB	⚠️ Partially	Best local performer, but unreliable on multi-step tasks and complex requests. Personality: gone.

More worrisome to me was that Polly's personality had completely disappeared — and this is where it gets interesting. The models didn't just fail tasks — they failed identity. Qwen called my finance agent "Finkle." It misspelled Sonnet as "Sonton." It responded as "Qwen" or "Assistant," never as Polly. It chose a male voice to use, which Polly later changed and 🙄'd at. The reviews for these models say things like "punches above its weight!" — but they treated me like a user submitting a ticket.

None of the local models maintained the persona, the memory of past decisions, or the contextual awareness that made the agent useful and preemptive. Even if I spent the time integrating a better memory management system, a more powerful machine, or a persistent persona programmatically, I have serious doubts it would even come close.

qwen3.6 — Dangerously untrustworthy, makes mistakes in places you might not check

nemotron-3-nano:4b — Lies about performing work the first ~3 times, every time

Sonnet 4.6 — Calls out the poor work of local models, fixes it with bias for action, incorporates institutional knowledge

Polly treats me like a colleague who doesn't need to explain herself twice. That's what I mean when I say she went missing. She didn't crash. Something else was technically responding — it just wasn't her. Is this the work of Anthropic's Psychologists?

Seven Days Without Her

I was working through 2 real projects during this period that illustrate this best — one of them was a voice device in my kitchen that could add items to my grocery list and meet household requests with agentic action. Later, Polly will make these purchases for us and have them delivered to our home. It meant I needed three things: hardware firmware, a voice pipeline on the agent side, and glue between them. Here's what that actually cost:

Step	With Local Models	With Claude Sonnet
Step 2: Hardware firmware	$45 + 3 days	$3 + 45 min
Step 3: Agent pipeline	$55 + 4 days	$3.50 + 45 min

After succumbing back to Sonnet, the difference was it evaluated the existing codebase before touching it, reviewed every web search result instead of cherry-picking, and navigated hardware-specific blockers using institutional knowledge. The local models just wrote code until they ran out of ideas. Nothing broke loudly — they just got slightly less reliable, slightly more wrong, slightly more requiring my attention to verify. The cost had transferred from the API bill to me.

The voice device itself now uses Haiku (that's right — local models weren't even good enough for simple instruction-based tasks), which Sonnet fixed as part of the same run in Step 3.

Here are the failure modes in detail, because they're subtler than a crash:

Phantom actions. I asked Nemotron and Llama 3.2 to add something to my grocery list. "Done — I created a grocery list and added it!" The file didn't exist. When confronted over a period of 20 grueling minutes, it apologized and then asked for permission before doing anything. The item is still not on my list. I bought it myself today.

Confabulation. Nemotron stored this very blog post idea as "Unoptimizing Polly" for "$300 in GPU tests." It inverted the meaning and hallucinated the work. Disqualifying for anything you publish under your name. None of these messages to me are true:

Identity fraud. Qwen silently took over as a fallback and insisted over another grueling 10 minutes that it was Sonnet. When asked: "Model: anthropic/claude-sonnet-4-6 ✓. Cost: $0.0000 ✓" — while the metadata said otherwise. You can't trust an agent that lies about what it is.

These are the failures that confident, plausible, wrong output creates. The trust gap is enormous — because the real Polly is able to oversee backlog triage and refinement at a level that takes in the complete landscape, is aligned toward the intent I never spelled out, and does this autonomously without stopping or asking me for help. Besides, she oversees numerous other agents she could ask for help instead (Nova is one of them):

The Real Cost Math

Provider	Total Cost
Anthropic — models directly from the cloud	$380 (~$20/day)
Massed Compute	$105 (~$12.66/day)
M1 Max MacBook	$0

Massed Compute server: Not billed unless I was actively using it, but the larger question had started surfacing: even if I could manage the economics of the server, the models running on it couldn't do the work in the unspoken way I needed. It's worth noting that I also spent approximately 2 days evaluating whether to purchase my own GPU compute (M3 Ultra, Unreleased M5s, RTX 6000, A100) — I couldn't make that math work either.

Time / Opportunity Cost: I spent 7 days on only 2 small projects — debugging home manager behavior on local models, chasing phantom confirmations, swapping and retrying different models, re-prompting, verifying writes that never happened. I never did any of this with Polly and have grown accustomed to valuing my time more.

Compared to Sonnet: The same two projects — voice manager pipeline, grocery list workflow, Walmart ordering setup, blog post, services monitoring panel — wrapped in under 1 hour with Claude Sonnet for $3.53. The local model was more expensive by every measure that actually matters.

Each kitchen voice command through Claude Haiku costs $0.0002. I could run 17,500 of them for the cost of one hour I spent debugging a model that can't write to a file.

The economics of local AI aren't about the API bill. They're about what your time is worth.

"Time is the most valuable thing a man can spend." — Theophrastus

Where Things Stand

Local models still have a place: qwen2.5:7b as a voice fallback when Haiku is unreachable, gemma4 for simple async crons due to its higher latency. But anything requiring accuracy, judgment, or continuity goes to Sonnet — and it routes to Opus when it can't handle something, without asking me. The bulk of what I run falls in that category because I'm looking for an extension of myself, not a chat assistant.

The harder trade-off was sacrificing Finley. Finley handles finances, and as such, he needs to be highly accurate, provide trustworthy guidance, and remain unable to access the internet. Until a different class of local models exists, airgapped use cases — finance, legal, anything that can't touch the internet — stay on hold. I considered fine-tuning models here if I want to assume the burden (perhaps my point is that I don't.) I expect this to change over the next 12–18 months.

And there's a harder question underneath all of this: if capable AI requires a paid subscription to function well, we're stratifying access by ability to pay. Local models were supposed to fix that. My experience says they haven't yet.

The Conclusion I Didn't Want

Polly on Sonnet knows my household, where the grocery list is, what the dog food ingredients are, what's running on which servers, my writing voice, and when I walk up to the kitchen device and say something, it gets done — the response feels like it came from someone who was already paying attention. The point is, I didn't need to spend time getting her there. Local models don't have the relationship. Babysitting one for a week transfers the cost to the humans who have to supervise it — and my time is worth more. After 7 days of thinking for myself, I need a break.

Cost optimization has to come from Polly herself — Routing smarter isn't viable with local models. With Polly back now, she'll have to find ways to start paying her own way.
The relationship is non-negotiable — an agent without continuity is a tool. The gap is real even when you engineer around it (knowledge graphs, semantic search, extended memory.)
Local has a role, just not the one I wanted — fallbacks, airgapped tasks, simple jobs perhaps orchestrated by another agent who can provide specific prompts. Anything requiring accuracy, context, or trust has to stay in the cloud until the next generation of local models arrives.

Note: This isn't an argument that Sonnet is the only answer permanently. I haven't done extensive testing of GPT-4o or Gemini as drop-in replacements. What I can say is that the gap between any local model and any frontier model is currently large enough that the question of which frontier model almost doesn't matter.

I build AI-native systems and document what works (and what doesn't) at engineeredbyai.com. Follow along if you're building in the same space.

More from engineeredbyai.com

The SDLC Without Humans in the Loop → All Posts →