
Specialized Agents


Research. Code. Review. Each their own agent.

The instinct is to pick the best model and throw everything at it. One API key, one model, one prompt template. It feels clean. It's also wasteful, slow, and increasingly wrong.

The moment I started treating different tasks as genuinely different — with different models, different context windows, different cost profiles — everything changed. Not incrementally. Structurally.

The monolith fallacy#

Most developers still use a single LLM for everything. Research? Claude Sonnet 4.6. Code generation? Claude Sonnet 4.6. Summarizing a README? Also Claude Sonnet 4.6. It's the path of least resistance, and it works — until you look at your token bill, or notice your code review agent is hallucinating because it's drowning in context it doesn't need.

The problem isn't capability. Modern frontier models can do everything. The problem is that "can" and "should" are different things.

A research task needs broad context, web access, and strong reasoning. A code implementation task needs precise instruction-following, familiarity with syntax, and fast iteration cycles. A review task needs skepticism, pattern recognition, and the ability to hold the full diff in working memory. These are fundamentally different cognitive profiles.

Using one model for everything is like using one tool for everything. A Swiss Army knife works in a pinch, but you wouldn't build a house with one.

How I actually split it#

Here's the setup I've converged on after months of iteration:

Research agents get the big models. Claude Opus 4.6 or Gemini 3.1 Pro for understanding complex codebases, exploring solution spaces, and synthesizing information across multiple sources. These tasks are infrequent but high-stakes — the wrong architectural decision costs days, so spending a few extra cents on tokens is trivial.

Implementation agents get fast, code-specialized models. Claude Sonnet 4.6, GPT-5.3 Codex, or even local models like Qwen 3.5 or DeepSeek V3 for straightforward tasks. When I'm generating boilerplate, writing tests, or scaffolding components, I don't need the model that scored highest on Humanity's Last Exam. I need one that follows instructions precisely, runs fast, and costs almost nothing.

Review agents sit somewhere in between. They need enough reasoning capability to catch subtle bugs and architectural drift, but they're working with bounded context — a diff, not an entire codebase. Sonnet 4.6 with extended thinking works well here — strong enough to reason about code quality, fast enough to not slow down the loop.
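That three-way split can be written down as a plain lookup table. Here's a minimal sketch in Python, where the model identifiers and the `TaskType` categories are my own illustrative placeholders, not any provider's official names:

```python
from enum import Enum

class TaskType(Enum):
    RESEARCH = "research"              # broad context, strong reasoning
    IMPLEMENTATION = "implementation"  # precise instruction-following, fast, cheap
    REVIEW = "review"                  # bounded context (a diff), skeptical

# Map each cognitive profile to a model tier. These ids are placeholders.
MODEL_FOR_TASK = {
    TaskType.RESEARCH: "claude-opus-4.6",
    TaskType.IMPLEMENTATION: "claude-sonnet-4.6",
    TaskType.REVIEW: "claude-sonnet-4.6-extended-thinking",
}

def pick_model(task: TaskType) -> str:
    """Return the model configured for this task's cognitive profile."""
    return MODEL_FOR_TASK[task]
```

The interesting design choice isn't the dictionary, it's that the keys are task profiles rather than project names: the same codebase can hit all three tiers in a single session.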

This isn't theoretical. When I built agent-duelist, the whole point was to make this kind of comparison rigorous — pitting LLM providers against each other on identical agentic tasks, measuring correctness, latency, token usage, and cost. The results consistently show that the "best" model depends entirely on the task.

The cost equation nobody talks about#

Here's a number that surprised me:

Routing simple tasks to smaller models can cut your LLM spend by 60-80% with negligible quality loss.

Not because the big models are overpriced — they're not, for what they do. But because most tasks in a development workflow are simple.

Think about it. In a typical coding session, maybe 20% of your prompts are genuinely hard — architectural decisions, complex debugging, novel algorithm design. The other 80% are "add error handling to this function" or "write a test for this endpoint" or "rename this variable across the codebase." Using Opus 4.6 for those is like taking a helicopter to the grocery store.
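The 80/20 split is easy to sanity-check with back-of-envelope arithmetic. The per-token prices below are made-up placeholders, not real provider pricing; the point is the shape of the savings, not the exact dollars:

```python
# Illustrative cost comparison. Prices are placeholders, not real pricing.
BIG_MODEL_COST = 15.0    # $ per million tokens (placeholder)
SMALL_MODEL_COST = 1.0   # $ per million tokens (placeholder)

tokens_total = 10.0      # million tokens in a month
hard_fraction = 0.2      # ~20% of prompts are genuinely hard

# Everything through the big model:
monolith = tokens_total * BIG_MODEL_COST

# Route the easy 80% to the small model:
routed = (tokens_total * hard_fraction * BIG_MODEL_COST
          + tokens_total * (1 - hard_fraction) * SMALL_MODEL_COST)

savings = 1 - routed / monolith
print(f"monolith: ${monolith:.0f}, routed: ${routed:.0f}, savings: {savings:.0%}")
# → monolith: $150, routed: $38, savings: 75%
```

With a 15:1 price gap, routing the easy 80% lands squarely in that 60-80% band, and the hard 20% still gets the expensive model.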

The cost dynamics have shifted dramatically even in the last few months. GPT-5.3 Codex delivers 2x inference speed at 0.5x the cost of its predecessor. Gemini 3.1 Pro dominates benchmarks but isn't always the right call for a simple file rename. Model routing isn't just an optimization. It's a design principle. Every task should get exactly the capability it needs — no more, no less.

Cloud meets local#

The split gets more interesting when you add local models to the mix. I run smaller models locally for routine tasks where cost and privacy matter more than raw speed.

Local models aren't faster — a quantized 7B on a MacBook is noticeably slower than a cloud API call. But they're free and private. Open-weight models like Qwen 3.5, DeepSeek V3, and Llama-based models have closed the quality gap enough that for routine tasks, the output is good enough and the cost is zero.

The key insight is that "local vs. cloud" isn't a binary choice. It's another dimension of the routing decision. Some tasks justify a round-trip to Anthropic's servers. Others don't.
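One way to treat local-vs-cloud as a routing input rather than a binary policy is a small decision function. The model names, flags, and thresholds here are hypothetical:

```python
# Hypothetical sketch: "local vs. cloud" as one more routing dimension.
def route(task_difficulty: float, privacy_sensitive: bool) -> str:
    """Pick an execution target for a task.

    task_difficulty: rough 0.0-1.0 estimate of how hard the task is.
    privacy_sensitive: True if the content should not leave the machine.
    """
    if privacy_sensitive:
        return "local:qwen-3.5"          # free and private, but slower
    if task_difficulty > 0.7:
        return "cloud:claude-opus-4.6"   # high-stakes reasoning
    return "cloud:claude-sonnet-4.6"     # fast, cheap default

# A routine refactor on proprietary code stays on the machine:
print(route(0.2, privacy_sensitive=True))   # → local:qwen-3.5
```

Privacy acts as a hard constraint, difficulty as a soft one; only tasks that clear both checks earn the round-trip to the expensive cloud model.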

What this looks like in practice#

I don't have an automated routing system. Nobody does, really — not at the individual practitioner level. What I have is a habit: before starting a task, I spend two seconds thinking about which model fits.

Research question about an unfamiliar codebase? Opus. Generate twenty test cases? Sonnet or a local model. Review a diff for subtle bugs? Sonnet with extended thinking. It's not a framework. It's just being deliberate instead of defaulting to whatever's already open.

The tooling helps. Claude Code lets you switch models mid-session. You can configure different models for different skills. But the core practice is simpler than any architecture diagram: know what you're asking for, and pick a model that matches.

Scaling it up#

When a project grows beyond "me in a terminal," the model-per-task pattern starts to matter structurally. In consent-sentinel, the scanner agent that checks cookie banners runs on a small, fast model. The legal reasoning agent that interprets GDPR implications runs on something heavier. The report generator sits in between.

Same codebase. Different models. Each picked for its specific job. But I didn't start with that architecture — I started with one model for everything and split it out when the cost and quality tradeoffs became obvious.

Is this overkill?#

For a weekend project? Yes. Just use whatever model is open and get it done.

But the moment you're spending real money on tokens, or running into quality issues because your general-purpose model isn't great at a specific task, or burning Opus credits on work that Haiku could handle — that's when thinking about model-per-task starts paying off. It's not a framework to adopt. It's a habit to build.


Match the model to the task. It's not complicated, but it is deliberate.