
Delegate & Verify

7 min read

Agents do the work. Judgment stays human.

There's a seductive idea floating around the AI discourse: that the endgame is full autonomy. Agents that do everything. Humans that do nothing. Ship it and go to the beach.

I don't buy it. Not because the models aren't capable enough — they are, increasingly. But because the interesting part of building software was never the typing. It was the judgment. And judgment is the one thing you should never fully delegate.

The delegation spectrum

Not all delegation is equal. There's a spectrum, and where you sit on it determines whether AI makes you faster or just makes you wrong faster.

On one end: no delegation. You write every line yourself, and the AI is basically a fancy autocomplete. Safe, slow, and a waste of what these models can actually do.

On the other end: full delegation. You describe what you want in natural language, walk away, and hope for the best. This is "vibe coding" in its purest form — and it works surprisingly well for prototypes. It also produces code that nobody understands, nobody reviewed, and nobody wants to debug at 2 AM.

The sweet spot — the place where I've found the most leverage — is in between. Delegate the execution. Retain the verification. Let agents do the work. Then actually look at what they did.

What "verify" actually means

Verification isn't glancing at a diff and hitting approve. That's rubber-stamping, and it's worse than no review at all because it creates the illusion of oversight.

Real verification means:

Understanding the approach before evaluating the code. When I ask Claude Code to implement a feature, I don't start by reading line 1 of the diff. I start by asking why it made the choices it made. What alternatives did it consider? What tradeoffs is it accepting? If it can't articulate that — or if the articulation doesn't match what I'd expect — that's a signal.

Testing the boundaries, not just the happy path. Agents are optimistic by nature. They generate code that works for the intended use case. They rarely think about what happens when the input is null, the network is down, or the user does something absurd. Verification means asking: "What breaks this?"

Checking for drift. Over a long session, agents can gradually shift away from your architectural patterns. They start introducing a slightly different error-handling style, or a new dependency you didn't ask for, or a structural pattern that doesn't match the rest of the codebase. Each change is small. The cumulative effect is chaos. My CLAUDE.md file exists precisely for this — it encodes the patterns and constraints that keep the agent aligned across dozens of interactions.
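To make that concrete, here's a hypothetical CLAUDE.md fragment. The specific conventions are illustrative (the `src/stores/` path is invented), but they mirror the drift examples in this post — error-handling style, dependency choices, state patterns:

```markdown
# Project conventions

- Error handling: return result objects from service functions; do not
  throw across module boundaries.
- Dates: use date-fns. Do not add moment.js or any other date library.
- State: one store per feature, following the existing pattern in src/stores/.
- Dependencies: never add a new package without asking first.
```

The point isn't the specific rules; it's that they're written down once and enforced on every interaction, instead of re-litigated per prompt.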

The loop > the prompt

Here's what I've learned from building projects like x-molt and the artifacts on this site almost entirely through conversation with Claude Code: the quality of the output isn't determined by any single prompt. It's determined by the quality of the loop.

A loop looks like this:

  1. Specify — describe what you want at the right level of abstraction (not too vague, not line-by-line)
  2. Delegate — let the agent implement it
  3. Verify — review the output against your intent, your constraints, and your standards
  4. Iterate — feed back what needs to change and go again
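The four steps above can be sketched as an orchestration function. This is a hypothetical skeleton, not a real API — `runAgent` stands in for whatever tool does the implementation, and `verify` is the human (or human-driven) review step:

```typescript
// Sketch of the specify → delegate → verify → iterate loop.
// All names here are illustrative; no real agent API is assumed.
type Review = { approved: boolean; feedback: string };

async function delegateAndVerify(
  spec: string,
  runAgent: (prompt: string) => Promise<string>,
  verify: (output: string) => Review,
  maxRounds = 3,
): Promise<string> {
  let prompt = spec; // 1. Specify
  let output = "";
  for (let round = 0; round < maxRounds; round++) {
    output = await runAgent(prompt); // 2. Delegate
    const review = verify(output); // 3. Verify — the judgment step
    if (review.approved) return output;
    // 4. Iterate — feed concrete feedback into the next round
    prompt = `${spec}\n\nRevise: ${review.feedback}`;
  }
  return output; // out of rounds; time for a human to take over
}
```

Note that the loop terminates on human approval, not on the agent declaring itself done — that's the whole point.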

The magic isn't in step 1 or step 2. It's in steps 3 and 4. The feedback you give after seeing the first attempt is worth ten times more than the initial prompt, because now you're reacting to concrete output instead of speculating about abstract requirements.

This is why prompt engineering, as a standalone discipline, has a ceiling. You can craft the perfect prompt and still get mediocre results if you can't evaluate and iterate on the output. Conversely, a mediocre prompt followed by three sharp feedback cycles produces better work than a perfect prompt with no follow-up.

The trust gradient

Not all tasks require the same level of verification. I think about it as a gradient: boilerplate and other low-stakes, easily reversed changes get a quick scan; ordinary feature code gets a genuine review; database migrations and anything else that's hard to undo get line-by-line scrutiny.

The mistake is applying the same verification intensity to everything. That's either too slow (reviewing boilerplate line by line) or too risky (glancing at a database migration). Match the oversight to the stakes.
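The gradient can be sketched as a simple lookup. The categories and keywords here are illustrative, not a canonical taxonomy — the point is that the mapping exists and is deliberate:

```typescript
// Illustrative mapping from task type to verification intensity.
// Keyword lists are examples only; a real one would be project-specific.
type Oversight = "scan" | "review" | "line-by-line";

function verificationLevel(task: string): Oversight {
  const lowStakes = ["boilerplate", "formatting", "scaffolding"];
  const highStakes = ["migration", "auth", "payment"];
  if (highStakes.some((k) => task.includes(k))) return "line-by-line";
  if (lowStakes.some((k) => task.includes(k))) return "scan";
  return "review"; // default: ordinary feature code gets a real review
}
```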

What breaks without verification

I've seen the failure modes. I've caused the failure modes. Here's what happens when you delegate without verifying:

Subtle incorrectness. The code runs. The tests pass. The feature works — for the cases you tested. But the agent made an assumption about date formats, or timezone handling, or the shape of an API response that doesn't hold in production. This is the most dangerous failure mode because it's invisible until it's expensive.

Architectural erosion. Each individual change is fine. But over fifty changes, the agent has introduced three different state management patterns, two logging approaches, and a circular dependency. Nobody noticed because each diff looked reasonable in isolation.

Dependency creep. Agents love adding packages. "Oh, you need date formatting? Let me add moment.js." No. We use date-fns. This is in the CLAUDE.md. But if you're not checking, the agent will gradually turn your lean project into a node_modules black hole.
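One cheap guard against dependency creep is diffing the dependency map against an allowlist. A minimal sketch — the allowlist contents are hypothetical, and the moment.js/date-fns pair follows the example above:

```typescript
// Flag any dependency not on the project's allowlist.
// In practice you'd read this from package.json in a CI check.
function unexpectedDeps(
  deps: Record<string, string>,
  allowed: Set<string>,
): string[] {
  return Object.keys(deps).filter((name) => !allowed.has(name));
}

// Example: the agent's diff added moment despite the allowlist.
const allowed = new Set(["react", "date-fns"]);
const afterAgentDiff = {
  react: "^18.0.0",
  "date-fns": "^3.0.0",
  moment: "^2.29.0",
};
const flagged = unexpectedDeps(afterAgentDiff, allowed); // ["moment"]
```

Run in CI, this turns "if you're not checking" into "the build checks for you."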

Building the muscle

Delegate & verify isn't a technique. It's a practice. Like code review, it gets better the more you do it, because you develop an intuition for what to check and what to trust.

After months of working this way, I can scan a diff and immediately spot the patterns that need scrutiny: new imports I didn't expect, error handling that's too optimistic, type assertions that bypass the compiler, or abstractions that solve problems I don't have.
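Some of that scanning can even be mechanized. These are rough heuristics for the patterns just listed — unexpected imports, compiler-bypassing assertions — not a real review tool:

```typescript
// Crude diff heuristics: flag added lines that deserve human scrutiny.
// Rules are illustrative; a real reviewer's checklist would be longer.
function flagSuspiciousLines(diff: string): string[] {
  const rules: [RegExp, string][] = [
    [/^\+\s*import\s/, "new import — was this dependency expected?"],
    [/^\+.*\bas any\b/, "type assertion bypasses the compiler"],
    [/^\+.*@ts-ignore/, "suppressed type error"],
  ];
  const flags: string[] = [];
  for (const line of diff.split("\n")) {
    for (const [re, msg] of rules) {
      if (re.test(line)) flags.push(`${msg}: ${line.trim()}`);
    }
  }
  return flags;
}
```

A script like this doesn't replace the intuition — it just makes sure the obvious cases never slip past it.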

That intuition doesn't come from prompting skills. It comes from engineering experience. AI-native development makes that experience more valuable, not less.

The agent handles the implementation. Your expertise handles the judgment.

The paradox of speed

Here's the counterintuitive part: adding verification makes you faster, not slower.

Before AI tools, most of my time went to implementation — the actual typing. Design and review were squeezed into whatever was left. Now the implementation is fast (the agent handles it), which frees up time for the parts that actually determine quality: thinking about the approach, reviewing the output, catching the things the agent missed.

The total wall-clock time is shorter, and the code is better because every piece has been through at least one deliberate review. The people who skip verification ship faster for the first week. Then they spend the next month debugging issues that a five-minute review would have caught.

The human in the loop isn't a bottleneck

There's a growing pressure — from tool vendors, from Twitter, from the productivity discourse — to remove humans from the loop entirely. To let agents go end-to-end. To optimize for throughput.

Resist it. Not because humans are infallible, but because the value of the human in the loop isn't just catching errors. It's maintaining understanding. If you can't explain what your codebase does and why, you've lost something more important than speed. You've lost the ability to make good decisions about what to build next.

Delegate the work. Verify the output. Maintain understanding. That's the loop.


The loop matters more than the prompt. The judgment matters more than the generation.