Building agentic workflows with Claude Code

There's a gap that anyone building with LLMs runs into eventually. A model can do almost anything once — in a demo, with you watching, ready to nudge it when it drifts. Getting it to do the right thing every time, unattended, on inputs you've never seen, is a different job. That job is engineering, not prompting.

I've been building agentic workflows as Claude Code skills, and the most complete one is Glasshouse: an open-source skill that audits any website against GDPR and ePrivacy — consent banners, pre-consent tracking, cross-border transfers, security headers, cookies, fingerprinting, dark patterns — and then drafts the complaint for the right data protection authority. One command scans a site. One more turns the findings into a ready-to-file dossier.

It's a good case study precisely because it's unglamorous. Privacy auditing is detail work with legal consequences, which means there's no room for the model to be mostly right. Everything I'd tell you about making agents reliable, I had to actually do here. So let me walk through the architecture instead of theorising.

Separate what the model is good at from what it shouldn't be trusted with

The single most important decision in an agentic workflow is the boundary: which steps are deterministic code, and which are model judgment.

Glasshouse never asks the model to "go check if the site tracks you before consent." That's a factual question with a factual answer, and the answer comes from a real browser. The scanner is plain Playwright driving Firefox — it loads the page, records every network request before any consent is given, inspects cookies, captures the consent banner, and writes all of it to disk. No model in that loop. It's a sensor.
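
As a rough illustration of the sensor side, here is a minimal sketch of turning a recorded request log into a plain fact: which third-party hosts were contacted before consent. The record shape (`url`, `timestampMs`), the `consentAtMs` parameter, and the function name are assumptions for the example, not Glasshouse's actual format, and the hostname comparison is deliberately naive.

```javascript
// Hypothetical record shape: { url, timestampMs }, as a scanner might log it.
// consentAtMs is when consent was given, or null if it never was.
function preConsentThirdParty(requests, siteHost, consentAtMs) {
  // Naive check: a host is first-party if it is the site host or a subdomain.
  // A real implementation would need public-suffix awareness.
  const isThirdParty = (host) =>
    host !== siteHost && !host.endsWith("." + siteHost);

  return requests
    .filter((r) => consentAtMs === null || r.timestampMs < consentAtMs)
    .map((r) => ({ ...r, host: new URL(r.url).hostname }))
    .filter((r) => isThirdParty(r.host));
}

const requests = [
  { url: "https://example.com/app.js", timestampMs: 10 },
  { url: "https://cdn.example.com/logo.png", timestampMs: 12 },
  { url: "https://tracker.example.net/pixel", timestampMs: 15 },
  { url: "https://ads.example.org/tag.js", timestampMs: 900 },
];

// Consent was given at 500 ms, so only earlier requests count.
const facts = preConsentThirdParty(requests, "example.com", 500);
console.log(facts.map((r) => r.host)); // [ 'tracker.example.net' ]
```

Nothing here decides whether a host is a tracker or a CDN. The code only establishes what was contacted and when, which is the part with a ground truth.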

The model's job starts after the facts exist. Given a structured record of what the site did, it interprets: is this request a third-party tracker or a CDN asset? Does this banner make "reject" as easy as "accept," or is it a dark pattern? Which of the violations is serious enough to anchor a complaint? That's genuine judgment, and it's where the model earns its place.

The reliability heuristic I keep coming back to: if a step has a verifiable ground truth, don't let the model produce it — let the model interpret it. Models are unreliable narrators of fact and excellent readers of context. Build the boundary along that line.

This is the same idea behind tool use and MCP servers generally: the agent doesn't know the weather, it calls a tool that knows. Glasshouse just takes it seriously enough to push almost all factual production into deterministic code, and reserve the model for the part humans actually pay for — the read.

Make JSON the contract between the model and everything downstream

Early versions had the model write the final report directly — prose, scores, HTML, the lot. It worked in the demo and fell apart in production. Every run, the formatting drifted. Scores were computed inline and were inconsistent run to run. When I wanted to change how a report looked, I was editing prompts and praying.

The fix was to put a hard interface in the middle. Now the model produces exactly one thing: a structured analysis JSON — findings, severities, evidence references, the verdict per criterion. A deterministic renderer (generate.js) turns that JSON into the scored HTML deck and the markdown report. The model never touches the theme, the layout, or the scoring math.

scanner (Playwright)  ─►  raw evidence (JSON, screenshots)
                              │
                       model reads + judges
                              │
                              ▼
                   analysis.json  ◄── the contract
                              │
                     generate.js renders
                              │
                              ▼
                    scored HTML deck + markdown report

Two things got better immediately. First, output stopped drifting — presentation is code now, so it's identical every time. Second, the model's job got smaller and more testable. I can validate an analysis.json against a schema, diff two runs of the same site, and catch a bad judgment without wading through rendered HTML. A narrow, structured output is easier to trust than a wide, free-form one.
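
To make the validation step concrete, here is a minimal hand-rolled check, assuming a hypothetical analysis shape with `site` and a `findings` array of `criterion`, `severity`, `verdict`, and `evidence`. The field names and allowed values are illustrative guesses, not the real Glasshouse schema, and a JSON Schema validator would do the same job.

```javascript
// Assumed vocabularies; the real contract may use different values.
const SEVERITIES = new Set(["low", "medium", "high"]);
const VERDICTS = new Set(["pass", "fail", "unclear"]);

// Returns a list of human-readable problems; an empty list means valid.
function validateAnalysis(analysis) {
  if (typeof analysis !== "object" || analysis === null) {
    return ["analysis must be an object"];
  }
  const errors = [];
  if (typeof analysis.site !== "string" || analysis.site === "") {
    errors.push("site must be a non-empty string");
  }
  if (!Array.isArray(analysis.findings)) {
    errors.push("findings must be an array");
    return errors;
  }
  analysis.findings.forEach((f, i) => {
    const at = `findings[${i}]`;
    if (typeof f !== "object" || f === null) {
      errors.push(`${at} must be an object`);
      return;
    }
    if (typeof f.criterion !== "string") errors.push(`${at}.criterion must be a string`);
    if (!SEVERITIES.has(f.severity)) errors.push(`${at}.severity is not a known severity`);
    if (!VERDICTS.has(f.verdict)) errors.push(`${at}.verdict is not a known verdict`);
    if (!Array.isArray(f.evidence)) errors.push(`${at}.evidence must be an array`);
  });
  return errors;
}

const good = {
  site: "example.com",
  findings: [
    { criterion: "reject-parity", severity: "high", verdict: "fail", evidence: ["banner.png"] },
  ],
};
const bad = {
  site: "example.com",
  findings: [
    { criterion: "reject-parity", severity: "catastrophic", verdict: "fail", evidence: [] },
  ],
};

console.log(validateAnalysis(good)); // []
console.log(validateAnalysis(bad));
```

A model output that fails this check never reaches the renderer, so a malformed run fails loudly instead of producing a subtly wrong report.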

If you take one pattern from Glasshouse into your own agents, make it this. Define the artifact the model emits, validate it, and render from it. The contract is where reliability lives.
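
The render half can be sketched the same way. This assumes the same hypothetical finding shape (`criterion`, `severity`, `verdict`) and an invented penalty table; the real generate.js and its scoring math are not shown in this post.

```javascript
// Illustrative penalty weights, not the actual Glasshouse scoring.
const PENALTY = { low: 5, medium: 15, high: 30 };

// Score is computed in code from the findings, never by the model.
function score(findings) {
  const lost = findings
    .filter((f) => f.verdict === "fail")
    .reduce((sum, f) => sum + PENALTY[f.severity], 0);
  return Math.max(0, 100 - lost);
}

// Same input always yields the same markdown.
function renderMarkdown(analysis) {
  const lines = [
    `# Audit: ${analysis.site}`,
    "",
    `Score: ${score(analysis.findings)}/100`,
    "",
  ];
  for (const f of analysis.findings) {
    lines.push(`- [${f.verdict}] ${f.criterion} (${f.severity})`);
  }
  return lines.join("\n");
}

const analysis = {
  site: "example.com",
  findings: [
    { criterion: "pre-consent tracking", severity: "high", verdict: "fail" },
    { criterion: "security headers", severity: "medium", verdict: "pass" },
    { criterion: "cookie lifetime", severity: "low", verdict: "fail" },
  ],
};

console.log(renderMarkdown(analysis));
```

Changing how the report looks now means editing this function, and changing the scoring means editing one table. Neither touches a prompt.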

Gate on evidence, not on the model's confidence

Models will tell you a page does something it doesn't. Not maliciously — they pattern-match, and a site that looks like it should have a cookie wall gets described as having one. In a privacy audit, that's the difference between a credible finding and a defamatory one.

So Glasshouse has a rule: screenshot verification is mandatory before any finding is trusted. The scanner captures the actual rendered state. A claim like "the reject button is hidden two clicks deep" has to be backed by an image of the banner, not by the model's say-so. If the evidence isn't there, the finding doesn't ship.
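
A minimal sketch of such a gate, assuming each finding lists evidence file names and that the set of files the scanner actually captured is known. The shapes and names are hypothetical, and this only checks that evidence exists, not that the screenshot shows what the finding claims.

```javascript
// capturedFiles: Set of file names the scanner wrote to disk (assumed input).
function gateOnEvidence(findings, capturedFiles) {
  const accepted = [];
  const rejected = [];
  for (const f of findings) {
    const refs = Array.isArray(f.evidence) ? f.evidence : [];
    // A finding needs at least one reference, and every reference must exist.
    const backed = refs.length > 0 && refs.every((ref) => capturedFiles.has(ref));
    if (backed) {
      accepted.push(f);
    } else {
      rejected.push(f);
    }
  }
  return { accepted, rejected };
}

const captured = new Set(["banner.png"]);
const findings = [
  { claim: "reject is two clicks deep", evidence: ["banner.png"] },
  { claim: "site uses a cookie wall", evidence: ["cookie-wall.png"] },
  { claim: "banner preselects all purposes", evidence: [] },
];

const { accepted, rejected } = gateOnEvidence(findings, captured);
console.log(accepted.map((f) => f.claim)); // [ 'reject is two clicks deep' ]
console.log(rejected.length); // 2
```

The point of the design is that acceptance is decided by something the model cannot talk its way around: a file either exists or it does not.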

This generalises into a principle I now apply to every agent I build: separate generation from acceptance, and make acceptance depend on something external. I wrote more about this in Delegate & Verify — the verify half is doing more work than the delegate half, and that's the point. An agent you can't audit is an agent you can't deploy.

Scout before you commit the expensive work

Real sites are hostile to automation in boring ways: geo-walls, A/B-tested banners, lazy-loaded trackers that only fire on interaction. A naive scanner gets a different answer every run.

Glasshouse handles this with a cheap reconnaissance pass — a scout — before the full audit. The scout does a fast, low-cost look to figure out the lay of the land: is there a consent wall, what variant am I getting, does the site behave differently across loads. Then the real scan runs with that context, across multiple variants, so a single unlucky page load doesn't become the verdict.
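
One way to keep a single page load from becoming the verdict is to aggregate observations across loads. The sketch below assumes each load is reduced to a list of pre-consent third-party hosts; that input shape and the labels are invented for illustration.

```javascript
// loads: array of host lists, one list per page load (assumed shape).
function aggregateLoads(loads) {
  const counts = new Map();
  for (const hosts of loads) {
    // De-duplicate within a load so repeated requests count once.
    for (const host of new Set(hosts)) {
      counts.set(host, (counts.get(host) || 0) + 1);
    }
  }
  const summary = {};
  for (const [host, n] of counts) {
    summary[host] = n === loads.length ? "every load" : `${n} of ${loads.length} loads`;
  }
  return summary;
}

const loads = [
  ["t.example.net", "ads.example.org", "t.example.net"],
  ["t.example.net"],
  ["t.example.net"],
];

console.log(aggregateLoads(loads));
// { 't.example.net': 'every load', 'ads.example.org': '1 of 3 loads' }
```

A host seen on every load is a much stronger basis for a finding than one that appeared once, and the summary keeps that distinction visible to whoever judges it next.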

The general move is: spend a little to find out what kind of problem you have, then spend a lot solving the right one. It's the same instinct as routing a query to a cheap model first to decide whether the expensive model is even needed. Unconditional expensive work is a smell.
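
In code, the scout's output can simply be the input to a planning function that decides how much expensive work to schedule. The scout report fields used here (`reachable`, `bannerVariants`, `stableAcrossLoads`) and the load counts are assumptions made for this sketch.

```javascript
// scout: result of the cheap reconnaissance pass (hypothetical shape).
function planScan(scout) {
  if (!scout.reachable) {
    // Nothing to audit from here; skip the expensive pass entirely.
    return { run: false, reason: "site unreachable from this location" };
  }
  const variants =
    scout.bannerVariants.length > 0 ? scout.bannerVariants : ["default"];
  // Unstable sites get repeated loads; stable ones get a single load.
  const loadsPerVariant = scout.stableAcrossLoads ? 1 : 3;
  return { run: true, variants, loadsPerVariant };
}

console.log(planScan({ reachable: false }));
console.log(
  planScan({ reachable: true, bannerVariants: ["A", "B"], stableAcrossLoads: false })
);
```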

Package the workflow as a reusable unit

The last piece is mundane, but it's why any of this gets used twice. Glasshouse is a skill: a self-contained bundle of instructions, scripts, and the scan/file subcommands, not a pile of prompts I paste in. The scan subcommand runs the audit and produces the report. The file subcommand takes an existing scan and turns it into a complaint dossier for a chosen authority, with nine EU DPAs seeded. Two verbs, one packaged capability, version-controlled and shareable.
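
The two-verb surface can be pictured as a small dispatch table. This is only an illustration of the idea of a stable, narrow interface: the argument shapes, return values, and the `dispatch` function are hypothetical, and this is not how Claude Code itself invokes a skill.

```javascript
// Each verb maps to a handler; here the handlers just describe the job
// they would start. Argument order is an assumption for the example.
const commands = {
  scan: ([url]) => ({ action: "scan", url }),
  file: ([scanDir, authority]) => ({ action: "file", scanDir, authority }),
};

function dispatch(argv) {
  const [verb, ...rest] = argv;
  if (!Object.prototype.hasOwnProperty.call(commands, verb)) {
    const known = Object.keys(commands).join(", ");
    throw new Error(`unknown command: ${verb}; expected one of ${known}`);
  }
  return commands[verb](rest);
}

console.log(dispatch(["scan", "https://example.com"]));
console.log(dispatch(["file", "./scan-output", "example-dpa"]));
```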

Skills (and MCP servers, in the same spirit) are how an agentic workflow stops being a clever afternoon and becomes infrastructure. The packaging is the product. If it can't be invoked the same way tomorrow by someone who isn't me, it isn't really a workflow — it's a story about one.

Want to see the artifact side of a consent audit? This interactive lab lets you build a cookie banner from parts and watch a live GDPR/ePrivacy compliance meter flag each dark pattern — the same violations Glasshouse detects, made tangible.

The shape generalises

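None of this is specific to privacy. Strip the domain out and the architecture is: deterministic code produces the facts; the model interprets them and emits one structured artifact; code validates that artifact and renders from it; acceptance depends on external evidence, not on the model's confidence; a cheap scout runs before the expensive work; and the whole thing ships as a packaged, repeatable unit.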

That's the difference between an LLM that can do anything once and an agent that does the right thing every time. I think about systems before I think about prompts — more on that in Systems, Not Scripts — and Glasshouse is the systems thinking made concrete.

Glasshouse is MIT-licensed and on GitHub. You can explore the playbook to see real audits it's produced, or — if you'd rather not run it yourself — have me scan your site.
