The frontier moves daily. So does the practice.
Six months ago, the models I relied on for most of my work didn't exist yet. Claude Sonnet 4.6 shipped in February. GPT-5.3 Codex dropped the same month. Gemini 3.1 Pro followed a week later. DeepSeek V4 is landing as I write this. The framework I use for agent orchestration had a different API. The tool I use for terminal-based development had shipped three major updates. The cost of the inference I run daily had dropped by half.
This isn't unusual. This is the rhythm.
The treadmill problem
The pace of AI model releases has gone from annual to quarterly to monthly to — at this point — continuous. February 2026 alone brought twelve significant model updates across the major labs. Over 255 model releases have been tracked in Q1 2026 so far. Every major provider is shipping updates that range from incremental improvements to paradigm shifts, and the intervals between them keep shrinking.
For most developers, this creates a treadmill effect. You learn a tool. You build with it. You get comfortable. Then a new version drops that changes the API, deprecates your approach, or introduces capabilities that make your architecture look like a horse and buggy. OpenAI deprecated GPT-4o, GPT-4.1, and GPT-5 (Instant and Thinking) in February — models that were cutting-edge months prior. You either adapt or fall behind.
The instinct is to resist this. To find the "stable" platform and stick with it. To wait for the dust to settle.
The dust isn't going to settle. There is no stable platform. The entire field is in motion, and the teams that treat this as a problem to solve rather than a reality to embrace are the ones getting left behind.
What I actually track
I'm not tracking every model release, every benchmark score, every Twitter announcement. That's a full-time job, and it's not the same as building things. Here's what I do track:
Capability thresholds. When a model crosses a threshold that changes what's possible — like when Gemini 3.1 Pro scored 44.4% on Humanity's Last Exam (doubling the previous best), or when context windows expanded to 1M tokens as a standard feature — that matters. Incremental benchmark improvements on the same tasks? Less so.
Cost-per-token trends. This is the most underrated signal. When GPT-5.3 Codex ships at 2x the speed and 0.5x the cost of its predecessor, it doesn't just save money — it changes which architectures are viable. Multi-agent systems that were too expensive six months ago become trivially cheap. Tasks you'd only run in batch become feasible in real-time.
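To make the "0.5x the cost" effect concrete, here's a minimal sketch of the arithmetic. All prices, token counts, and agent counts below are invented for illustration — they are not real provider rates:

```python
# Hypothetical prices (USD per million tokens) — invented for illustration.
OLD_PRICE = {"input": 4.00, "output": 16.00}
NEW_PRICE = {"input": 2.00, "output": 8.00}  # 0.5x the predecessor

def run_cost(price: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one model call at the given per-million-token rates."""
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000

# A hypothetical multi-agent pipeline: one planner plus five workers per task.
AGENTS_PER_TASK = 6
TOKENS_IN, TOKENS_OUT = 8_000, 2_000  # per agent, per task

def daily_cost(price: dict, tasks_per_day: int) -> float:
    """Total daily spend for the pipeline at a given task volume."""
    per_task = AGENTS_PER_TASK * run_cost(price, TOKENS_IN, TOKENS_OUT)
    return per_task * tasks_per_day

old = daily_cost(OLD_PRICE, tasks_per_day=500)
new = daily_cost(NEW_PRICE, tasks_per_day=500)
print(f"old: ${old:.2f}/day, new: ${new:.2f}/day")
```

The numbers are fake, but the shape is real: a flat 2x price drop halves the daily bill of a six-agent pipeline, which is often the difference between "batch only" and "run it on every request."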
New modalities and tool access. When models gain the ability to use new tools — web browsing, code execution, file system access, MCP integrations — that's a structural shift. It doesn't just improve existing workflows; it enables entirely new ones. MCP has gone from experimental to table stakes in under a year.
Developer experience changes. How we interact with models matters as much as what the models can do. Claude Code moving development into the terminal — and now at version 2.x with native VS Code extensions, the Claude Agent SDK, and subagent orchestration — was a bigger shift for my daily practice than any individual model improvement.
Running experiments, not following hype
There's a difference between reading about a new model and running it against your actual tasks.
When a new model drops, I don't spend hours reading benchmark comparisons. I run it through agent-duelist — the tool I built specifically for this. Same tasks, same evaluation criteria, different providers. Correctness, latency, token usage, cost. Quantified, not vibed.
This matters because benchmarks lie — not deliberately, but structurally. They measure performance on specific task distributions that may have nothing to do with your actual workload. A model that tops the leaderboard on competitive programming might be mediocre at reviewing code diffs. A model with impressive reasoning scores might be terrible at following structured output formats.
The only benchmark that matters is your benchmark. Running on your tasks. With your data. In your workflow.
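agent-duelist's internals aren't shown here, but the core idea — same tasks, same scoring, per-provider metrics — can be sketched in a few lines. The stand-in "models" below are lambdas so the sketch runs without API keys; token usage and cost tracking are omitted for brevity:

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    provider: str
    correct: int
    total: int
    latency_s: float

def duel(providers: dict[str, Callable[[str], str]],
         tasks: list[tuple[str, str]]) -> list[Result]:
    """Run every provider against the same tasks, scored identically."""
    results = []
    for name, ask in providers.items():
        start = time.perf_counter()
        correct = sum(1 for prompt, expected in tasks
                      if ask(prompt).strip() == expected)
        results.append(Result(name, correct, len(tasks),
                              time.perf_counter() - start))
    return results

# Stand-in "models" — replace with real API clients in practice.
providers = {
    "model-a": lambda p: p.upper(),
    "model-b": lambda p: p,
}
tasks = [("ok", "OK"), ("go", "GO")]
for r in duel(providers, tasks):
    print(f"{r.provider}: {r.correct}/{r.total} in {r.latency_s:.4f}s")
```

The point of the structure: the evaluation harness is yours, so the task distribution is yours. Swapping a leaderboard for this loop is what turns "vibes" into numbers.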
Local experimentation
Not everything needs to run in the cloud. I keep a local setup for experimenting with models that don't need frontier-level capability:
- Smaller open models like Qwen 3.5 and DeepSeek V3 for testing whether a task actually needs a big model (often it doesn't)
- Quantized versions for exploring cost-sensitivity — what's the minimum model size that produces acceptable quality?
- Specialized open-weight models for tasks where I need consistent, domain-specific behavior without per-token costs
Local experimentation is free and private, even if it's not fast — a quantized model on a MacBook won't outrun a cloud API. But the zero cost means you can iterate without watching the meter, and the privacy means you can test with real data. It's how I figure out the right model-to-task mapping before committing to an architecture.
The pattern, not the tool
Here's the meta-lesson from two years of continuous AI evolution: the patterns outlast the tools.
Specific models come and go — often faster than you'd expect. Claude Opus 4 and 4.1 were deprecated in January, barely months after launch. GPT-4o was retired from ChatGPT in February. Framework APIs change. Provider pricing shifts. But the underlying patterns — task decomposition, human-in-the-loop verification, composable architectures — remain stable across every shift.
When I built consent-sentinel, I designed the multi-agent architecture around patterns, not around specific model capabilities. When Sonnet 4.6 shipped and outperformed what I was using, I swapped the model in the config. The orchestration, the task decomposition, the verification logic — none of that changed.
This is the difference between building on a model and building on a principle. Models are volatile. Principles are durable.
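The "swap the model in the config" move only works if orchestration code never names a model directly. A minimal sketch of that separation, with a hypothetical config format (consent-sentinel's real config isn't shown here):

```python
import json

# Hypothetical config: the only place a model name appears.
CONFIG = json.loads('{"planner_model": "sonnet-4.6", "worker_model": "haiku-4"}')

class Agent:
    """Orchestration code binds to a role; the model name lives in config."""

    def __init__(self, role: str, config: dict):
        self.model = config[f"{role}_model"]

    def run(self, task: str) -> str:
        # Task decomposition and verification logic would live here.
        # None of it changes when the model underneath does.
        return f"[{self.model}] {task}"

planner = Agent("planner", CONFIG)

# A better model ships: the upgrade is a config change, not a code change.
CONFIG["planner_model"] = "sonnet-4.7"
upgraded = Agent("planner", CONFIG)
```

The durable thing is the `Agent` abstraction and the role-based indirection; the volatile thing is a string in a JSON file.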
Staying sharp without drowning
My actual routine:
Weekly: Scan releases from Anthropic, OpenAI, Google DeepMind, and open-source labs. Not deep dives — just headlines and changelogs. Five minutes, max.
When something looks significant: Run it through my benchmarks. Test it on my actual tasks. Compare it against what I'm currently using. If it's meaningfully better, integrate it. If it's marginal, note it and move on.
Monthly: Review my model routing config. Are the model assignments still optimal? Has something new made a cheaper option viable for a task I was overpaying for? With the current pace — Sonnet 4.7 and Opus 4.7 are already rumored for this month — this review is never wasted.
Continuously: Build things. The best way to stay current isn't reading — it's building. Every project forces you to engage with the current state of the art, because you're constantly making decisions about what tools to use and how to use them.
This site is an artifact of that practice. Every project in the archive was built with whatever was best at the time. Looking back at older artifacts, I can see the evolution in real time — the techniques that worked, the patterns that persisted, and the approaches I've since abandoned.
Embracing the discomfort
There's a specific discomfort that comes with always evolving: the feeling that what you're building right now might be obsolete next month. That the architecture you're designing will need to change. That the tool you just mastered will be superseded.
This discomfort is productive. It forces you to:
- Build for adaptability over permanence. Loose coupling, clean interfaces, swappable components.
- Document your decisions, so when you need to change them, you know why you made them in the first place.
- Invest in understanding, not just usage. If you understand why a model works well for a task, you can evaluate its successor in minutes instead of days.
- Ship anyway. Waiting for the "right" model or the "stable" framework is a form of procrastination. The best time to build is now, with the best tools available today.
The trajectory matters more than the snapshot
If you zoom out from the daily noise of model releases and benchmark wars, the trajectory is clear: AI is getting dramatically better, cheaper, and more accessible on a timescale of months. Every architecture decision you make should account for the fact that next quarter's models will be better than this quarter's.
This doesn't mean waiting. It means designing systems that can absorb improvement. When a better model drops, the right architecture lets you plug it in and immediately benefit. The wrong architecture — tightly coupled to a specific model's quirks — forces a rewrite.
The engineers who thrive in this environment aren't the ones who know the most about any single model. They're the ones who can learn fast, adapt faster, and maintain a practice that evolves as quickly as the tools.
That's what "always evolving" means. Not chasing every release. Not optimizing for today's benchmark. Building a practice — a way of working — that gets better every week because the foundations are solid enough to absorb constant change.
Tracking model releases, running local experiments, studying what works. Building with today's best tools, not last year's, anchored in the patterns that outlast them.