Posts tagged "harness"

1 post · all topics

We rebuilt our agents five times so you don't have to — here's what we learned

Over the last few years we've rebuilt our test-authoring agent five times. In normal software, rewriting a core feature that many times is a flashing red light; when the models shift under you every few months, it's mostly just what staying current costs. The useful part is what survived every rewrite — five things that held no matter which model we were running on.

The agent flywheel — the five lessons as a cycle, each rebuild spinning it faster


Work with the model's grain, not against it. Back in 2023 I tried to get PaLM to pick the single most similar word to a target in the DOM, for smarter auto-healing. It failed every way I framed it — but the outputs showed it clearly understood the task. It was bad at picking one similar word and surprisingly good at grouping words by meaning. So we rebuilt auto-healing around semantic grouping instead of arguing with it, and a hard limitation turned into a reliable feature. These models have a grain, like people do; you get further shaping the system around what they're already good at.

An agent is only as good as its input. Most of the agent failures I've chased turned out to be upstream of the model — a planning session stuffed with base64 screenshots the model can't read, a tool returning a vague error the agent couldn't act on, the right context buried under noise. The model usually wasn't the problem; the signal we handed it was. We've gotten far more mileage out of cleaning up what goes in — tighter tool definitions, scoped context, error messages written for the model to actually use — than out of swapping the model itself.

Embrace non-determinism on purpose. Testing is supposed to remove uncertainty, so deliberately putting a probabilistic model in the middle of it felt like a category error. It wasn't. When we leaned into letting a model judge whether an application state was right — in plain language, the way a person would — it expanded what automated testing could even cover. The trick was using the model where judgment beats a brittle assertion, not everywhere.

Measure trust with data, not vibes. For a while we tested our AI features with small hand-curated sets and spot-checks, and it left us blind to silent regressions. The fix was building evaluator suites — the same judge idea, pointed inward — that grade our agents across many runs. That's how we caught a regression hiding inside an upgrade: moving to Gemini 3 cut reasoning loops by something like 2–4x, but the same report flagged hardcoded values going up, because the model was now finishing harder tests that older ones gave up on. I'd never have spotted that by eye.

AI amplifies whatever foundation you've built. When we scaled coding agents across our repos this year, the biggest thing slowing them down wasn't the AI — it was build times, shaky CI, thin test coverage. The same fundamentals that slow people down. Fixing them helped the agents and the humans in equal measure. A non-deterministic tool doesn't make error handling, reusability, and tests matter less; it runs your weak spots over and over until they show.

What strikes me looking back is how little of this was about the model getting smarter. In 2023 we couldn't get a model through a login screen; today the model is rarely the bottleneck. The harder questions now are fit and cost — whether an agent has the context about what you care about, works inside the tools you already use, and earns its keep on latency and token spend instead of taxing the team. The bar I keep coming back to is whether it behaves like a good teammate.