model-selection — fab.mabl.com

When people ask which model we run our agents on, the honest answer is that there's no single "best" one to pick. The choice spans several dimensions at once (provider, capability tier, how much the model thinks before it acts, and how well any of that fits the task in front of it), and they interact in ways I can't reason about from intuition. The one that still catches me off guard: more thinking isn't always better. For some tasks, turning up the reasoning made our eval scores worse, or added latency for no real gain. You'd never see that by eyeballing a handful of sessions. It only shows up once you have enough eval cases to compare, and building that suite is its own investment before you can even ask the question.
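To make the "enough eval cases" point concrete, here is a rough sketch of how I think about it, not our actual eval tooling. It puts a Wilson score interval around a pass rate and checks whether two reasoning settings can be told apart. The function names and the pass counts are hypothetical, made up purely for illustration, and "intervals don't overlap" is a deliberately conservative stand-in for a proper significance test.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True when two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts: the same 80% vs 60% pass rates at two suite sizes.
small_low = wilson_interval(8, 10)       # lower reasoning setting, 10 cases
small_high = wilson_interval(6, 10)      # higher reasoning setting, 10 cases
large_low = wilson_interval(800, 1000)   # same rates, 1000 cases
large_high = wilson_interval(600, 1000)

print("10 cases distinguishable:", not intervals_overlap(small_low, small_high))
print("1000 cases distinguishable:", not intervals_overlap(large_low, large_high))
```

With ten cases per setting the two intervals overlap, so the gap could easily be noise. With a thousand, the same rates separate cleanly.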

The other thing I've landed on is that picking a model isn't a one-time call. Before we change anything we run large eval suites and simulate locally. Even one-shot behavior is hard to characterize from a small sample, and our agents are the opposite of one-shot: they run many rounds, with the nondeterminism compounding at each step, so a handful of sessions tells you almost nothing. What I actually trust is watching a change play out across a broad suite of full runs. Once a model is live, our observability keeps collecting the signals that feed the next round: where it's slow, where it stalls, where it second-guesses itself. I used to think of the harness as the thing that runs the agent; lately I think of it just as much as the thing that tells me whether the model I picked last month is still the right one. That loop is where most of my time goes right now.
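A back-of-the-envelope sketch of the compounding, under a big simplifying assumption: each round goes as intended with the same probability, independently of the others. Real agent runs are not that tidy, and the 0.95 per-step rate and the step and session counts below are illustrative numbers, not measurements from our system.

```python
import math


def run_success_probability(step_success, steps):
    """Chance a whole run goes as intended if every step is independent."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    return step_success ** steps


def standard_error(rate, sessions):
    """Standard error of a pass rate observed over a number of sessions."""
    return math.sqrt(rate * (1 - rate) / sessions)


# Illustrative only: a step that works 95% of the time, over 20 rounds.
whole_run = run_success_probability(0.95, 20)
print(f"whole-run success: {whole_run:.3f}")  # about 0.358

# How noisy is that rate when judged from 5 sessions versus 500?
for sessions in (5, 500):
    print(sessions, "sessions, standard error:", round(standard_error(whole_run, sessions), 3))
```

A step that looks reliable on its own yields a full run that succeeds only about a third of the time, and from five sessions the uncertainty on that rate is around twenty points either way. That is the arithmetic behind not trusting a handful of sessions.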

Posts tagged "model-selection"

Picking a model for our agents is sneakily complicated