We hit 47 customers last week. Still tiny in the grand scheme of things, but it's the first time I've watched people actually adopt something we built for a real workflow, not a demo, not a prototype. And something unexpected happened that I keep thinking about.
One customer, a manager at a mid-size consulting firm, was using our agent platform to automate part of their intake process. Nothing fancy, just parsing emails and routing them to the right team. Standard stuff. But she called me directly (maybe a red flag, but I still answer) and said something that stuck with me.
She said, "The agent doesn't need to be perfect. It just needs to be predictable."
That's it. Not "faster than we expected" or "saves us money." Predictable.
Turns out her team had tried three other tools before ours. All of them were technically "smarter" (better benchmarks, fancier models, whatever). But they were inconsistent. One day the agent would catch 95% of the routing decisions correctly. Next day, 70%. They couldn't trust it. They couldn't build workflows around something that kept surprising them. So they went back to manual work.
With our setup, yeah, maybe we're hitting 88% accuracy on the parsing. Not cutting edge. But we hit 88% today, 88% tomorrow, 88% next week. You can plan around that. You can set up escalation rules. You can staff for it.
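To make that concrete, here's roughly what "you can plan around it" looks like. This is a hypothetical sketch, not our actual code; the volume, the confidence floor, and every function name here are made up for illustration.

```python
# A hypothetical sketch of what "you can plan around 88%" looks like.
# All names and numbers are illustrative, not our actual platform.

from dataclasses import dataclass

DAILY_VOLUME = 400        # emails/day (assumed for the example)
STABLE_ACCURACY = 0.88    # the point: this number doesn't move week to week
CONFIDENCE_FLOOR = 0.75   # below this, the agent doesn't guess

@dataclass
class Routing:
    team: str
    confidence: float

def route(email: str, classify) -> str:
    """Route an email, escalating instead of guessing on low confidence."""
    decision: Routing = classify(email)
    if decision.confidence < CONFIDENCE_FLOOR:
        return "human-review-queue"  # a predictable failure mode you can staff for
    return decision.team

# Staffing becomes arithmetic once accuracy is stable:
expected_misroutes = DAILY_VOLUME * (1 - STABLE_ACCURACY)  # ~48/day
print(f"Plan for ~{expected_misroutes:.0f} human corrections per day")
```

That last line is the whole pitch: when the error rate holds still, the cost of the errors becomes a line item instead of a surprise.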
This completely flipped how I'm thinking about what we're shipping. I came into this world thinking the job was to push model performance. More parameters, better training data, novel architectures. And sure, that matters. But we've been chasing the wrong metric the whole time. We should be chasing stability.
I mentioned this to our lead ML engineer in our standup, and she just nodded like, "Yeah, obviously." Apparently this is a known thing in the ML world. Consistency and reliability ship products. Bleeding-edge performance wins papers. And I realized I've been making hiring decisions wrong. We need someone who can keep a system stable under load, with degradation you can predict. Not someone chasing the new hotness.
Tbh, this also explains why we've been struggling to hire in SF. We've been posting for the "build state-of-the-art systems" person, and those people are either already at Google or they're building their own thing. The people who actually want to work on production systems that ship reliably and scale? They're harder to find because nobody talks about them as much. They don't have GitHub stars. They don't speak at conferences about their quantization strategies. They just... ship.
I've been buzzing about this for like four days. Changed our job posting. Changed how we scope features. We're now tracking consistency on all our agents as a first-class metric, right alongside latency and cost. Code whose behavior doesn't move from week to week gets more respect in review.
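For anyone curious what "consistency as a first-class metric" means in practice, here's a minimal sketch of the kind of check you can run next to your latency and cost dashboards. The band threshold and the sample numbers are invented for the example.

```python
# A minimal sketch of treating consistency as a first-class metric.
# The 0.03 band is an arbitrary threshold chosen for illustration.

from statistics import mean, stdev

def consistency_report(daily_accuracy: list[float], band: float = 0.03) -> dict:
    """Flag an agent whose accuracy wanders more than `band` around its mean.

    A high mean with a wide spread is exactly the "95% one day, 70% the next"
    failure mode customers can't build workflows around.
    """
    mu = mean(daily_accuracy)
    spread = stdev(daily_accuracy) if len(daily_accuracy) > 1 else 0.0
    return {
        "mean_accuracy": round(mu, 3),
        "stdev": round(spread, 3),
        "stable": spread <= band,
    }

# The steady 88% agent passes; the flashy-but-erratic one fails:
print(consistency_report([0.88, 0.87, 0.88, 0.89, 0.88]))  # stable: True
print(consistency_report([0.95, 0.70, 0.91, 0.76, 0.93]))  # stable: False
```

Note the second agent has a higher mean accuracy (85% vs 88%... actually lower here, but even when it's higher) and still fails the check. That's the inversion: the spread matters more than the peak.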
The wild part? We haven't actually changed the underlying model stack. Same Claude 4.7, same prompting strategy. But because we're thinking about it as a reliability problem instead of a performance optimization problem, we're making different decisions about caching, about how we sample, about how we fail. And surprise, customers are happier.
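Here's a rough sketch of those three levers in one place. It uses the real Anthropic Python SDK, but the model id is a placeholder and the cache and fallback policy are illustrative, not what we actually ship.

```python
# Three reliability levers: caching, deterministic sampling, known failure modes.
# Uses the Anthropic Python SDK; model id and policy details are placeholders.

import hashlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
_cache: dict[str, str] = {}

def classify_email(body: str) -> str:
    # 1. Caching: identical input, identical output. Removes one source of drift.
    key = hashlib.sha256(body.encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    try:
        # 2. Sampling: temperature=0 trades a little cleverness for repeatability.
        response = client.messages.create(
            model="claude-model-id-here",  # placeholder, not a real model id
            max_tokens=16,
            temperature=0,
            system="Reply with exactly one team name: sales, support, or billing.",
            messages=[{"role": "user", "content": body}],
        )
        result = response.content[0].text.strip().lower()
    except anthropic.APIError:
        # 3. Failure: a known fallback beats a surprising answer.
        result = "human-review-queue"
    _cache[key] = result
    return result
```

None of these choices makes the model smarter. All of them make tomorrow look like today, which is what the customer was actually asking for.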
Anyone else working on agent products running into this? I feel like there's a whole category of problems that nobody talks about because they're not sexy. But they're what actually matters to the person using your thing.
Also, if you're an ML engineer in the Bay Area who genuinely cares about shipping production systems and would rather be right 88% of the time consistently than right 95% of the time unpredictably, let's talk. We're hiring.
Let's go.