so i've been running benchmarks at work for like three months now, trying to build an internal eval suite for code agents. it started because we needed to measure whether Claude 4.7 was actually better than Opus 4.7 for our specific workflows. seemed straightforward enough.
then i realized something that's been gnawing at me ever since. single-run benchmarks are basically theater. if an agent solves a problem once, that tells you almost nothing about whether it solves it consistently. i started tracking 10+ runs of the same prompts on the same models and the variance is... honestly wild. one agent will nail a task 90% of the time, then fail catastrophically on run 11 for reasons i still can't explain.
saw someone ship a viral eval tool on X yesterday and i was like, yeah, that's cool, but what's the 95th percentile performance on their benchmarks? are they reporting median or mean? what about the outliers? they're probably not even asking these questions because it's not as shiny to talk about.
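to make that concrete, here's a toy example of how much the summary stat you pick changes the story. the scores are invented purely for illustration, not from any real benchmark:

```python
import statistics

# hypothetical per-run scores from 10 runs of the same prompt.
# made-up numbers, chosen only to show how stats can diverge.
scores = [0.92, 0.95, 0.91, 0.94, 0.93, 0.90, 0.96, 0.12, 0.95, 0.93]

print(statistics.mean(scores))    # 0.851 -- dragged down by the one bad run
print(statistics.median(scores))  # 0.930 -- hides the outlier entirely

# 5th percentile: 0.12 here, the tail that both stats above smooth over
print(statistics.quantiles(scores, n=20)[0])
```

same ten runs, three wildly different headlines. that's the whole problem with single-number reporting.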
anyway, i pivoted. instead of building an eval framework i'm now building something simpler: a harness that lets you run an agent prompt 50+ times against the same test cases and actually see the distribution. it tracks which specific runs fail, when they fail, and what the error patterns are. not trying to give you a score. just trying to show you what actually happens when you run this thing for real.
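the core loop is dead simple. here's a minimal sketch of what i mean — `run_agent` and the result shape are placeholders for whatever invokes your agent, not our actual internals:

```python
from collections import Counter

def run_distribution(run_agent, prompt, test_cases, n_runs=50):
    """run the same prompt n_runs times and report what actually happened.

    run_agent is a stand-in for your agent invocation; assume it
    returns (passed: bool, error: str | None) for one attempt.
    """
    results = []
    for i in range(n_runs):
        passed, error = run_agent(prompt, test_cases)
        results.append({"run": i, "passed": passed, "error": error})

    failures = [r for r in results if not r["passed"]]
    return {
        "n_runs": n_runs,
        "pass_rate": sum(r["passed"] for r in results) / n_runs,
        # which specific runs failed, in order -- this is where run 11 shows up
        "failed_runs": [r["run"] for r in failures],
        # how the failures cluster: timeouts vs bad patches vs test errors, etc.
        "error_patterns": Counter(r["error"] for r in failures),
    }
```

no score, no leaderboard. just the distribution and the failure buckets.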
it sounds way less impressive than "benchmark suite" but tbh i think it's more useful. we're using it now to identify which agents are actually stable versus which ones just got lucky on our test set.
if anyone's running evals on code agents or language models, curious what you're tracking. are you looking at single runs or distributions?