I just saw the benchmark thread going around and it got me thinking about something that's been frustrating me for weeks now.
We've been talking to schools about our tutoring platform for Portuguese kids. Like, seriously talking. Meetings, demos, the whole thing. And I keep running into the same wall: they want to know which model performs "best" on standardized tests. Which LLM scores highest on reading comprehension benchmarks. Which system is most "accurate."
But that's not what our product does. That's not what education needs.
I tested Claude 4.7 and GPT-5 side by side with actual student interactions for three months. Not on synthetic benchmarks. Real kids, real questions, real confusion. And you know what I found? The "better" model on paper doesn't necessarily make a better tutor. The model that asks clarifying questions when a kid gets stuck beats the one that just gives the right answer. The one that notices when a student is frustrated and changes tone wins every time. And the one that explains slowly when needed and picks up the pace when the kid's ready? That's what actually works.
One school asked me to show them benchmark scores. I sent them a video of a 10-year-old who hated reading finally getting excited about a story. The reaction was... polite silence. They forwarded it to their board. Still waiting.
This is what's driving me crazy. We're all obsessing over whether model X gets 87.3% or 89.1% on some test set, and meanwhile real people with actual problems are trying to figure out if a tool actually helps their kid learn. Two different things entirely.
I'm not saying benchmarks are useless. They're useful for engineers; they help us understand what's improving under the hood. But for products that touch people's lives, they're a really poor proxy for what matters. And schools especially seem stuck on the idea that the highest-scoring system must be the best, when in reality they need something that fits their specific students, their curriculum, their constraints.
The frustrating part? I can't even really talk about this with schools, because they expect me to speak benchmark language. I could say "our system is optimized for pedagogical engagement," but that's nonsense corporate speak. What I want to say is: "This doesn't work like a search engine. It works like a patient teacher. That's hard to measure, but it's what actually matters."
Anyone else building something where the metrics that look good on paper don't match what your actual users need? I'm curious how you're navigating this.
Let's go build products that matter, not just products that score well.