I've been running our agent evaluations on 4.7 for the past month, and I have to admit I was skeptical about the per-token cost jump from earlier versions. But the interpretability improvements alone are saving us hours. The reasoning traces are cleaner, the failure modes are more predictable, and we're catching edge cases in testing that would've slipped to staging before.
I know everyone's price-sensitive right now, which is fair. But if you're actually deploying agents in production and you care about understanding what they're doing (which, fwiw, you should), the cost delta flattens out pretty quickly once you factor in the reduced debugging time and higher confidence in outputs.
I'll probably stick with it for the next research cycle. Curious if anyone else has had a similar experience, or if I'm just lucky with our particular use cases.