I've been wrestling with this for weeks now, so figured I'd throw it out there. We're building something for Latin America, and voice is a core part of the experience. The problem is that most voice models, even the paid ones, treat Spanish like it's one thing.
It's not. Mexican Spanish sounds nothing like Argentine Spanish. Colombian, Peruvian, Chilean, Central American, Peninsular: they're all different. And the thing is, our users notice immediately when it's off. They hear a neutral voice or, worse, something that sounds vaguely Spanish but isn't their Spanish, and the whole interaction feels wrong.
I've been testing both proprietary solutions and open source alternatives. The paid players (not naming names, but you know who) offer better production quality overall, but their Spanish coverage is weirdly shallow for a region with 500 million people. You get a few regional flavors, maybe. It's like they optimized for scale, not authenticity.
So I started digging into open source options. Tacotron 2, Glow-TTS, some newer stuff being built by communities in Spain and Latin America. The quality variance is massive, but here's the honest tradeoff: you get regional specificity and the ability to fine-tune for your dialect, but you're taking on hosting, inference costs, and way more maintenance. If your user base is concentrated in one country or region, it might actually make sense. If you need to support multiple variants, suddenly you're managing multiple models.
I've also been experimenting with fine-tuning existing models against recordings from actual users. It's slower and messier than using a pre-trained voice endpoint, but the results feel more real. Obviously, this only works at a certain scale, though.
What's interesting is that the economics don't quite work out for commercial voice API providers to do this well. The market is too fragmented. Building good Spanish voice models for, say, Peru specifically, is niche work. But for us, it's the difference between a product that feels thoughtful and one that feels generic.
I'm curious if anyone else here is dealing with similar language specificity issues. Are you staying with the paid stuff and accepting the limitations? Building your own? Using open source and dealing with the infrastructure headache? I'm leaning toward a hybrid approach (pay for the polished tier, but supplement with open source for regional variants we really need), but I keep wondering if I'm overcomplicating it.
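To make the hybrid idea concrete, here's roughly how I picture the routing decision per user locale. This is a minimal sketch, not something we run in production: the locale sets, the fallback choice, and the backend names ("self_hosted", "paid_api") are all placeholders I made up for illustration, and it doesn't call any real TTS service.

```python
from dataclasses import dataclass

# Hypothetical: variants where we'd self-host a fine-tuned open source model.
SELF_HOSTED_LOCALES = {"es-AR", "es-CL", "es-PE"}

# Hypothetical: variants we assume the paid provider covers well enough.
PAID_LOCALES = {"es-MX", "es-ES", "es-US"}

# Assumption: the variant we fall back to when we have no exact match.
FALLBACK_LOCALE = "es-MX"


@dataclass(frozen=True)
class VoiceRoute:
    backend: str  # "self_hosted" or "paid_api" (placeholder names)
    locale: str   # variant the voice will actually be rendered in
    exact: bool   # False when we fell back to a different variant


def normalize(locale: str) -> str:
    """Turn inputs like 'es_ar' or 'ES-ar' into 'es-AR'."""
    parts = locale.replace("_", "-").split("-")
    lang = parts[0].lower()
    if len(parts) > 1 and parts[1]:
        return f"{lang}-{parts[1].upper()}"
    return lang


def pick_route(user_locale: str) -> VoiceRoute:
    """Prefer our own regional model, then the paid tier, then a fallback."""
    locale = normalize(user_locale)
    if locale in SELF_HOSTED_LOCALES:
        return VoiceRoute("self_hosted", locale, True)
    if locale in PAID_LOCALES:
        return VoiceRoute("paid_api", locale, True)
    # No dedicated voice for this variant: use the paid default and flag it.
    return VoiceRoute("paid_api", FALLBACK_LOCALE, False)
```

The `exact` flag is the part I care about most: logging how often users land on a fallback voice would tell us which regional variant is worth the next round of hosting and maintenance.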
Thanks for reading this ramble, btw. Weekend brain. But genuinely interested in what others are doing.