1. I was debugging why our retrieval was pulling garbage results at scale. Turns out Sonnet 4.6 was drifting on embedding consistency when you feed it the same text twice with slight context changes. Not a bug, just... unexpected behavior that nobody talks about.
2. So I wrote a little normalizer that sits between your chunks and the embedding call. It's dumb but it works: canonicalizes whitespace, strips metadata noise, makes sure you're feeding the model the same signal every time. Threw it on GitHub because why not, maybe someone else is pulling their hair out over this too.
3. It's not a silver bullet. It cut our bad retrievals by about 6%, which sounds small until your sales cycle is eating your runway and you can't afford to lose deals to search trash. Code's here if you want to fork it or yell at me about why I'm doing it wrong.
Not a framework or anything, just my setup.
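For the curious, the normalizer in point 2 boils down to something like this. This is a hedged sketch, not the actual repo code: the function name `normalize_chunk` and the metadata patterns are mine, and real metadata noise will vary by corpus. The point is just that identical content should always produce an identical string before it hits the embedding call.

```python
import re

# Hypothetical sketch of the normalizer described above -- the real repo
# may differ. Goal: same content in -> same embedding payload out.

# Example metadata-noise patterns (assumptions; tune for your corpus).
METADATA_PATTERNS = [
    re.compile(r"^\s*Page \d+( of \d+)?\s*$", re.IGNORECASE),  # page footers
    re.compile(r"^\s*(Copyright|©).*$", re.IGNORECASE),        # boilerplate
]

def normalize_chunk(text: str) -> str:
    """Canonicalize a chunk before embedding: drop metadata-noise lines,
    strip per-line padding, collapse all whitespace runs to single spaces."""
    kept = []
    for line in text.splitlines():
        if any(p.match(line) for p in METADATA_PATTERNS):
            continue  # drop lines that are pure metadata noise
        kept.append(line.strip())
    # Join non-empty lines, then collapse any remaining whitespace runs
    joined = " ".join(line for line in kept if line)
    return re.sub(r"\s+", " ", joined).strip()
```

Usage: `normalize_chunk("  Hello\n\nPage 3 of 10\nworld  ")` returns `"Hello world"`, and two chunks that differ only in padding or footers normalize to the same string, which is the whole game.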