ok so i was between meetings this morning scrolling through some benchmarks on retrieval and chunking strategies, and yeah no, i keep seeing the same pattern. everyone's doing vector search like it's 2023 and just... hoping the reranker will fix it downstream
i'm dealing with this exact problem right now in our fraud detection pipeline. we're pulling context from transaction history, merchant patterns, user behavior graphs, whatever we need to score a decision. early on we just threw everything into a vector db, chunked at arbitrary boundaries, let Claude 4.7 figure it out. worked fine until throughput became a problem
turns out chunking strategy matters way more than people admit. if your chunks are too big you're dragging in noise alongside the signal. too small and you're fragmenting context that should stay together. we were doing fixed-size 512-token chunks because that's what every tutorial shows. switched to semantic chunking based on actual transaction boundaries (like, a full transaction record stays intact whether it's 200 tokens or 800) and our reranker immediately got better at filtering noise
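roughly what record-boundary chunking looks like, as a minimal sketch (not our production code: `serialize`, the token ceiling, and `count_tokens` are all stand-ins for whatever matches your schema and embedder):

```python
MAX_CHUNK_TOKENS = 1024  # hypothetical ceiling, tune for your embedding model

def serialize(record: dict) -> str:
    # flatten one transaction record to text, however makes sense for your schema
    return " | ".join(f"{k}={v}" for k, v in record.items())

def chunk_by_record(records: list[dict], count_tokens) -> list[str]:
    """pack whole records into chunks. a record never gets split across
    chunks, whether it's 200 tokens or 800; an oversized record just
    becomes a chunk of its own instead of being cut mid-record."""
    chunks, current, current_tokens = [], [], 0
    for rec in records:
        text = serialize(rec)
        n = count_tokens(text)
        # close out the current chunk if this record won't fit
        if current and current_tokens + n > MAX_CHUNK_TOKENS:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        chunks.append("\n".join(current))
    return chunks

# crude usage, with word count standing in for a real tokenizer:
# chunks = chunk_by_record(txn_history, lambda t: len(t.split()))
```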
but here's the thing that actually moved the needle: we stopped relying on the reranker to do all the heavy lifting and started doing basic filtering upstream. like, for a fraud check, there's nothing for the reranker to throw out if we never pull merchant data from 6 months ago when the pattern only matters in the last 14 days. added some stupid-simple heuristics before even hitting the vector db and cut our context window requirements by like 40%
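the upstream filter really is that dumb. a sketch of the shape (filter syntax varies by vector db vendor, and these windows are illustrative, not our real numbers):

```python
from datetime import datetime, timedelta, timezone

RECENCY_WINDOWS = {
    # hypothetical: how far back each signal type is worth pulling at all
    "merchant_pattern": timedelta(days=14),
    "user_behavior": timedelta(days=90),
    "transaction": timedelta(days=30),
}

def build_filter(signal_kind: str, now=None) -> dict:
    """build a metadata filter for the vector query so stale rows never
    come back in the first place and the reranker never sees them."""
    now = now or datetime.now(timezone.utc)
    window = RECENCY_WINDOWS.get(signal_kind)
    if window is None:
        return {"kind": signal_kind}  # no recency constraint for this type
    cutoff = (now - window).isoformat()
    return {"kind": signal_kind, "ts": {"gte": cutoff}}

# e.g. index.query(vector=emb, filter=build_filter("merchant_pattern"), top_k=20)
# (pseudocode: your vector db's query + filter api will differ)
```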
my finance team is breathing down my neck about consolidating our vendor stack, which sounds great until you realize we'd lose the specialized retrieval layer we built and have to bake everything into the model context somehow. so yeah, i'm very motivated to figure out if we can make this leaner
what i'm curious about: how much are people actually optimizing at the chunking level vs just trusting the reranker? because from what i'm seeing in that benchmark thread, the gap between a well-tuned chunking strategy and a mediocre one is bigger than the gap between reranker models. nobody talks about that
also if anyone's using Sonnet 4.6 for retrieval in production, i'd love to hear how you're handling the cost/latency tradeoff. we're still on 4.7 for our actual scoring because the accuracy difference matters for fraud, but for just pulling context i wonder if we're overspending
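for what it's worth, here's the shape of the split i'd try, sketched against the anthropic python sdk. model ids are placeholders and none of this is shipped; the routing is the point, not the names:

```python
import anthropic

client = anthropic.Anthropic()

CONTEXT_MODEL = "your-cheap-fast-model"  # placeholder: condensing context is forgiving
SCORING_MODEL = "your-accurate-model"    # placeholder: the call where accuracy pays rent

def condense_context(raw_chunks: list[str]) -> str:
    """cheap-model pass: boil retrieved chunks down so the expensive
    scoring call sees fewer tokens."""
    resp = client.messages.create(
        model=CONTEXT_MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "summarize the fraud-relevant facts from these records:\n\n"
                       + "\n---\n".join(raw_chunks),
        }],
    )
    return resp.content[0].text

def score(condensed: str) -> str:
    resp = client.messages.create(
        model=SCORING_MODEL,
        max_tokens=256,
        messages=[{"role": "user",
                   "content": f"score this transaction context for fraud risk:\n\n{condensed}"}],
    )
    return resp.content[0].text
```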
feels like everyone's optimizing the wrong parts of the pipeline