I've been wrestling with sensor data pipelines for months, and every approach I tried ended up eating RAM like it was going out of style. The usual path was pandas into some columnar format, but I kept hitting a wall where loading a week of high-frequency data would balloon the process into swap territory.

I looked at the benchmarks everyone quotes, and honestly, most of them are designed to make the author's choice look good: they pick operations that favour one library, then act shocked when it wins. So I actually sat down and modelled the real cost of my typical workflow (rolling windows, resampling, some aggregations) against three stacks. DuckDB with Parquet files came out ahead on memory by a factor of four, and the query times were competitive enough that I wouldn't be waiting around.

The setup cost was minimal (maybe six hours refactoring my pipeline), and now I'm offloading computation to SQL instead of pulling everything into RAM first. The trade-off is that I lose some of pandas' flexibility for quick exploratory work, but that's fine because I do exploration on a sample anyway. The total time investment feels worth it when you measure it against the infrastructure headaches I was creating.

I'm running the same test suite now, and the memory profile is stable. Maybe this finally fixes it.
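For anyone curious what the refactor looks like in practice, here's a minimal sketch of the query shape I moved to. The `ts` and `reading` column names and the `sensor_data/*.parquet` path are placeholders for whatever your schema looks like, not my actual setup:

```python
import duckdb

con = duckdb.connect()  # in-process; no server to run

# Resample raw readings to 1-minute means, then take a 10-minute rolling
# average over the resampled series. DuckDB streams over the Parquet files
# rather than loading them into memory first; only the aggregated result
# is materialised by the final .df() call.
# ('ts', 'reading', and the path are hypothetical names.)
df = con.execute("""
    WITH resampled AS (
        SELECT
            date_trunc('minute', ts) AS minute,
            avg(reading)             AS mean_reading
        FROM read_parquet('sensor_data/*.parquet')
        GROUP BY 1
    )
    SELECT
        minute,
        mean_reading,
        avg(mean_reading) OVER (
            ORDER BY minute
            ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
        ) AS rolling_10min
    FROM resampled
    ORDER BY minute
""").df()

# For the exploratory work I still do in pandas, a random sample is cheap:
sample = con.execute("""
    SELECT *
    FROM read_parquet('sensor_data/*.parquet')
    USING SAMPLE 1%
""").df()
```

Peak memory now tracks the size of the aggregated output rather than the size of the raw data, which is the whole trick.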