perf(duckdb): duckdb backend 10,000x slower than Polars for lagged operations #9405
Aside from performance, there might be a bug in the duckdb backend?
In [16]: add_lags_ibis_pandas(df_pl, ['a'], [1])  # looks correct
Out[16]:
a b c a_1
0 0.345584 0.164416 -0.407465 0.821618
1 0.821618 -0.562795 -0.062351 0.330437
2 0.330437 1.523051 0.519244 -1.303157
3 -1.303157 0.573010 -1.722321 0.905356
4 0.905356 0.629068 1.522420 0.446375
... ... ... ... ...
9999995 -1.844940 -1.460037 -0.420211 1.426008
9999996 1.426008 -1.786794 1.411938 -1.770208
9999997 -1.770208 -1.234044 1.061642 -0.293813
9999998 -0.293813 -1.985160 -0.134314 -0.870312
9999999 -0.870312 -1.305430 -0.965417 NaN
[10000000 rows x 4 columns]
In [17]: add_lags_ibis_polars(df_pl, ['a'], [1]) # what happened here?
Out[17]:
shape: (10_000_000, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ a ┆ b ┆ c ┆ a_1 │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 0.008919 ┆ 0.641422 ┆ 0.376592 ┆ 0.927814 │
│ 0.927814 ┆ -1.595148 ┆ -1.321503 ┆ -0.216802 │
│ -0.216802 ┆ 0.163541 ┆ -0.839665 ┆ 0.627737 │
│ 0.627737 ┆ 0.199048 ┆ -0.055465 ┆ 2.124896 │
│ 2.124896 ┆ -0.484083 ┆ -1.280158 ┆ -0.120967 │
│ … ┆ … ┆ … ┆ … │
│ -0.156736 ┆ -0.267561 ┆ -0.548102 ┆ -1.527322 │
│ -1.527322 ┆ -0.172108 ┆ -0.840629 ┆ 0.883683 │
│ 0.883683 ┆ -1.140378 ┆ -1.907993 ┆ -0.471029 │
│ -0.471029 ┆ -0.854808 ┆ -0.36428 ┆ 2.66472 │
│ 2.66472 ┆ 0.497843 ┆ -0.089645 ┆ 0.050035 │
└───────────┴───────────┴───────────┴───────────┘
Regarding the bug: the results are correct up until the 8192nd element.
There's no guaranteed ordering of output results without an explicit call to `order_by`. If the 8193rd element differs, that's highly indicative that 8192 is some kind of buffer size for a unit of processing inside duckdb.
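For example, one way to make the ordering explicit (a sketch only; the add_lags_ordered name and the __idx helper column are made up for illustration, and this isn't necessarily how the notebook does it) is to carry a row index through the query and sort on it at the end:

import ibis
import pandas as pd

def add_lags_ordered(df: pd.DataFrame, cols, lags):
    # Attach an explicit row index so the input order can be restored,
    # since SQL engines don't guarantee output row order.
    df = df.assign(__idx=range(len(df)))
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f"{col}_{lag}")
        for col in cols
        for lag in lags
    )
    return t.order_by("__idx").drop("__idx").to_pandas()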
OK thanks, I've updated the example to sort by index afterwards, so that the input order is preserved at the end for the user. But I'll keep sorting out of the benchmark. On Zulip you wrote:
Is that still the case? Does this require investigation on your end, or is it just the kind of operation which one might expect to be slow?
It requires some investigation. Can you show some concrete numbers for the 10,000x? I see about 280x using this script:
import pandas as pd
import time
import polars as pl
import numpy as np
import ibis
rng = np.random.default_rng(1)
N = 10_000_000
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)
data = {'a': a, 'b': b, 'c': c}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)
def add_lags_pandas(df, cols, lags):
return pd.concat(
[
df,
*[
df.loc[:, col].shift(-lag).rename(f'{col}_{lag}')
for col in cols
for lag in lags
],
],
axis=1,
)
def add_lags_polars(df, cols, lags):
return df.with_columns(
pl.col(col).shift(-lag).alias(f'{col}_{lag}')
for col in cols
for lag in lags
)
def add_lags_ibis_pandas(df, cols, lags):
ibis.set_backend('pandas')
t = ibis.memtable(df)
t = t.mutate(
ibis._[col].lag(-lag).name(f'{col}_{lag}')
for col in cols
for lag in lags
)
return t.to_pandas()
def add_lags_ibis_polars(df, cols, lags):
# Polars backend does not support this operation
ibis.set_backend('duckdb')
t = ibis.memtable(df)
t = t.mutate(
ibis._[col].lag(-lag).name(f'{col}_{lag}')
for col in cols
for lag in lags
)
return t.to_polars()
start = time.time()
add_lags_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()
print(f"polars: {stop-start:.2f}s")
start = time.time()
add_lags_ibis_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()
print(f"ibis: {stop-start:.2f}s") |
Also the polars performance varies by almost a factor of
I suspect I am now hitting the OS page cache, and the first run wasn't :)
They're in the notebook https://www.kaggle.com/code/marcogorelli/narwhals-could-you-just-use-ibis-instead, which you can fork and run if you like. The timings come from running it on Kaggle; running it locally may differ. There, I report the minimum, maximum, and average times:
The numbers in the table at the bottom are the minimum times, as my understanding is that that's best practice when benchmarking: https://docs.python.org/3/library/timeit.html#module-timeit
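For example, with timeit (a sketch; the function and data names follow the script above, not necessarily the notebook):

import timeit

# Best-of-several-repeats, as the timeit docs suggest; number=1 because a
# single call already takes seconds at this data size.
times = timeit.repeat(
    "add_lags_polars(df_pl, ['a', 'b', 'c'], [1, 2, 3, 4, 5])",
    globals=globals(),
    repeat=5,
    number=1,
)
print(f"min={min(times):.2f}s max={max(times):.2f}s mean={sum(times) / len(times):.2f}s")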
In general when making these comparisons it would be really helpful if you could compare with the duckdb SQL API, so that we can rule out Ibis as the source of the problem here and funnel the report upstream.
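For instance, a raw DuckDB version of the same lagged-column query (a sketch; it relies on DuckDB's replacement scan picking up the in-scope df_pl variable, and ibis's lag(-1) corresponds to SQL lead(..., 1)) might look like:

import time
import duckdb

start = time.time()
out = duckdb.sql(
    """
    SELECT *,
           lead(a, 1) OVER () AS a_1,
           lead(a, 2) OVER () AS a_2
    FROM df_pl
    """
).pl()  # materialize back into a Polars DataFrame
print(f"duckdb sql: {time.time() - start:.2f}s")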
Fascinating results I'm seeing:
I gave duckdb every advantage here:
I'm going to repurpose this issue for the ibis/duckdb discrepancy and open up an issue on the duckdb tracker for the lag performance. That said, I think this is a weird benchmark; it's not really representative of any particular workload. Data are hardly ever normally distributed, randomly generated numbers.
Thanks for your response 🙏
The M5 forecasting competition was pretty much all lagged features; I'd say that this is a very common operation in practical machine learning and forecasting: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/163684
It's the efficiency. As noted on Zulip, I posted this in order to answer the question which was posed to me by someone at Voltron: "why don't you just use Ibis?", after I'd presented Narwhals and how scikit-lego is using it at PyCon Italy. I tried to answer that question by taking a scikit-lego function, making some random data, and timing the relative overheads of Narwhals vs Ibis. Personally, I think this answers the question. If you'd like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too 🤗 Or we can loop in the person who asked the question if they meant something different by "why don't you just use Ibis?" - the ball was thrown into my court, I'm just trying to respond 😄
Indeed, lagging itself is a common operation for sure!
I still think framing this as an Ibis vs Narwhals comparison is misleading. The performance difference observed has basically nothing to do with Ibis; it's concentrated in DuckDB, so this is about Polars and DuckDB, not Ibis and Narwhals.
That said, it's up to you. I'd just ask that you link to this issue so folks get the full context.
The purpose of Narwhals is to enable library maintainers to write functions such as:
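Roughly along these lines (a sketch of the pattern only; the add_lagged_feature name is made up for illustration):

import narwhals as nw

def add_lagged_feature(df_native, col: str, lag: int):
    # 1. wrap whatever the user passed in (pandas, Polars, ...)
    df = nw.from_native(df_native)
    # 2. do the dataframe-agnostic work via the Narwhals (Polars-like) API
    df = df.with_columns(nw.col(col).shift(-lag).alias(f"{col}_{lag}"))
    # 3. hand back the same type the user passed in
    return df.to_native()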
When I was asked "why not use Ibis", they presumably meant "why not use Ibis for step 2"? Would you agree? If so, do you have a better suggestion for how to answer your colleague's question?
Sure 😇
Presumably if narwhals had duckdb support then this question wouldn't be that interesting to answer, because you'd see basically the same thing you'd see with Ibis.
The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"
DuckDB is out-of-scope for Narwhals, happy to leave that to you!
That, and that it makes pandas noticeably slower, so it would have an impact on the existing pandas user base 😉 I won't publish this publicly anyway (at least, not the duckdb part)
Yep, and we've already addressed that elsewhere.
Created the DuckDB issue here: duckdb/duckdb#12600.
Thanks! Like you, I'm a fan of DuckDB, so if the end result is that things improve in DuckDB, that's a win for everyone
I'll just keep this result in my back pocket then, in case an "Ibis is faster than Polars"-style post gets published 😉 Thanks for engaging, have a nice day! 🌞
Closing this out as it's being tracked in duckdb/duckdb#12600. |
What happened?
I reported this here, and was asked to open an issue
What version of ibis are you using?
9.1.0
What backend(s) are you using, if any?
duckdb
Relevant log output
The notebook where I noticed this is here
Here's a smaller snippet to reproduce, and then just run add_lags_ibis_polars and add_lags_polars on the same data and compare timings.
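For instance (a sketch reusing the names from the benchmark script shown earlier on this page; the original snippet may differ in detail):

import numpy as np
import polars as pl
import ibis

rng = np.random.default_rng(1)
N = 10_000_000
df_pl = pl.DataFrame({k: rng.normal(size=N) for k in ("a", "b", "c")})

def add_lags_polars(df, cols, lags):
    # plain Polars: shift each requested column by each lag
    return df.with_columns(
        pl.col(col).shift(-lag).alias(f"{col}_{lag}") for col in cols for lag in lags
    )

def add_lags_ibis_polars(df, cols, lags):
    # Ibis via the DuckDB backend (the Polars backend doesn't support lag)
    ibis.set_backend("duckdb")
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f"{col}_{lag}") for col in cols for lag in lags
    )
    return t.to_polars()

add_lags_polars(df_pl, ["a", "b", "c"], [1, 2, 3, 4, 5])       # fast
add_lags_ibis_polars(df_pl, ["a", "b", "c"], [1, 2, 3, 4, 5])  # orders of magnitude slower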