perf(duckdb): duckdb backend 10,000x slower than Polars for lagged operations #9405

Closed
1 task done
MarcoGorelli opened this issue Jun 19, 2024 · 21 comments
Labels
performance Issues related to ibis's performance

Comments

@MarcoGorelli

MarcoGorelli commented Jun 19, 2024

What happened?

I reported this here, and was asked to open an issue

What version of ibis are you using?

9.1.0

What backend(s) are you using, if any?

duckdb

Relevant log output

The notebook where I noticed this is here

Here's a smaller snippet to reproduce:

import pandas as pd
import polars as pl
import numpy as np
import narwhals as nw
import ibis

rng = np.random.default_rng(1)
N = 10_000_000
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)
data = {'a': a, 'b': b, 'c': c}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)


def add_lags_pandas(df, cols, lags):
    return pd.concat(
        [
            df,
            *[
                df.loc[:, col].shift(-lag).rename(f'{col}_{lag}')
                for col in cols
                for lag in lags
            ],
        ],
        axis=1,
    )

def add_lags_polars(df, cols, lags):
    return df.with_columns(
        pl.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )

@nw.narwhalify
def add_lags_narwhals(df, cols, lags):
    return df.with_columns(
        nw.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )

def add_lags_ibis_pandas(df, cols, lags):
    ibis.set_backend('pandas')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_pandas()

def add_lags_ibis_polars(df, cols, lags):
    # Polars backend does not support this operation
    ibis.set_backend('duckdb')
    t = ibis.memtable(df.with_row_index())
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_polars().sort('index').drop('index')

and then just run

add_lags_ibis_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])

and

add_lags_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])

Code of Conduct

  • I agree to follow this project's Code of Conduct
@MarcoGorelli MarcoGorelli added the bug Incorrect behavior inside of ibis label Jun 19, 2024
@MarcoGorelli
Author

Aside from performance, there might be a bug in the duckdb backend?

In [16]: add_lags_ibis_pandas(df_pl, ['a'], [1])  # looks correct
Out[16]:
                a         b         c       a_1
0        0.345584  0.164416 -0.407465  0.821618
1        0.821618 -0.562795 -0.062351  0.330437
2        0.330437  1.523051  0.519244 -1.303157
3       -1.303157  0.573010 -1.722321  0.905356
4        0.905356  0.629068  1.522420  0.446375
...           ...       ...       ...       ...
9999995 -1.844940 -1.460037 -0.420211  1.426008
9999996  1.426008 -1.786794  1.411938 -1.770208
9999997 -1.770208 -1.234044  1.061642 -0.293813
9999998 -0.293813 -1.985160 -0.134314 -0.870312
9999999 -0.870312 -1.305430 -0.965417       NaN

[10000000 rows x 4 columns]

In [17]: add_lags_ibis_polars(df_pl, ['a'], [1])  # what happened here?
Out[17]:
shape: (10_000_000, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         ┆ a_1       │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 0.008919  ┆ 0.641422  ┆ 0.376592  ┆ 0.927814  │
│ 0.927814  ┆ -1.595148 ┆ -1.321503 ┆ -0.216802 │
│ -0.216802 ┆ 0.163541  ┆ -0.839665 ┆ 0.627737  │
│ 0.627737  ┆ 0.199048  ┆ -0.055465 ┆ 2.124896  │
│ 2.124896  ┆ -0.484083 ┆ -1.280158 ┆ -0.120967 │
│ …         ┆ …         ┆ …         ┆ …         │
│ -0.156736 ┆ -0.267561 ┆ -0.548102 ┆ -1.527322 │
│ -1.527322 ┆ -0.172108 ┆ -0.840629 ┆ 0.883683  │
│ 0.883683  ┆ -1.140378 ┆ -1.907993 ┆ -0.471029 │
│ -0.471029 ┆ -0.854808 ┆ -0.36428  ┆ 2.66472   │
│ 2.66472   ┆ 0.497843  ┆ -0.089645 ┆ 0.050035  │
└───────────┴───────────┴───────────┴───────────┘

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

Regarding the bug: the results are correct up until N = 8192 (2**13). From N = 8193 onwards, it breaks.

@cpcloud
Member

cpcloud commented Jun 19, 2024

There's no guaranteed ordering of output results without an explicit call to order_by.

If the 8193rd element differs, that's highly indicative that 8192 is some kind of buffer size for a unit of processing inside duckdb.
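For illustration, here's a minimal sketch (my addition, not part of the original report) of making the ordering explicit: carry a row index, order the lag window by it, and sort the output by it before materialising. It assumes the df_pl frame from the snippet above.

import ibis

ibis.set_backend('duckdb')

# carry an explicit row index so both the lag window and the output have a defined order
t = ibis.memtable(df_pl.with_row_index())
w = ibis.window(order_by='index')
t = t.mutate(
    ibis._[col].lag(-lag).over(w).name(f'{col}_{lag}')
    for col in ['a', 'b', 'c']
    for lag in [1, 2, 3, 4, 5]
).order_by('index')
result = t.to_polars().drop('index')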

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

OK thanks, I've updated the example to sort by index afterwards, so that the input order is preserved for the user at the end. But I'll keep the sorting out of the benchmark timing.

On Zulip you wrote

That last row doesn't look right. Can you open an issue about it with a fully reproducible example?

Is that still the case? Does this require investigation on your end, or is it just the kind of operation one might expect to be slow?

@cpcloud
Member

cpcloud commented Jun 19, 2024

It requires some investigation.

Can you show some concrete numbers for the 10,000x?

I see about 280x using this script:

script with `time`
import pandas as pd
import time
import polars as pl
import numpy as np
import ibis

rng = np.random.default_rng(1)
N = 10_000_000
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)
data = {'a': a, 'b': b, 'c': c}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)


def add_lags_pandas(df, cols, lags):
    return pd.concat(
        [
            df,
            *[
                df.loc[:, col].shift(-lag).rename(f'{col}_{lag}')
                for col in cols
                for lag in lags
            ],
        ],
        axis=1,
    )

def add_lags_polars(df, cols, lags):
    return df.with_columns(
        pl.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )


def add_lags_ibis_pandas(df, cols, lags):
    ibis.set_backend('pandas')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_pandas()

def add_lags_ibis_polars(df, cols, lags):
    # Polars backend does not support this operation
    ibis.set_backend('duckdb')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_polars()

start = time.time()
add_lags_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()

print(f"polars: {stop-start:.2f}s")

start = time.time()
add_lags_ibis_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()
print(f"ibis:   {stop-start:.2f}s")

@cpcloud
Member

cpcloud commented Jun 19, 2024

Also, the polars performance varies by almost a factor of 5 between runs; I'm not sure how to factor that in here!

@cpcloud
Member

cpcloud commented Jun 19, 2024

I suspect I am now hitting the OS page cache, and the first run wasn't :)

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

Can you show some concrete numbers for the 10,000x?

They're in the notebook https://www.kaggle.com/code/marcogorelli/narwhals-could-you-just-use-ibis-instead, which you can fork and run if you like

The timings come from running it on Kaggle; running it locally may differ.

There, I report on the minimum, maximum, and average times:

Polars native
min: 0.0007532539999980751
max: 0.044216634333338334
0.006996648095236782 +/- 0.005743185736385091
Polars via Narwhals
min: 0.0009312336666577418
max: 0.00101599099999324
0.0009707789999966041 +/- 9.109105595347205e-06
Polars dataframe, Ibis with duckdb backend
min: 13.30594426766667
max: 13.773238605666657
13.551140615047618 +/- 0.05954104203620545
pandas native
min: 1.535507855999981
max: 1.6537581420000151
1.5739719158095309 +/- 0.01732459334764224
pandas via Narwhals
min: 0.5989258073333303
max: 0.8326186953333187
0.6507151033333359 +/- 0.0285118691222816
pandas dataframe, Ibis pandas backend
min: 2.6033352826666487
max: 3.050078681666681
2.681404633904755 +/- 0.05706747288901051

The numbers in the table at the bottom are the minimums, as my understanding is that that's best practice when benchmarking (https://docs.python.org/3/library/timeit.html#module-timeit):

Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
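As a concrete sketch of that approach (my addition, assuming the data and functions from the snippet above are in scope), timeit.repeat returns the vector of timings and the minimum is the number that goes in the table:

import timeit

# run the variant a handful of times and keep the full vector of timings
times = timeit.repeat(
    "add_lags_polars(df_pl, ['a', 'b', 'c'], [1, 2, 3, 4, 5])",
    globals=globals(),
    number=1,
    repeat=5,
)
print(f'min: {min(times)}')  # the value reported in the table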

@cpcloud
Member

cpcloud commented Jun 19, 2024

In general when making these comparisons it would be really helpful if you could compare with the duckdb SQL API, so that we can rule out Ibis as the source of the problem here and funnel the report upstream.
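Something along these lines would do it (a rough sketch of mine, not a snippet from this thread): materialise the polars frame into a native duckdb table and compute the same lagged columns with LEAD window functions through the SQL API.

import duckdb
import numpy as np
import polars as pl

# same data as the original snippet
rng = np.random.default_rng(1)
N = 10_000_000
df_pl = pl.DataFrame({k: rng.normal(size=N) for k in ('a', 'b', 'c')})

con = duckdb.connect()
con.register('df_view', df_pl)                          # register the polars frame as a view
con.execute('CREATE TABLE t AS SELECT * FROM df_view')  # materialise into duckdb's native format

cols, lags = ['a', 'b', 'c'], [1, 2, 3, 4, 5]
projections = ', '.join(
    f'LEAD({col}, {lag}) OVER () AS {col}_{lag}' for col in cols for lag in lags
)
result = con.sql(f'SELECT *, {projections} FROM t').pl()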

@cpcloud
Member

cpcloud commented Jun 19, 2024

Fascinating results I'm seeing:

  1. polars is definitely the fastest here on this operation
  2. duckdb with raw sql is taking up most of the time, with Ibis + duckdb about 2x slower than duckdb alone

I gave duckdb every advantage here:

  1. not timing the connection
  2. creating a table in the native format from polars instead of using their magic view functionality.

I'm going to repurpose this issue for the ibis/duckdb discrepancy and open up an issue on the duckdb tracker for the lag performance.

That said, I think this is a weird benchmark; it's not really representative of any particular workload. Data are hardly ever normally distributed, randomly generated numbers.

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

Thanks for your response 🙏

not really representative of any particular workload

The M5 forecasting competition was pretty much all lagged features; I'd say this is a very common operation in practical machine learning and forecasting: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/163684

Data are hardly ever normally distributed randomly generated numbers

It's the efficiency of the shift operation that makes the difference here, rather than the distribution of the input values.

As noted on Zulip, I posted this in order to answer a question posed to me by someone at Voltron after I'd presented Narwhals, and how scikit-lego is using it, at PyCon Italy: "why don't you just use Ibis?". I tried to answer that question by taking a scikit-lego function, generating some random data, and timing the relative overheads of Narwhals vs Ibis.

Personally, I think this answers the question. If you'd like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too 🤗

Or we can loop in the person who asked the question, if they meant something different by "why don't you just use Ibis?". The ball was thrown into my court; I'm just trying to respond 😄

@cpcloud cpcloud changed the title bug: duckdb backend 10,000x slower than Polars for lagged operations perf(duckdb): duckdb backend 10,000x slower than Polars for lagged operations Jun 19, 2024
@cpcloud cpcloud added the performance Issues related to ibis's performance label Jun 19, 2024
@cpcloud
Member

cpcloud commented Jun 19, 2024

The M5 forecasting competition was pretty much all lagged features, I'd say that this is a very common operation in practical machine learning and forecasting

Indeed, lagging itself is a common operation for sure!

Personally, I think this answers the question. If you like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too

I still think framing this as an Ibis-vs-Narwhals comparison is misleading.

The performance difference observed has basically nothing to do with Ibis; it's concentrated in DuckDB. So this is about Polars and DuckDB, not Ibis and Narwhals.

@cpcloud
Member

cpcloud commented Jun 19, 2024

That said, it's up to you. I'd just ask that you link to this issue so folks get the full context.

@MarcoGorelli
Author

The purpose of Narwhals is to enable library maintainers to write functions such as:

  1. pandas/Polars/etc goes in
  2. something happens under the hood
  3. the user gets pandas/Polars/etc back

When I was asked "why not use Ibis", they presumably meant "why not use Ibis for step 2"? Would you agree?

If so, do you have a better suggestion for how to answer your colleague's question?

I'd just ask that you link to this issue so folks get the full context.

Sure 😇

@cpcloud
Member

cpcloud commented Jun 19, 2024

Presumably if narwhals had duckdb support then this question wouldn't be that interesting to answer, because you'd see basically the same thing you'd see with Ibis.

@cpcloud
Member

cpcloud commented Jun 19, 2024

The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"

@MarcoGorelli
Author

DuckDB is out-of-scope for Narwhals, happy to leave that to you!

The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"

that, and that it makes pandas noticeably slower, so it would have an impact on the existing pandas user base 😉

I won't publish this publicly anyway (at least, not the duckdb part).

@cpcloud
Member

cpcloud commented Jun 19, 2024

that, and that it makes pandas noticeably slower

Yep, and we've already addressed that elsewhere.

@cpcloud
Member

cpcloud commented Jun 19, 2024

Created the DuckDB issue here.

@cpcloud cpcloud removed the bug Incorrect behavior inside of ibis label Jun 19, 2024
@MarcoGorelli
Author

Thanks! Like you, I'm a fan of DuckDB, so if the end result is that things improve in DuckDB, that's a win for everyone

so this is about Polars and DuckDB

I'll just keep this result in my back pocket, then, in case an "Ibis is faster than Polars"-style post gets published 😉

Thanks for engaging, have a nice day! 🌞

@cpcloud
Member

cpcloud commented Jun 21, 2024

Closing this out as it's being tracked in duckdb/duckdb#12600.

@cpcloud cpcloud closed this as completed Jun 21, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jun 21, 2024