perf(duckdb): duckdb backend 10,000x slower than Polars for lagged operations #9405

Closed
1 task done
MarcoGorelli opened this issue Jun 19, 2024 · 21 comments
Labels
performance Issues related to ibis's performance

Comments

@MarcoGorelli

MarcoGorelli commented Jun 19, 2024

What happened?

I reported this here, and was asked to open an issue

What version of ibis are you using?

9.1.0

What backend(s) are you using, if any?

duckdb

Relevant log output

The notebook where I noticed this is here

Here's a smaller snippet to reproduce:

import pandas as pd
import polars as pl
import numpy as np
import narwhals as nw
import ibis

rng = np.random.default_rng(1)
N = 10_000_000
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)
data = {'a': a, 'b': b, 'c': c}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)


def add_lags_pandas(df, cols, lags):
    return pd.concat(
        [
            df,
            *[
                df.loc[:, col].shift(-lag).rename(f'{col}_{lag}')
                for col in cols
                for lag in lags
            ],
        ],
        axis=1,
    )

def add_lags_polars(df, cols, lags):
    return df.with_columns(
        pl.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )

@nw.narwhalify
def add_lags_narwhals(df, cols, lags):
    return df.with_columns(
        nw.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )

def add_lags_ibis_pandas(df, cols, lags):
    ibis.set_backend('pandas')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_pandas()

def add_lags_ibis_polars(df, cols, lags):
    # Polars backend does not support this operation
    ibis.set_backend('duckdb')
    t = ibis.memtable(df.with_row_index())
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_polars().sort('index').drop('index')

and then just run

add_lags_ibis_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])

and

add_lags_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])

Code of Conduct

  • I agree to follow this project's Code of Conduct
@MarcoGorelli MarcoGorelli added the bug Incorrect behavior inside of ibis label Jun 19, 2024
@MarcoGorelli
Author

Aside from performance, there might be a bug in the duckdb backend?

In [16]: add_lags_ibis_pandas(df_pl, ['a'], [1])  # looks correct
Out[16]:
                a         b         c       a_1
0        0.345584  0.164416 -0.407465  0.821618
1        0.821618 -0.562795 -0.062351  0.330437
2        0.330437  1.523051  0.519244 -1.303157
3       -1.303157  0.573010 -1.722321  0.905356
4        0.905356  0.629068  1.522420  0.446375
...           ...       ...       ...       ...
9999995 -1.844940 -1.460037 -0.420211  1.426008
9999996  1.426008 -1.786794  1.411938 -1.770208
9999997 -1.770208 -1.234044  1.061642 -0.293813
9999998 -0.293813 -1.985160 -0.134314 -0.870312
9999999 -0.870312 -1.305430 -0.965417       NaN

[10000000 rows x 4 columns]

In [17]: add_lags_ibis_polars(df_pl, ['a'], [1])  # what happened here?
Out[17]:
shape: (10_000_000, 4)
┌───────────┬───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         ┆ a_1       │
│ ---       ┆ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╪═══════════╡
│ 0.008919  ┆ 0.641422  ┆ 0.376592  ┆ 0.927814  │
│ 0.927814  ┆ -1.595148 ┆ -1.321503 ┆ -0.216802 │
│ -0.216802 ┆ 0.163541  ┆ -0.839665 ┆ 0.627737  │
│ 0.627737  ┆ 0.199048  ┆ -0.055465 ┆ 2.124896  │
│ 2.124896  ┆ -0.484083 ┆ -1.280158 ┆ -0.120967 │
│ …         ┆ …         ┆ …         ┆ …         │
│ -0.156736 ┆ -0.267561 ┆ -0.548102 ┆ -1.527322 │
│ -1.527322 ┆ -0.172108 ┆ -0.840629 ┆ 0.883683  │
│ 0.883683  ┆ -1.140378 ┆ -1.907993 ┆ -0.471029 │
│ -0.471029 ┆ -0.854808 ┆ -0.36428  ┆ 2.66472   │
│ 2.66472   ┆ 0.497843  ┆ -0.089645 ┆ 0.050035  │
└───────────┴───────────┴───────────┴───────────┘

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

Regarding the bug: the results are correct up until N = 8192 (2**13). From N = 8193 onwards, it breaks.

@cpcloud
Member

cpcloud commented Jun 19, 2024

There's no guaranteed ordering of output results without an explicit call to order_by.

If the 8193rd element differs, that's highly indicative that 8192 is some kind of buffer size for a unit of processing inside duckdb.
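For illustration, here's a minimal sketch (my addition, not part of the original report) of making the ordering explicit: carry a row index, order the lag window by it, and sort the output by it before materialising. It assumes the df_pl frame from the snippet above.

import ibis

ibis.set_backend('duckdb')

# carry an explicit row index so both the lag window and the output have a defined order
t = ibis.memtable(df_pl.with_row_index())
w = ibis.window(order_by='index')
t = t.mutate(
    ibis._[col].lag(-lag).over(w).name(f'{col}_{lag}')
    for col in ['a', 'b', 'c']
    for lag in [1, 2, 3, 4, 5]
).order_by('index')
result = t.to_polars().drop('index')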

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

OK thanks, I've updated the example to sort by index afterwards, so that the input order is preserved for the user at the end. But I'll keep the sorting out of the benchmark timing.

On Zulip you wrote

That last row doesn't look right. Can you open an issue about it with a fully reproducible example?

Is that still the case? Does this require investigation on your end, or is it just the kind of operation one might expect to be slow?

@cpcloud
Member

cpcloud commented Jun 19, 2024

It requires some investigation.

Can you show some concrete numbers for the 10,000x?

I see about 280x using this script:

script with `time`
import pandas as pd
import time
import polars as pl
import numpy as np
import ibis

rng = np.random.default_rng(1)
N = 10_000_000
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)
data = {'a': a, 'b': b, 'c': c}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)


def add_lags_pandas(df, cols, lags):
    return pd.concat(
        [
            df,
            *[
                df.loc[:, col].shift(-lag).rename(f'{col}_{lag}')
                for col in cols
                for lag in lags
            ],
        ],
        axis=1,
    )

def add_lags_polars(df, cols, lags):
    return df.with_columns(
        pl.col(col).shift(-lag).alias(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )


def add_lags_ibis_pandas(df, cols, lags):
    ibis.set_backend('pandas')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_pandas()

def add_lags_ibis_polars(df, cols, lags):
    # Polars backend does not support this operation
    ibis.set_backend('duckdb')
    t = ibis.memtable(df)
    t = t.mutate(
        ibis._[col].lag(-lag).name(f'{col}_{lag}')
        for col in cols
        for lag in lags
    )
    return t.to_polars()

start = time.time()
add_lags_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()

print(f"polars: {stop-start:.2f}s")

start = time.time()
add_lags_ibis_polars(df_pl, ['a', 'b', 'c'], [1,2,3,4,5])
stop = time.time()
print(f"ibis:   {stop-start:.2f}s")

@cpcloud
Member

cpcloud commented Jun 19, 2024

Also, the polars performance varies by almost a factor of 5 between runs; I'm not sure how to factor that in here!

@cpcloud
Member

cpcloud commented Jun 19, 2024

I suspect I am now hitting the OS page cache, and the first run wasn't :)

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

Can you show some concrete numbers for the 10,000x?

They're in the notebook https://www.kaggle.com/code/marcogorelli/narwhals-could-you-just-use-ibis-instead, which you can fork and run if you like

The timings come from running it on Kaggle; running it locally may differ.

There, I report on the minimum, maximum, and average times:

Polars native
min: 0.0007532539999980751
max: 0.044216634333338334
0.006996648095236782 +/- 0.005743185736385091
Polars via Narwhals
min: 0.0009312336666577418
max: 0.00101599099999324
0.0009707789999966041 +/- 9.109105595347205e-06
Polars dataframe, Ibis with duckdb backend
min: 13.30594426766667
max: 13.773238605666657
13.551140615047618 +/- 0.05954104203620545
pandas native
min: 1.535507855999981
max: 1.6537581420000151
1.5739719158095309 +/- 0.01732459334764224
pandas via Narwhals
min: 0.5989258073333303
max: 0.8326186953333187
0.6507151033333359 +/- 0.0285118691222816
pandas dataframe, Ibis pandas backend
min: 2.6033352826666487
max: 3.050078681666681
2.681404633904755 +/- 0.05706747288901051

The numbers in the table at the bottom are the minimums, as my understanding is that that's best practice when benchmarking (https://docs.python.org/3/library/timeit.html#module-timeit):

Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
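As a concrete sketch of that approach (my addition, assuming the data and functions from the snippet above are in scope), timeit.repeat returns the vector of timings and the minimum is the number that goes in the table:

import timeit

# run the variant a handful of times and keep the full vector of timings
times = timeit.repeat(
    "add_lags_polars(df_pl, ['a', 'b', 'c'], [1, 2, 3, 4, 5])",
    globals=globals(),
    number=1,
    repeat=5,
)
print(f'min: {min(times)}')  # the value reported in the table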

@cpcloud
Member

cpcloud commented Jun 19, 2024

In general when making these comparisons it would be really helpful if you could compare with the duckdb SQL API, so that we can rule out Ibis as the source of the problem here and funnel the report upstream.
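Something along these lines would do it (a rough sketch of mine, not a snippet from this thread): materialise the polars frame into a native duckdb table and compute the same lagged columns with LEAD window functions through the SQL API.

import duckdb
import numpy as np
import polars as pl

# same data as the original snippet
rng = np.random.default_rng(1)
N = 10_000_000
df_pl = pl.DataFrame({k: rng.normal(size=N) for k in ('a', 'b', 'c')})

con = duckdb.connect()
con.register('df_view', df_pl)                          # register the polars frame as a view
con.execute('CREATE TABLE t AS SELECT * FROM df_view')  # materialise into duckdb's native format

cols, lags = ['a', 'b', 'c'], [1, 2, 3, 4, 5]
projections = ', '.join(
    f'LEAD({col}, {lag}) OVER () AS {col}_{lag}' for col in cols for lag in lags
)
result = con.sql(f'SELECT *, {projections} FROM t').pl()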

@cpcloud
Member

cpcloud commented Jun 19, 2024

Fascinating results I'm seeing:

  1. polars is definitely the fastest here on this operation
  2. duckdb with raw sql is taking up most of the time, with Ibis + duckdb about 2x slower than duckdb alone

I gave duckdb every advantage here:

  1. not timing the connection
  2. creating a table in the native format from polars instead of using their magic view functionality.

I'm going to repurpose this issue for the ibis/duckdb discrepancy and open up an issue on the duckdb tracker for the lag performance.

That said, I think this is a weird benchmark; it's not really representative of any particular workload. Data are hardly ever normally distributed, randomly generated numbers.

@MarcoGorelli
Author

MarcoGorelli commented Jun 19, 2024

Thanks for your response 🙏

not really representative of any particular workload

The M5 forecasting competition was pretty much all lagged features; I'd say this is a very common operation in practical machine learning and forecasting: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/163684

Data are hardly ever normally distributed randomly generated numbers

It's the efficiency of the shift operation that makes the difference here, rather than the distribution of the input values.

As noted on Zulip, I posted this in order to answer a question posed to me by someone at Voltron after I'd presented Narwhals, and how scikit-lego is using it, at PyCon Italy: "why don't you just use Ibis?". I tried to answer that question by taking a scikit-lego function, generating some random data, and timing the relative overheads of Narwhals vs Ibis.

Personally, I think this answers the question. If you'd like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too 🤗

Or we can loop in the person who asked the question, if they meant something different by "why don't you just use Ibis?". The ball was thrown into my court; I'm just trying to respond 😄

@cpcloud cpcloud changed the title bug: duckdb backend 10,000x slower than Polars for lagged operations perf(duckdb): duckdb backend 10,000x slower than Polars for lagged operations Jun 19, 2024
@cpcloud cpcloud added the performance Issues related to ibis's performance label Jun 19, 2024
@cpcloud
Member

cpcloud commented Jun 19, 2024

The M5 forecasting competition was pretty much all lagged features, I'd say that this is a very common operation in practical machine learning and forecasting

Indeed, lagging itself is a common operation for sure!

Personally, I think this answers the question. If you like to suggest a better or different benchmark (which still addresses the question), I'd be more than happy to time that too

I still think framing this as an Ibis-vs-Narwhals comparison is misleading.

The performance difference observed has basically nothing to do with Ibis; it's concentrated in DuckDB. So this is about Polars and DuckDB, not Ibis and Narwhals.

@cpcloud
Member

cpcloud commented Jun 19, 2024

That said, it's up to you. I'd just ask that you link to this issue so folks get the full context.

@MarcoGorelli
Author

The purpose of Narwhals is to enable library maintainers to write functions such as:

  1. pandas/Polars/etc goes in
  2. something happens under the hood
  3. the user gets pandas/Polars/etc back

When I was asked "why not use Ibis", they presumably meant "why not use Ibis for step 2"? Would you agree?

If so, do you have a better suggestion for how to answer your colleague's question?

I'd just ask that you link to this issue so folks get the full context.

Sure 😇

@cpcloud
Member

cpcloud commented Jun 19, 2024

Presumably if narwhals had duckdb support then this question wouldn't be that interesting to answer, because you'd see basically the same thing you'd see with Ibis.

@cpcloud
Member

cpcloud commented Jun 19, 2024

The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"

@MarcoGorelli
Author

DuckDB is out-of-scope for Narwhals, happy to leave that to you!

The only not-misleading answer here IMO is "because Ibis doesn't yet support Polars' lag operation"

that, and that it makes pandas noticeably slower, so it would have an impact on the existing pandas user base 😉

I won't publish this publicly anyway (at least, not the duckdb part).

@cpcloud
Member

cpcloud commented Jun 19, 2024

that, and that it makes pandas noticeably slower

Yep, and we've already addressed that elsewhere.

@cpcloud
Member

cpcloud commented Jun 19, 2024

Created the DuckDB issue here.

@cpcloud cpcloud removed the bug Incorrect behavior inside of ibis label Jun 19, 2024
@MarcoGorelli
Author

Thanks! Like you, I'm a fan of DuckDB, so if the end result is that things improve in DuckDB, that's a win for everyone

so this is about Polars and DuckDB

I'll just keep this result in my back pocket, then, in case an "Ibis is faster than Polars"-style post gets published 😉

Thanks for engaging, have a nice day! 🌞

@cpcloud
Member

cpcloud commented Jun 21, 2024

Closing this out as it's being tracked in duckdb/duckdb#12600.

@cpcloud cpcloud closed this as completed Jun 21, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jun 21, 2024