bug: investigate discrepancies between Polars Python native code vs Ibis w/ the Polars backend #8050
Comments
Ibis converts its expressions into code the backend can execute: generally SQL, but for Polars, native Polars Python code. The overhead of this conversion is minimal; Ibis is likely generating poorly performing Polars code somewhere. Thanks for sharing your examples! I recently did some TPC-H benchmarking and didn't notice much difference between Ibis with the Polars backend and native Polars code, but I did notice a massive increase for the one billion row challenge code here: #8004. So I suspect this is specific to Polars, and we need to investigate.
TL;DR: the Polars queries appear faster because the Polars versions are running against in-memory data, while the Ibis queries are running against a Parquet file. The code in these notebooks is not close enough to equivalent, across multiple dimensions, to make meaningful statements about performance. Let's look at query 1. Ibis code:
var_1 = datetime(1998, 9, 2)
q = line_item_ds
q = q.mutate(
    disc_price=q["l_extendedprice"] * (1 - q["l_discount"]),
    charge=q["l_extendedprice"] * (1 - q["l_discount"]) * (1 + q["l_tax"])
).cache()
q_final = (
    q.filter(q["l_shipdate"] <= var_1)
    .group_by(["l_returnflag", "l_linestatus"])
    .agg(
        [
            q["l_quantity"].sum().name("sum_qty"),
            q["l_extendedprice"].sum().name("sum_base_price"),
            q["disc_price"].sum().name("sum_disc_price"),
            q["charge"].sum().name("sum_charge"),
            q["l_quantity"].mean().name("avg_qty"),
            q["l_extendedprice"].mean().name("avg_price"),
            q["l_discount"].mean().name("avg_disc"),
            q.count().name("count_order")
        ]
    )
    .order_by(["l_returnflag", "l_linestatus"])
)
q_final.execute().head()
All of these After doing the following I get similar performance for query 1:
Here's the corresponding Ibis code for query 1:
var_1 = datetime(1998, 9, 2)
q = line_item_ds
q_final = (
    q.filter(q["l_shipdate"] <= var_1)
    .group_by(["l_returnflag", "l_linestatus"])
    .agg(
        [
            q["l_quantity"].sum().name("sum_qty"),
            q["l_extendedprice"].sum().name("sum_base_price"),
            (q["l_extendedprice"] * (1 - q["l_discount"])).sum().name("sum_disc_price"),
            (q["l_extendedprice"] * (1 - q["l_discount"]) * (1 + q["l_tax"])).sum().name("sum_charge"),
            q["l_quantity"].mean().name("avg_qty"),
            q["l_extendedprice"].mean().name("avg_price"),
            q["l_discount"].mean().name("avg_disc"),
            q.count().name("count_order")
        ]
    )
    .order_by(["l_returnflag", "l_linestatus"])
)
q_final.execute().head()
In general, you can see the difference between the Polars code produced by Ibis and the handwritten version by comparing the results of calling
print(ibis_expr.compile().explain())
print(polars_expr.explain())
We can now compare plans.
Polars:
Ibis:
It looks like Ibis is generating a bit of unnecessary code (the aggregation
@kunishou What's going on with that initial scan? Are these even running against the same type of scan? It doesn't look like it. The Ibis code is running against a Parquet file, while the Polars code is running against in-memory data. We're happy to look into this more if you can make these queries more comparable. The first and likely most impactful step would be to adjust
def _scan_ds(path: str):
    path = f"{path}.{FILE_TYPE}"
    if FILE_TYPE == "parquet":
        scan = pl.scan_parquet(path)
    elif FILE_TYPE == "feather":
        scan = pl.scan_ipc(path)
    else:
        raise ValueError(f"file type: {FILE_TYPE} not expected")
    if INCLUDE_IO:
        return scan
    return scan.collect().rechunk().lazy()
to return
I just ran these all after converting Some queries with Ibis were faster than with native Polars, though it's probably not anything systematic there.
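One way to check whether such differences are systematic rather than noise is to time each query several times and compare medians. The helper below is a sketch using only the standard library; `time_query` is a hypothetical name, and the lambda is a placeholder for something like `lambda: q_final.execute()`.

```python
import statistics
import time
from typing import Callable


def time_query(run: Callable[[], object], repeats: int = 5) -> dict:
    """Run a zero-argument callable several times and summarize wall time."""
    run()  # warm-up run, so one-off setup cost is not counted
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Placeholder workload standing in for a real query.
stats = time_query(lambda: sum(range(100_000)))
print(stats)
```

If the min-to-max spread of one variant overlaps the other's, a single-run difference is probably not meaningful.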
@lostmygithubaccount
Thanks for bringing this to our attention! It looks like there's something to look at for query 2!
@cpcloud Thank you very much!
@kunishou Awesome, great to hear! Let us know how we can help.
edit from @lostmygithubaccount: we'll re-purpose this issue to investigate the q2 performance issue noticed below and the "one billion row challenge" performance issue noticed w/ the Polars backend. I may or may not investigate myself; otherwise we should dig into why these queries are slower on Ibis.
What happened?
Hello.
I recently started using Ibis. I'm interested in whether there is a significant difference in processing speed between using a backend directly and using it through Ibis. To investigate this, I rewrote the Polars queries of the pola-rs/tpch benchmark for Ibis, set the backend to Polars, and executed six queries. As a result, processing with Ibis-Polars was significantly slower than with native Polars. Could this be due to the process of converting the Ibis API for use with Polars? If there is any mistake in how I'm using Ibis, please point it out.
What version of ibis are you using?
7.2.0
What backend(s) are you using, if any?
Polars
Relevant log output
https://colab.research.google.com/drive/1JCKJtDy2jOkRQEbbW_sb6MgEr2vqiyd5?usp=sharing
https://colab.research.google.com/drive/1RohMogghA7xx4GDwWn73qBO1w6frDM7y?usp=sharing