[Python] Define a Dataset protocol based on Substrait and C Data Interface #37504
Comments
Is schema negotiation outside the scope of this protocol?
I think we can include that. I'd like to design that as part of the PyCapsule API first, so we match the semantics there.
Haven't had time to work on this, but wanted to note a current pain point for users of the dataset API: there are no table statistics the caller can access, and this leads to bad join orders. Some mentions of this here: https://twitter.com/mim_djo/status/1740542585410814393
Are we sure a blocking API like this would be palatable for existing execution engines such as Acero, DuckDB... ? Of course, at worst the various method/function calls can be offloaded to a dedicated thread pool.
Are you referring to the fact they would have to acquire the GIL to call these methods? Or something else? Ideally all these methods are brief. Though I haven't discussed this in depth with implementors of query engines. I'd be curious for their thoughts.
No, to the fact that these functions are synchronous.
I'm not sure.
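For illustration, offloading one of these synchronous calls to a worker thread could look like the sketch below; the `scanner` object and its `scan` method are hypothetical stand-ins for whatever the protocol ends up defining:

```python
import asyncio

async def scan_async(scanner, columns, filter_expr):
    # Run the blocking scanner call on the default thread-pool executor
    # so an async query engine's event loop is not stalled by it.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, lambda: scanner.scan(columns=columns, filter=filter_expr)
    )
```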
Describe the enhancement requested
Based on discussion in the 2023-08-30 Arrow community meeting. This is a continuation of #35568 and #33986.
We'd like to have a protocol for sharing unmaterialized datasets that:
- lets query engines push down column selections and filters to the scanner
- returns data and schemas via the Arrow C Data Interface
- can be implemented by any dataset producer, not just PyArrow
This would provide an extensible connection between scanners and query engines. Data formats might include Iceberg, Delta Lake, Lance, and PyArrow datasets (Parquet, JSON, CSV). Query engines could include DuckDB, DataFusion, Polars, PyVelox, PySpark, Ray, and Dask. Such a connection would let end-users employ their preferred query engine to load any supported dataset. From their perspective, usage might look like the sketch below.
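A hypothetical example: here duckdb's replacement scans pick the object up by variable name, and the `table` variable stands in for any object implementing the protocol:

```python
import duckdb
import pyarrow.dataset as ds

# A dataset too large to materialize up front; under the proposed
# protocol this could be any object exposing __arrow_scanner__().
table = ds.dataset("data/", format="parquet")

# duckdb finds the local variable by name and (under the protocol)
# would push the projection (y) and filter (x > 3) down to the scanner.
result = duckdb.sql('SELECT y FROM "table" WHERE x > 3').arrow()
```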
The protocol is largely invisible to the user. Behind the scenes, duckdb would call `__arrow_scanner__()` on `table` to get a scannable object. It would then pass down the column selection `['y']` and the filter `x > 3` to the scanner, and get the resulting data stream as input to the query.

Shape of the protocol
The overall shape would look roughly like:
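A minimal sketch of one possible shape (the method names are illustrative guesses, not the final API; only `__arrow_scanner__` comes from the example above, and the PyCapsule-style dunder follows the C Data Interface conventions mentioned in the comments):

```python
from typing import Any, Optional, Protocol


class Scanner(Protocol):
    def __arrow_c_schema__(self) -> Any:
        """Return the projected schema as an Arrow C Data Interface PyCapsule."""
        ...

    def scan(
        self,
        columns: Optional[list[str]] = None,
        filter: Optional[bytes] = None,  # serialized Substrait extended expression
    ) -> Any:
        """Return the selected data as an Arrow C stream PyCapsule."""
        ...


class Scannable(Protocol):
    def __arrow_scanner__(self) -> Scanner:
        """Return a scanner over this unmaterialized dataset."""
        ...
```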
Data and schema are returned as C Data Interface objects (see: #35531). Predicates are passed as Substrait extended expressions.
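For example, a consumer could build the filter payload with pyarrow's substrait module (available in recent pyarrow releases; treat the exact signature as an assumption):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.substrait as pas

schema = pa.schema([("x", pa.int64()), ("y", pa.string())])

# Serialize `x > 3` as a Substrait extended expression; the resulting
# buffer is what would be handed to the scanner as the filter.
filter_buf = pas.serialize_expressions([pc.field("x") > 3], ["filter"], schema)
```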
Component(s)
Python