feat: Support for Arrow PyCapsule interface #23
Conversation
# arrow_table = pa.RecordBatchReader.from_stream(data)
arrow_table = pa.table(data)
This is not ideal here because `pa.table` materializes the entire stream in memory at once. I tried with the `RecordBatchReader`, but I think you might need to adjust your queries to actually create a table instead of querying the stream directly. I assume you do multiple queries on the input... and presumably the first query exhausts the input stream, so the second query has no data.

If I uncomment the `RecordBatchReader` line above instead, the widget shows 11k rows in the bottom right but then doesn't display anything. So I think you need to persist the Arrow stream input manually.
Yeah, I figured this is not ideal... I just wasn't sure how to wire up duckdb to query something consistently. I'm guessing something like `CREATE VIEW` from the input stream might allow us to register some tables that duckdb can continuously query.
I'm not 100% certain, but I don't think you can always use `CREATE VIEW` here for Arrow streams. Reading https://duckdb.org/docs/api/python/data_ingestion#directly-accessing-dataframes-and-arrow-objects makes me think that the first time you query the view, it will exhaust the input stream.

I think we're running into the same issue as in ibis: ibis-project/ibis#9663 (comment). In particular, the presence of the dunder method doesn't tell you whether the input data is already materialized or a stream. If you know the input data is an in-memory table, then you can call `__arrow_c_stream__` multiple times and get the full data each time. In that case you'd prefer to make a DuckDB view, because then you don't need to copy the input data. But if the input is a stream, then you need to persist it into a full duckdb table for ongoing queries.
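A small sketch of that distinction, using nothing beyond pyarrow (the three-row table is purely illustrative): an in-memory table can be consumed repeatedly, while a `RecordBatchReader` stream is drained by its first consumer.

```python
import pyarrow as pa

# An in-memory table hands out its full data every time, so a DuckDB view
# over it would be safe.
table = pa.table({"a": [1, 2, 3]})
print(pa.table(table).num_rows)  # 3
print(pa.table(table).num_rows)  # 3 again; nothing was consumed

# A stream is drained by its first consumer, so a second query over a view
# would see no rows. Repeated queries need the data persisted somewhere.
reader = pa.RecordBatchReader.from_stream(table)
first = reader.read_all()
second = reader.read_all()
print(first.num_rows, second.num_rows)  # 3, then 0: the stream is exhausted
```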
@@ -37,7 +39,10 @@ def __init__(self, data, *, table: str = "df"):
        conn = data
    else:
        conn = duckdb.connect(":memory:")
Do you know if the DuckDB `:memory:` DB will spill to disk? Or for very large input would it just crash the process?
> Do you know if the DuckDB `:memory:` DB will spill to disk? Or for very large input would it just crash the process?
I believe it spills to disk
I thought I was talking to someone recently who asserted that `:memory:` in particular didn't spill to disk, because it doesn't have a path to store a local database. But I'm not sure.
Hi, just stopping by: in-memory databases will spill to disk. DuckDB will create/use a `.tmp` folder in the current working directory; a database file is not required for spilling.

You can test this out by setting a relatively low memory limit and creating a temp table that exceeds that limit. Running `SELECT * FROM duckdb_temporary_files()` will show you which temporary files were created to back the temporary table.
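A quick sketch of that test in the Python client (the memory limit and row count are arbitrary, and it assumes the default temp directory):

```python
import duckdb

con = duckdb.connect(":memory:")
# Cap memory well below the size of the data so DuckDB is forced to spill.
con.sql("SET memory_limit = '100MB'")

# A temp table larger than the limit; range() generates the rows.
con.sql("""
    CREATE TEMP TABLE big AS
    SELECT range AS i, repeat('x', 200) AS payload
    FROM range(5000000)
""")

# Lists the temporary files (under .tmp/) that back the spilled data.
print(con.sql("SELECT * FROM duckdb_temporary_files()"))
```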
src/quak/__init__.py
else:
    raise ValueError()
These branches aren't mutually exclusive (we grab Arrow from the duckdb connection so that it is captured in the next block), and we don't want to raise a `ValueError` for displaying stuff that can't be visualized with quak. Right now, `"hello world"` would raise a `ValueError` since it doesn't match any of the conditions.
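For illustration only, a hypothetical sketch of the behavior being asked for (the helper name is invented; it is not quak's actual code): check the Arrow-compatible cases and fall back to the plain repr instead of raising.

```python
def try_widget(data):
    """Hypothetical helper: visualize Arrow-compatible inputs, otherwise
    fall back to the plain repr instead of raising a ValueError."""
    if hasattr(data, "__arrow_c_stream__") or hasattr(data, "__arrow_c_array__"):
        return f"<quak widget over {type(data).__name__}>"  # placeholder
    # e.g. "hello world" lands here; no ValueError.
    return repr(data)

print(try_widget("hello world"))  # 'hello world'
```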
> we don't want to raise a `ValueError` for displaying stuff that can't be visualized with quak

Oh I see; this is probably the error I was hitting when I tried to `repr` something else (why I tried to unload quak).
> we grab arrow from the duckdb

As a general note, why do you convert it to Arrow if you're passing it back to duckdb? I think duckdb can do replacement scans on `DuckDBPyRelation` objects.
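A minimal sketch of that idea, assuming replacement scans pick up local `DuckDBPyRelation` variables by name (no Arrow conversion in between):

```python
import duckdb

con = duckdb.connect(":memory:")
rel = con.sql("SELECT range AS i FROM range(5)")  # a DuckDBPyRelation

# The later query references `rel` by its Python variable name; DuckDB's
# replacement scan resolves it without converting the relation to Arrow.
print(con.sql("SELECT sum(i) AS total FROM rel").fetchall())  # [(10,)]
```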
I made a few changes and now I think this should be ready for review, as long as we're ok materializing the full Arrow input stream. I think it may be better not to materialize the full Arrow input (especially because DuckDB is capable of larger-than-memory datasets), but that would require creating a DuckDB table from the data input.
I agree. Do you think we could change the behavior in a follow-up PR, or is that something we want to handle here? The main thing we need to do is give a […]. We could create a duckdb pyrelation from the arrow stream; for example, this API is already supported:

    conn = duckdb.connect(":memory:")
    conn.sql("CREATE VIEW df AS SELECT * FROM ???")  # consume arrow c stream
    widget = quak.Widget(conn, table="df")
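One way the `???` placeholder could be filled (a sketch, not what was merged): register a pyarrow `RecordBatchReader` with the connection and persist it into a table, so repeated queries don't hit an exhausted stream. The names `arrow_stream` and `df` are illustrative.

```python
import duckdb
import pyarrow as pa

def load_stream(conn: duckdb.DuckDBPyConnection, data) -> None:
    # Wrap whatever implements __arrow_c_stream__ in a RecordBatchReader...
    reader = pa.RecordBatchReader.from_stream(data)
    # ...and persist it: CREATE TABLE consumes the stream exactly once, and
    # the resulting DuckDB-managed table can be queried repeatedly (and can
    # spill to disk for larger-than-memory inputs).
    conn.register("arrow_stream", reader)
    conn.sql("CREATE TABLE df AS SELECT * FROM arrow_stream")
    # afterwards: widget = quak.Widget(conn, table="df")
```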
I think it's fine to do in a follow-up PR; up to you.
Ok, let's just merge for now!
The Arrow PyCapsule Interface defines a way for Python libraries to exchange Arrow data at the binary level without needing to know the library-specific APIs of the producer or consumer. If an object exposes the `__arrow_c_stream__` dunder method, you can pass it into another Arrow library's constructor, which will call that method to get a PyCapsule object wrapping a pointer to an Arrow C stream. I've been working to promote its adoption throughout the Python Arrow ecosystem.
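To make the protocol concrete, here is a minimal producer sketch (the `MyData` class is invented for illustration; it just delegates to pyarrow):

```python
import pyarrow as pa

class MyData:
    """Toy producer: wraps a pyarrow Table and re-exports its Arrow C stream."""

    def __init__(self, table: pa.Table):
        self._table = table

    def __arrow_c_stream__(self, requested_schema=None):
        # Returns a PyCapsule wrapping an ArrowArrayStream pointer.
        return self._table.__arrow_c_stream__(requested_schema)

# Any Arrow-aware consumer can now ingest MyData without knowing its API:
print(pa.table(MyData(pa.table({"a": [1, 2, 3]}))))
```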
Because the PyCapsule Interface makes it free to share data across libraries, I've also been working on arro3, a minimal alternative to pyarrow.
With this PR, any object that implements this interface will just work with quak:
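For example (a sketch; it assumes a Polars version that exports `__arrow_c_stream__`, per the note below):

```python
import polars as pl
import quak

# A Polars DataFrame exposes __arrow_c_stream__, so quak can ingest it
# directly, with no explicit pyarrow conversion on the caller's side.
df = pl.DataFrame({"species": ["a", "b", "a"], "mass": [1.2, 3.4, 5.6]})
widget = quak.Widget(df)
widget  # display in a Jupyter notebook
```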
Since I just had my PyCapsule interface PRs merged in Polars, with the next release of Python Polars this will just work (without going through the DataFrame API; if Polars, DuckDB, and quak all speak Arrow, why go through the DataFrame API?).
DuckDB hasn't implemented the interface itself, so for now I believe you have to go through pyarrow.
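A sketch of what that bridging looks like for now (assuming a DuckDB version without native PyCapsule support):

```python
import duckdb

con = duckdb.connect(":memory:")
rel = con.sql("SELECT range AS i FROM range(10)")

# DuckDB results don't expose __arrow_c_stream__ themselves yet, so go
# through pyarrow: .arrow() returns a pyarrow.Table, which does implement it.
arrow_table = rel.arrow()
print(arrow_table.__arrow_c_stream__())  # PyCapsule wrapping an ArrowArrayStream
```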