feat: Support for Arrow PyCapsule interface #23
Conversation
# arrow_table = pa.RecordBatchReader.from_stream(data)
arrow_table = pa.table(data)
This is not ideal here because `pa.table` materializes the entire stream in memory at once. I tried with the `RecordBatchReader`, but I think you might need to adjust your queries to actually create a table instead of querying the stream directly. I assume you do multiple queries on the input... and presumably the first query exhausts the input stream, so the second query has no data.

If I uncomment the `RecordBatchReader` line above instead, the widget shows 11k rows in the bottom right but then doesn't display anything. So I think you need to persist the Arrow stream input manually.
Yeah, I figured this is not ideal... I just wasn't sure how to wire up duckdb to query something consistently. I'm guessing something like `CREATE VIEW` from the input stream might allow us to register some tables that duckdb can continuously query.
I'm not 100% certain, but I don't think you can always use `CREATE VIEW` here for Arrow streams. Reading https://duckdb.org/docs/api/python/data_ingestion#directly-accessing-dataframes-and-arrow-objects makes me think that the first time you query the view, it will exhaust the input stream.

I think we're running into the same issue as in ibis: ibis-project/ibis#9663 (comment). In particular, the presence of the dunder method doesn't tell you whether the input data is already materialized or a stream. If you know the input data is an in-memory table, then you can call `__arrow_c_stream__` multiple times and get the full data each time. In that case you'd prefer to make a DuckDB view, because then you don't need to copy the input data. But if the input is a stream, then you need to persist it into a full duckdb table for ongoing queries.
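A small sketch of that distinction, using nothing beyond pyarrow (the three-row table is purely illustrative): an in-memory table can be consumed repeatedly, while a `RecordBatchReader` stream is drained by its first consumer.

```python
import pyarrow as pa

# An in-memory table hands out its full data every time, so a DuckDB view
# over it would be safe.
table = pa.table({"a": [1, 2, 3]})
print(pa.table(table).num_rows)  # 3
print(pa.table(table).num_rows)  # 3 again; nothing was consumed

# A stream is drained by its first consumer, so a second query over a view
# would see no rows. Repeated queries need the data persisted somewhere.
reader = pa.RecordBatchReader.from_stream(table)
first = reader.read_all()
second = reader.read_all()
print(first.num_rows, second.num_rows)  # 3, then 0: the stream is exhausted
```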
@@ -37,7 +39,10 @@ def __init__(self, data, *, table: str = "df"):
        conn = data
    else:
        conn = duckdb.connect(":memory:")
Do you know if the DuckDB `:memory:` DB will spill to disk? Or for very large input would it just crash the process?
> Do you know if the DuckDB `:memory:` DB will spill to disk? Or for very large input would it just crash the process?
I believe it spills to disk
I thought I was talking to someone recently who asserted that `:memory:` in particular didn't spill to disk, because it doesn't have a path to store a local database. But I'm not sure.
Hi, just stopping by: in-memory databases will spill to disk. DuckDB will create/use a `.tmp` folder in the current working directory; a database file is not required for spilling.

You can test this out by setting a relatively low memory limit and creating a temp table that exceeds that limit. Running `SELECT * FROM duckdb_temporary_files()` will show you which temporary files were created to back the temporary table.
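A quick sketch of that test in the Python client (the memory limit and row count are arbitrary, and it assumes the default temp directory):

```python
import duckdb

con = duckdb.connect(":memory:")
# Cap memory well below the size of the data so DuckDB is forced to spill.
con.sql("SET memory_limit = '100MB'")

# A temp table larger than the limit; range() generates the rows.
con.sql("""
    CREATE TEMP TABLE big AS
    SELECT range AS i, repeat('x', 200) AS payload
    FROM range(5000000)
""")

# Lists the temporary files (under .tmp/) that back the spilled data.
print(con.sql("SELECT * FROM duckdb_temporary_files()"))
```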
src/quak/__init__.py
else:
    raise ValueError()
These branches aren't mutually exclusive (we grab Arrow from the duckdb connection so that it is captured in the next block), and we don't want to raise a `ValueError` for displaying stuff that can't be visualized with quak. Right now, `"hello world"` would raise a `ValueError` since it doesn't match any of the conditions.
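For illustration only, a hypothetical sketch of the behavior being asked for (the helper name is invented; it is not quak's actual code): check the Arrow-compatible cases and fall back to the plain repr instead of raising.

```python
def try_widget(data):
    """Hypothetical helper: visualize Arrow-compatible inputs, otherwise
    fall back to the plain repr instead of raising a ValueError."""
    if hasattr(data, "__arrow_c_stream__") or hasattr(data, "__arrow_c_array__"):
        return f"<quak widget over {type(data).__name__}>"  # placeholder
    # e.g. "hello world" lands here; no ValueError.
    return repr(data)

print(try_widget("hello world"))  # 'hello world'
```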
> we don't want to raise a `ValueError` for displaying stuff that can't be visualized with quak

Oh I see; this is probably the error I was hitting when I tried to `repr` something else (why I tried to unload quak).
> we grab arrow from the duckdb

As a general note, why do you convert it to Arrow if you're passing it back to duckdb? I think duckdb can do replacement scans on `DuckDBPyRelation` objects.
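A minimal sketch of that idea, assuming replacement scans pick up local `DuckDBPyRelation` variables by name (no Arrow conversion in between):

```python
import duckdb

con = duckdb.connect(":memory:")
rel = con.sql("SELECT range AS i FROM range(5)")  # a DuckDBPyRelation

# The later query references `rel` by its Python variable name; DuckDB's
# replacement scan resolves it without converting the relation to Arrow.
print(con.sql("SELECT sum(i) AS total FROM rel").fetchall())  # [(10,)]
```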
I made a few changes and now I think this should be ready for review, as long as we're ok materializing the full Arrow input stream. I think it may be better not to materialize the full Arrow input (especially because DuckDB is capable of larger-than-memory datasets), but that would require creating a DuckDB table from the data input.
I agree. Do you think we could change the behavior in a follow-up PR, or is that something we want to handle here? The main thing we need to do is give a […]. We could create a duckdb pyrelation from the arrow stream; for example, this API is already supported:

    conn = duckdb.connect(":memory:")
    conn.sql("CREATE VIEW df AS SELECT * FROM ???")  # consume arrow c stream
    widget = quak.Widget(conn, table="df")
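One way the `???` placeholder could be filled (a sketch, not what was merged): register a pyarrow `RecordBatchReader` with the connection and persist it into a table, so repeated queries don't hit an exhausted stream. The names `arrow_stream` and `df` are illustrative.

```python
import duckdb
import pyarrow as pa

def load_stream(conn: duckdb.DuckDBPyConnection, data) -> None:
    # Wrap whatever implements __arrow_c_stream__ in a RecordBatchReader...
    reader = pa.RecordBatchReader.from_stream(data)
    # ...and persist it: CREATE TABLE consumes the stream exactly once, and
    # the resulting DuckDB-managed table can be queried repeatedly (and can
    # spill to disk for larger-than-memory inputs).
    conn.register("arrow_stream", reader)
    conn.sql("CREATE TABLE df AS SELECT * FROM arrow_stream")
    # afterwards: widget = quak.Widget(conn, table="df")
```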
I think it's fine to do in a follow-up PR; up to you.
Ok, let's just merge for now!
The Arrow PyCapsule Interface defines a way for Python libraries to exchange Arrow data at the binary level without needing to know the library-specific APIs of the producer or consumer. If an object exposes the `__arrow_c_stream__` dunder method, you can pass it into another Arrow library's constructor, which will call that method to get a PyCapsule object wrapping a pointer to an Arrow C stream. I've been working to promote its adoption throughout the Python Arrow ecosystem.
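To make the protocol concrete, here is a minimal producer sketch (the `MyData` class is invented for illustration; it just delegates to pyarrow):

```python
import pyarrow as pa

class MyData:
    """Toy producer: wraps a pyarrow Table and re-exports its Arrow C stream."""

    def __init__(self, table: pa.Table):
        self._table = table

    def __arrow_c_stream__(self, requested_schema=None):
        # Returns a PyCapsule wrapping an ArrowArrayStream pointer.
        return self._table.__arrow_c_stream__(requested_schema)

# Any Arrow-aware consumer can now ingest MyData without knowing its API:
print(pa.table(MyData(pa.table({"a": [1, 2, 3]}))))
```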
Because the PyCapsule Interface makes it free to share data across libraries, I've also been working on arro3, a minimal alternative to pyarrow.
With this PR, any object that implements this interface will just work with quak:
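For example (a sketch; it assumes a Polars version that exports `__arrow_c_stream__`, per the note below):

```python
import polars as pl
import quak

# A Polars DataFrame exposes __arrow_c_stream__, so quak can ingest it
# directly, with no explicit pyarrow conversion on the caller's side.
df = pl.DataFrame({"species": ["a", "b", "a"], "mass": [1.2, 3.4, 5.6]})
widget = quak.Widget(df)
widget  # display in a Jupyter notebook
```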
Since I just had my PyCapsule interface PRs merged in Polars, with the next release of Python Polars this will just work (without going through the DataFrame API; if Polars, DuckDB, and quak all speak Arrow, why go through the DataFrame API?).
DuckDB hasn't implemented the interface itself, so for now I believe you have to go through pyarrow.
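A sketch of what that bridging looks like for now (assuming a DuckDB version without native PyCapsule support):

```python
import duckdb

con = duckdb.connect(":memory:")
rel = con.sql("SELECT range AS i FROM range(10)")

# DuckDB results don't expose __arrow_c_stream__ themselves yet, so go
# through pyarrow: .arrow() returns a pyarrow.Table, which does implement it.
arrow_table = rel.arrow()
print(arrow_table.__arrow_c_stream__())  # PyCapsule wrapping an ArrowArrayStream
```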