Make chunk_size in get_as_arrow an optional keyword argument #2998
Alternate thought: is the
Should've given that a better name. It is intended to set

Just for reference, I took a look at the similar interface in DuckDB: `fetch_arrow_table(self: duckdb.duckdb.DuckDBPyConnection, rows_per_batch: int = 1000000) -> pyarrow.lib.Table`
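For comparison, a minimal sketch of how DuckDB's `rows_per_batch` is used from Python; the query here is made up for illustration:

```python
import duckdb

con = duckdb.connect()
con.execute("SELECT * FROM range(10) AS t(i)")  # illustrative query

# rows_per_batch controls how many rows go into each Arrow record batch;
# the result is still returned as a single pyarrow.Table.
table = con.fetch_arrow_table(rows_per_batch=1_000_000)
print(table.num_rows)
```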
No worries, I think the
FYI: I only did this because there is currently no option to get back a single chunk (which we usually prefer, otherwise we need to rechunk the returned data when n_chunks > 1). Also, in #3009 (a PR I made a few minutes ago) I have actually made the Polars chunk size adaptive, targeting ~10m elements per chunk to improve the chances that we don't have to rechunk. As a general rule it's a better idea to target the total number of elements being loaded rather than the number of rows, because you could have 1 column or 1000 columns; that would be a three orders of magnitude difference in how much you're loading (and allocating) if you only target n_rows without reference to n_cols ;)
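As a rough illustration of that element-count heuristic (not the actual #3009 implementation; the target constant and helper name are assumptions):

```python
TARGET_ELEMENTS = 10_000_000  # assumed target of ~10m values per chunk

def adaptive_chunk_size(n_cols: int, n_rows: int) -> int:
    """Pick a rows-per-chunk value so that rows * cols stays near the target.

    A 1-column result can take ~10m rows per chunk, while a 1000-column
    result is capped at ~10k rows per chunk.
    """
    if n_cols <= 0:
        return max(1, n_rows)
    rows_per_chunk = max(1, TARGET_ELEMENTS // n_cols)
    # Never ask for more rows than the result actually has.
    return min(rows_per_chunk, n_rows) if n_rows else rows_per_chunk
```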
I'd suggest a default of
Looks great overall! The adaptive chunk size makes a lot of sense from a Polars and Python perspective, but the underlying
@ray6080 and @mewim will have more thoughts on this too, so will leave it to them.
Yup; the parameter would stay the same, you'd just use the number of columns (which I believe would always be known when this is called) to dynamically influence the value you pass to it (if not already explicitly set). I've long advocated against APIs that target n_rows, ever since seeing a real-life case back in my old job where the number of columns returned could vary between one and several thousand, but the chunk size (given as n_rows) remained the same - and people were surprised when some result sets would crash because sufficient memory could not be allocated :))
Yeah, I think targeting the number of values makes sense; we should have a way for the user to pass that in, and also a way of returning everything as a single chunk.
* Fix #2998: Arrow chunk_size as keyword argument
* Adaptive chunk size logic for get_as_arrow
* Run formatter
* Fix missing kwarg
* Fix chunk sizes for arrow tests
* Provide polars users the means to customize chunk_size
* test small chunk_size for polars and arrow
* Revert to small test chunk sizes
* Rework chunk_size defaults and conditions
* Fix conditional logic
* Cover 0, -1, None and fixed int params for chunk_size in tests
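The commit list suggests chunk_size ends up accepting 0, -1, None, or a fixed integer. A rough sketch of how such values could be resolved; the exact mapping below is an assumption for illustration, not the merged implementation:

```python
from typing import Optional

def resolve_chunk_size(chunk_size: Optional[int], n_rows: int, n_cols: int) -> int:
    # Assumed mapping:
    #   None         -> adaptive size targeting ~10m elements per chunk
    #   0 or -1      -> everything in a single chunk
    #   positive int -> use the caller's value as-is
    if chunk_size is None:
        return max(1, 10_000_000 // max(1, n_cols))
    if chunk_size <= 0:
        return max(1, n_rows)
    return chunk_size
```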
Currently, `chunk_size` is a positional argument in Python. This requires a new user to experience a runtime error before realizing this fact. Pandas has no such requirement (`get_as_df` has no arguments). We could instead make `chunk_size` an optional kwarg and set it to a default of 10000. In fact, this is what Polars has done as per this discussion. What do you think about this @ray6080? It won't break any existing functionality, and 10000 seems like a reasonable default.
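For concreteness, a minimal sketch of what the proposed signature could look like; the surrounding class name and return type are assumptions based on this thread, and the body is omitted:

```python
import pyarrow as pa

class QueryResult:
    def get_as_arrow(self, chunk_size: int = 10000) -> pa.Table:
        """Return the query result as a pyarrow.Table.

        chunk_size is an optional keyword argument with a default of 10000,
        so callers no longer hit a runtime error when they omit it:

            result.get_as_arrow()                 # uses the default
            result.get_as_arrow(chunk_size=500)   # explicit override
        """
        ...
```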