Make chunk_size in get_as_arrow an optional keyword argument #2998

Closed · prrao87 opened this issue Mar 6, 2024 · 8 comments · Fixed by #3084
Labels: question (Further information is requested), usability (Issues related to better usability experience, including bad error messages)

Comments

prrao87 (Member) commented Mar 6, 2024

Currently, chunk_size is a required positional argument in Python, so a new user hits a runtime error before discovering this. Pandas has no such requirement (get_as_df takes no arguments). We could instead make chunk_size an optional keyword argument with a default of 10000.

In fact, this is what Polars has done as per this discussion. What do you think about this @ray6080? It won't break any existing functionality and 10000 seems like a reasonable default.
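
A minimal sketch of the proposed signature (FakeResult is a hypothetical stand-in for a Kùzu query result, and the default of 10000 is the value suggested here, not a shipped API):

```python
import pyarrow as pa

class FakeResult:
    """Stand-in for a query result, just to illustrate the proposed signature."""

    # Proposed: chunk_size becomes an optional keyword argument with a default
    def get_as_arrow(self, chunk_size: int = 10000) -> pa.Table:
        return pa.table({"x": [0, 1, 2]})  # placeholder data

res = FakeResult()
res.get_as_arrow()                 # works with no arguments, like get_as_df
res.get_as_arrow(chunk_size=2048)  # explicit override still possible
```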

prrao87 (Member, Author) commented Mar 6, 2024

Alternate thought: is the chunk_size argument truly necessary? This is worth thinking about, so that the Arrow, Pandas, and Polars methods are all equally simple from a user's perspective.

prrao87 added the question and usability labels on Mar 6, 2024
ray6080 (Contributor) commented Mar 7, 2024

Should've given that a better name. It is intended to set rows_per_record_batch, as we internally pull data into record batches, which then form the table. I think it is still reasonable to give users the flexibility to set the size of record batches, but I'm happy to hear more feedback on this. (At the very least we should provide a default value; that was a careless decision when I first worked on it 😅)

Just for reference, I took a look at the similar interface in DuckDB:

```
fetch_arrow_table(self: duckdb.duckdb.DuckDBPyConnection, rows_per_batch: int = 1000000) → pyarrow.lib.Table
```
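
A usage sketch of that DuckDB interface (rows_per_batch and its default come from the signature above):

```python
import duckdb

con = duckdb.connect()
# rows_per_batch caps how many rows go into each Arrow record batch;
# DuckDB assembles those batches into the returned pyarrow Table.
tbl = con.execute("SELECT * FROM range(5000000)").fetch_arrow_table(rows_per_batch=1000000)
print(tbl.num_rows)  # 5000000
```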

prrao87 (Member, Author) commented Mar 7, 2024

No worries, I think the getAsArrow parameter from C++ spilled over into Python too :). Renaming the parameter would introduce breaking changes, so do you think it makes sense to leave it as it is, but make the Python version a keyword argument with a default value of 1M just like DuckDB?

alexander-beedie (Contributor) commented Mar 8, 2024

> In fact, this is what Polars has done as per this discussion. What do you think about this @ray6080? It won't break any existing functionality and 10000 seems like a reasonable default.

FYI: I only did this because there is currently no option to get back a single chunk (which we usually prefer, otherwise we need to rechunk the returned data when n_chunks > 1).

Also, in #3009 (a PR I made a few minutes ago) I have made the Polars chunk size adaptive, targeting ~10m elements per chunk to improve the chances that we don't have to rechunk. As a general rule it's better to target the total number of elements being loaded rather than the number of rows: a result could have 1 column or 1,000 columns, which is three orders of magnitude of difference in how much you're loading (and allocating) if you only target n_rows without reference to n_cols ;)
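
A minimal sketch of that adaptive sizing idea (the ~10m-element target comes from the comment above; the helper name is hypothetical, not the code in #3009):

```python
def adaptive_chunk_size(n_cols: int, target_elements: int = 10_000_000) -> int:
    # Size each chunk by total elements (rows x cols), not rows alone,
    # so a 1,000-column result doesn't allocate 1,000x more memory than
    # a 1-column result for the same nominal chunk_size.
    return max(1, target_elements // max(1, n_cols))

adaptive_chunk_size(1)     # 10_000_000 rows per chunk
adaptive_chunk_size(1000)  # 10_000 rows per chunk
```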

alexander-beedie (Contributor) commented Mar 8, 2024

I'd suggest a default of None, triggering an adaptive chunk_size (similar to the PR referenced above, but perhaps targeting a slightly less aggressive 1m elements?), and additionally allowing -1 for no chunking (which I'd immediately switch to for the Polars frame export). Any other positive integer would keep the current behaviour, so it would be a zero-breakage change; see the sketch after this list.

  • chunk_size = None (adaptive)
  • chunk_size = <n> (current behaviour)
  • chunk_size = -1 (no chunking, e.g. return a single chunk)
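
A sketch of how that dispatch might look (the resolver function and its arguments are assumptions for illustration, not the shipped API):

```python
from typing import Optional

TARGET_ELEMENTS = 1_000_000  # the "less aggressive" target floated above

def resolve_chunk_size(chunk_size: Optional[int], n_rows: int, n_cols: int) -> int:
    if chunk_size is None:   # adaptive: size by total elements
        return max(1, TARGET_ELEMENTS // max(1, n_cols))
    if chunk_size == -1:     # no chunking: everything in a single batch
        return max(1, n_rows)
    return chunk_size        # any positive int: current behaviour
```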

prrao87 (Member, Author) commented Mar 8, 2024

Looks great overall! The adaptive chunk size makes a lot of sense from a Polars and Python perspective, but the underlying get_as_arrow method would also need to serve other use cases, like a future Arrow-only backend for Pandas, so the default chunk size of 1M (as DuckDB does) makes sense for now, I think? The C++ API requires this parameter, and it makes sense to keep the APIs for Arrow conversion as consistent as possible.

@ray6080 and @mewim will have more thoughts on this too, so I'll leave it to them.

alexander-beedie (Contributor) commented Mar 8, 2024

> The C++ API requires this parameter, and it makes sense to keep the APIs for Arrow conversion as consistent as possible.

Yup; the parameter would stay the same, you'd just use the number of columns (which I believe would always be known when this is called) to dynamically influence the value you pass to it (if not already explicitly set).

I've long advocated against APIs that target n_rows, ever since seeing a real-life case back in my old job where the number of columns returned could vary between one and several thousand while the chunk size (given as n_rows) stayed the same; people were surprised when some result sets crashed because sufficient memory could not be allocated :))

mewim (Member) commented Mar 10, 2024

Yeah, I think targeting the number of values makes sense; we should have a way for the user to pass that in, and also a way of returning everything as a single chunk.

prrao87 added a commit that referenced this issue Mar 21, 2024
* Fix #2998: Arrow chunk_size as keyword argument

* Adaptive chunk size logic for get_as_arrow

* Run formatter

* Fix missing kwarg

* Fix chunk sizes for arrow tests

* Provide polars users the means to customize chunk_size

* test small chunk_size for polars and arrow

* Revert to small test chunk sizes

* Rework chunk_size defaults and conditions

* Fix conditional logic

* Cover 0, -1, None and fixed int params for chunk_size in tests