Make chunk_size in get_as_arrow an optional keyword argument #2998
Alternate thought: is the
Should've given that a better name. It is intended to set

Just for reference, I took a look at the similar interface in DuckDB: `fetch_arrow_table(self: duckdb.duckdb.DuckDBPyConnection, rows_per_batch: int = 1000000) -> pyarrow.lib.Table`
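For comparison, a minimal sketch of how DuckDB's `rows_per_batch` is used from Python; the query here is made up for illustration:

```python
import duckdb

con = duckdb.connect()
con.execute("SELECT * FROM range(10) AS t(i)")  # illustrative query

# rows_per_batch controls how many rows go into each Arrow record batch;
# the result is still returned as a single pyarrow.Table.
table = con.fetch_arrow_table(rows_per_batch=1_000_000)
print(table.num_rows)
```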
No worries, I think the
FYI: I only did this because there is currently no option to get back a single chunk (which we usually prefer, otherwise we need to rechunk the returned data when n_chunks > 1). Also, in #3009 (a PR I made a few minutes ago) I have actually made the Polars chunk size adaptive, targeting ~10m elements per chunk to improve the chances that we don't have to rechunk. As a general rule it's a better idea to target the total number of elements being loaded rather than the number of rows, because you could have 1 column or 1000 columns; that would be a three orders of magnitude difference in how much you're loading (and allocating) if you only target n_rows without reference to n_cols ;)
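As a rough illustration of that element-count heuristic (not the actual #3009 implementation; the target constant and helper name are assumptions):

```python
TARGET_ELEMENTS = 10_000_000  # assumed target of ~10m values per chunk

def adaptive_chunk_size(n_cols: int, n_rows: int) -> int:
    """Pick a rows-per-chunk value so that rows * cols stays near the target.

    A 1-column result can take ~10m rows per chunk, while a 1000-column
    result is capped at ~10k rows per chunk.
    """
    if n_cols <= 0:
        return max(1, n_rows)
    rows_per_chunk = max(1, TARGET_ELEMENTS // n_cols)
    # Never ask for more rows than the result actually has.
    return min(rows_per_chunk, n_rows) if n_rows else rows_per_chunk
```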
I'd suggest a default of
Looks great overall! The adaptive chunk size makes a lot of sense from a Polars and Python perspective, but the underlying
@ray6080 and @mewim will have more thoughts on this too, so will leave it to them.
Yup; the parameter would stay the same, you'd just use the number of columns (which I believe would always be known when this is called) to dynamically influence the value you pass to it (if not already explicitly set). I've long advocated against APIs that target n_rows, ever since seeing a real-life case back in my old job where the number of columns returned could vary between one and several thousand, but the chunk size (given as n_rows) remained the same - and people were surprised when some result sets would crash because sufficient memory could not be allocated :))
Yeah, I think targeting the number of values makes sense; we should have a way for the user to pass that in, and also a way of returning everything as a single chunk.
* Fix #2998: Arrow chunk_size as keyword argument
* Adaptive chunk size logic for get_as_arrow
* Run formatter
* Fix missing kwarg
* Fix chunk sizes for arrow tests
* Provide polars users the means to customize chunk_size
* test small chunk_size for polars and arrow
* Revert to small test chunk sizes
* Rework chunk_size defaults and conditions
* Fix conditional logic
* Cover 0, -1, None and fixed int params for chunk_size in tests
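The commit list suggests chunk_size ends up accepting 0, -1, None, or a fixed integer. A rough sketch of how such values could be resolved; the exact mapping below is an assumption for illustration, not the merged implementation:

```python
from typing import Optional

def resolve_chunk_size(chunk_size: Optional[int], n_rows: int, n_cols: int) -> int:
    # Assumed mapping:
    #   None         -> adaptive size targeting ~10m elements per chunk
    #   0 or -1      -> everything in a single chunk
    #   positive int -> use the caller's value as-is
    if chunk_size is None:
        return max(1, 10_000_000 // max(1, n_cols))
    if chunk_size <= 0:
        return max(1, n_rows)
    return chunk_size
```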
Currently, `chunk_size` is a positional argument in Python. This requires a new user to experience a runtime error before realizing this fact. Pandas has no such requirement (`get_as_df` has no arguments). We could instead make `chunk_size` an optional kwarg and set it to a default of 10000. In fact, this is what Polars has done as per this discussion. What do you think about this @ray6080? It won't break any existing functionality, and 10000 seems like a reasonable default.
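For concreteness, a minimal sketch of what the proposed signature could look like; the surrounding class name and return type are assumptions based on this thread, and the body is omitted:

```python
import pyarrow as pa

class QueryResult:
    def get_as_arrow(self, chunk_size: int = 10000) -> pa.Table:
        """Return the query result as a pyarrow.Table.

        chunk_size is an optional keyword argument with a default of 10000,
        so callers no longer hit a runtime error when they omit it:

            result.get_as_arrow()                 # uses the default
            result.get_as_arrow(chunk_size=500)   # explicit override
        """
        ...
```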