
feat: updates to allow users to set max_stream_count #2039

Merged
merged 25 commits into main from update-maxstreams on Oct 10, 2024

Conversation

@chalmerlowe (Collaborator) commented Oct 4, 2024

Adds a function `determine_requested_streams()` to compare `preserve_order` and the new argument `max_stream_count` to determine how many streams to request.

preserve_order (bool): Whether to preserve the order of streams. If True,
    this limits the number of streams to one (more than one cannot guarantee order).
max_stream_count (Union[int, None]): The maximum number of streams
    allowed. Must be a non-negative number or None, where None indicates
    the value is unset. If `max_stream_count` is set, it overrides
    `preserve_order`.

Fixes #2030 🦕
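
For readers skimming the thread, here is a minimal sketch of the helper as described in this summary, where an explicit `max_stream_count` overrides `preserve_order` (note that this precedence is revisited later in the conversation). The merged implementation lives in `google/cloud/bigquery/_pandas_helpers.py` and may differ in detail.

```python
from typing import Optional


def determine_requested_streams(
    preserve_order: bool,
    max_stream_count: Optional[int] = None,
) -> int:
    """Return the number of streams to request; 0 lets the server decide."""
    if max_stream_count is not None:
        if max_stream_count < 0:
            raise ValueError("max_stream_count must be non-negative or None")
        # An explicit setting overrides preserve_order (behavior as originally proposed).
        return max_stream_count
    # max_stream_count is unset: limit to one stream only if order must be preserved.
    return 1 if preserve_order else 0
```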

@chalmerlowe requested review from a team as code owners, October 4, 2024 12:32
@product-auto-label bot added the `size: m` (Pull request size is medium) and `api: bigquery` (Issues related to the googleapis/python-bigquery API) labels, Oct 4, 2024
@chalmerlowe requested review from suzmue and removed the request for tswast, October 7, 2024 15:54
@chalmerlowe self-assigned this, Oct 7, 2024
@kien-truong (Contributor) commented

Shouldn't `preserve_order` take priority over `max_stream_count` for correctness?

@chalmerlowe (Collaborator, Author) commented

@kien-truong

On the question of which argument overrides which, I could probably be convinced to go either way.

I purposely chose to have `max_stream_count` override `preserve_order` and wrote the docstring to state that, because:

  • `max_stream_count` defaults to None, so any user whose in-house code currently calls `_download_table_bqstorage()` without `max_stream_count` but with a value for `preserve_order` will see the same behavior as before.
  • If, however, they go to the trouble of including a value for `max_stream_count`, I would presume they are doing so with an understanding of the new parameter and of how it affects any previous setting for `preserve_order`, as defined in the docstring.

@suzmue (Contributor) commented Oct 7, 2024

@chalmerlowe @kien-truong

I am also in favor of having `preserve_order` override the `max_stream_count` setting. I think this would make it easier to use and understand.

If `max_stream_count >= 1`, that condition would still be satisfied by setting `requested_streams = 1`, since `max_stream_count` is not a guarantee of any particular number of streams even in the BigQuery Storage API. From the issue, it also seems that an important use case is to limit the number of streams, not increase them (though I am sure there are use cases for that as well).

Perhaps raising an error when `max_stream_count == 0` and `preserve_order` is set would make sense if we swap to favoring `preserve_order`, since the caller would then be explicitly requesting both unlimited streams and a single stream.

Edit: Actually, I don't think the error is necessary, for the same reason that setting the value to 1 should always be OK.
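
To illustrate why capping at one stream always satisfies `max_stream_count >= 1`: the value this helper computes is ultimately used as the `max_stream_count` of a BigQuery Storage API read session request, where it is an upper bound rather than a guarantee (0 means the server picks). A hedged illustration with placeholder project and table names:

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

session = types.ReadSession(
    table="projects/my-project/datasets/my_dataset/tables/my_table",
    data_format=types.DataFormat.ARROW,
)

# max_stream_count is an upper bound on how many streams the server may
# return; 0 lets the server choose. The server can always return fewer
# streams, which is why capping at 1 for preserve_order is safe.
read_session = client.create_read_session(
    parent="projects/my-project",
    read_session=session,
    max_stream_count=1,
)
print(len(read_session.streams))
```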

@chalmerlowe (Collaborator, Author) commented

Let me cogitate on this. Thanks for the input, both of you.

@chalmerlowe (Collaborator, Author) commented

@suzmue @kien-truong

I revised the logic. Once the CI/CD checks pass, can you approve it, Suzy?
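
For reference, a sketch of the revised precedence discussed above, where `preserve_order` wins and `max_stream_count` applies only when order does not matter; this is an illustration of the agreed behavior, not a verbatim copy of the merged code in `google/cloud/bigquery/_pandas_helpers.py`.

```python
from typing import Optional


def determine_requested_streams(
    preserve_order: bool,
    max_stream_count: Optional[int] = None,
) -> int:
    """Revised precedence: preserve_order takes priority over max_stream_count."""
    if preserve_order:
        # More than one stream cannot guarantee ordering, so cap the request at one.
        return 1
    if max_stream_count is not None:
        if max_stream_count < 0:
            raise ValueError("max_stream_count must be non-negative or None")
        return max_stream_count
    # Unset: 0 asks the backend to choose a reasonable number of streams.
    return 0
```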

google/cloud/bigquery/_pandas_helpers.py — 4 review threads (outdated, resolved)
chalmerlowe and others added 3 commits October 10, 2024 05:12
Co-authored-by: Suzy Mueller <suzmue@google.com>
Co-authored-by: Suzy Mueller <suzmue@google.com>
Co-authored-by: Suzy Mueller <suzmue@google.com>
@chalmerlowe added the `automerge` label (Merge the pull request once unit tests and other checks pass), Oct 10, 2024
@gcf-merge-on-green bot merged commit 7372ad6 into main, Oct 10, 2024
19 checks passed
@gcf-merge-on-green bot removed the `automerge` label, Oct 10, 2024
@gcf-merge-on-green bot deleted the update-maxstreams branch, October 10, 2024 09:32
chalmerlowe added a commit that referenced this pull request Nov 1, 2024
Labels: `api: bigquery` (Issues related to the googleapis/python-bigquery API), `size: m` (Pull request size is medium)
Projects: None yet
Development

Successfully merging this pull request may close these issues:

Make max_stream_count configurable when using Bigquery Storage API
3 participants