
Add max_results option to QueryJob.to_dataframe and QueryJob.to_arrow methods #296

Closed
tswast opened this issue Oct 5, 2020 · 4 comments · Fixed by #698
Assignee: plamut
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API.), type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)

Comments

tswast (Contributor) commented Oct 5, 2020

Currently, pandas-gbq calls QueryJob.result() and Client.list_rows() directly

https://github.com/pydata/pandas-gbq/blob/46c579ac21879b431c8568b49e68624f4a5ea25e/pandas_gbq/gbq.py#L561-L564

This is because the max_results parameter is needed but is not available on to_dataframe.

Right now the only cost is some duplicated code, but it may keep pandas-gbq from benefiting from the "fast query path" changes currently being designed.
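To make the gap concrete, here is a minimal sketch using stand-in stub classes (these are illustrative stand-ins, not the real google-cloud-bigquery API): because to_dataframe() takes no max_results, a caller that needs a row limit must go through result() explicitly.

```python
# Illustrative stubs only -- not the real google-cloud-bigquery classes.
class RowIteratorStub:
    """Stands in for bigquery.table.RowIterator."""

    def __init__(self, rows, max_results=None):
        # result(max_results=...) can cap the rows returned.
        self._rows = rows if max_results is None else rows[:max_results]

    def to_dataframe(self):
        # The real method builds a pandas.DataFrame; a list stands in here.
        return list(self._rows)


class QueryJobStub:
    """Stands in for bigquery.job.QueryJob."""

    def __init__(self, rows):
        self._rows = rows

    def result(self, max_results=None):
        return RowIteratorStub(self._rows, max_results=max_results)

    def to_dataframe(self):
        # No max_results parameter -- the gap this issue asks to close.
        return self.result().to_dataframe()


job = QueryJobStub([{"n": i} for i in range(10)])
# The workaround callers such as pandas-gbq must use today:
first_three = job.result(max_results=3).to_dataframe()
assert len(first_three) == 3
```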

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Oct 5, 2020
@meredithslota meredithslota added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Oct 5, 2020
@tswast tswast changed the title Add max_results option to to_dataframe methods Add max_results option to to_dataframe and to_arrow methods Nov 10, 2020
@plamut plamut self-assigned this Jun 10, 2021
plamut (Contributor) commented Jun 10, 2021

@tswast Isn't max_results already set when a query job is created? If one then writes:

```python
rows_iterator = query_job.result()
rows_iterator.to_dataframe(max_results=42)
```

should the max_results argument override the same setting on the query job? And should iteration stop once that many rows have been fetched and yielded to the user code?

Also, max_results is not compatible with the BQ Storage client, so if that client is in use, should the code just fall back to the REST API with a warning?

tswast (Contributor, author) commented Jun 10, 2021

> The max_results argument should override the same setting on the query job?

I don't see max_results here:

```python
def query(
    self,
    query: str,
    job_config: QueryJobConfig = None,
    job_id: str = None,
    job_id_prefix: str = None,
    location: str = None,
    project: str = None,
    retry: retries.Retry = DEFAULT_RETRY,
    timeout: float = None,
) -> job.QueryJob:
```

> Also, max_results is not compatible with BQ Storage client, thus if that is used, the code should just fall back to the REST API with a warning?

I'm pretty sure that's the current behavior.

> rows_iterator.to_dataframe(max_results=42)

Oh! No, I didn't mean here. I meant on the QueryJob. There's a QueryJob.to_dataframe() method and a QueryJob.to_arrow() method that call result(). I'd like them to be able to pass max_results through to that call.

@tswast tswast changed the title Add max_results option to to_dataframe and to_arrow methods Add max_results option to QueryJob.to_dataframe and QueryJob.to_arrow methods Jun 10, 2021
tswast (Contributor, author) commented Jun 10, 2021

(Looking at this, probably the solution in pandas-gbq is to call .result() with a max_results argument. Not sure why we weren't doing that.)

plamut (Contributor) commented Jun 10, 2021

Ah, sorry, I meant the row iterator, yes.

> (Looking at this, probably the solution in pandas-gbq is to call .result() with a max_results argument. Not sure why we weren't doing that.)

I thought there was a reason for that in pandas-gbq, and that you wanted to pass max_results later, to the row iterator returned by query_job.result() 😆

Looks like we are almost set, then; we just need to make sure that max_results is passed through to query_job.result() even when the latter is called indirectly, e.g. through query_job.to_dataframe().
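A minimal sketch of that forwarding, again with stand-in classes rather than the actual library code: QueryJob.to_dataframe() grows a max_results parameter and simply passes it through to result(), so the row limit applies even on the indirect path.

```python
# Illustrative sketch of the agreed change -- not the real library code.
class RowIterator:
    """Stand-in for bigquery.table.RowIterator."""

    def __init__(self, rows):
        self._rows = rows

    def to_dataframe(self):
        # The real method builds a pandas.DataFrame; a list stands in here.
        return list(self._rows)


class QueryJob:
    """Stand-in for bigquery.job.QueryJob."""

    def __init__(self, rows):
        self._rows = rows

    def result(self, max_results=None):
        rows = self._rows if max_results is None else self._rows[:max_results]
        return RowIterator(rows)

    def to_dataframe(self, max_results=None):
        # Forward max_results to result() -- the change this issue requests.
        return self.result(max_results=max_results).to_dataframe()


job = QueryJob(list(range(100)))
assert len(job.to_dataframe(max_results=42)) == 42
```

The same forwarding would apply to a to_arrow(max_results=None) method, which calls result() in the same way.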
