
Add max_results option to QueryJob.to_dataframe and QueryJob.to_arrow methods #296

Closed
tswast opened this issue Oct 5, 2020 · 4 comments · Fixed by #698
Assignee: plamut
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API.), type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)

Comments

tswast (Contributor) commented Oct 5, 2020

Currently, pandas-gbq calls QueryJob.result() and Client.list_rows() directly

https://github.com/pydata/pandas-gbq/blob/46c579ac21879b431c8568b49e68624f4a5ea25e/pandas_gbq/gbq.py#L561-L564

This is because the max_results parameter is needed but is not available on to_dataframe.

Right now the only cost is some duplicated code, but it may keep pandas-gbq from benefiting from the "fast query path" changes currently being designed.
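To make the gap concrete, here is a minimal sketch using stand-in stub classes (these are illustrative stand-ins, not the real google-cloud-bigquery API): because to_dataframe() takes no max_results, a caller that needs a row limit must go through result() explicitly.

```python
# Illustrative stubs only -- not the real google-cloud-bigquery classes.
class RowIteratorStub:
    """Stands in for bigquery.table.RowIterator."""

    def __init__(self, rows, max_results=None):
        # result(max_results=...) can cap the rows returned.
        self._rows = rows if max_results is None else rows[:max_results]

    def to_dataframe(self):
        # The real method builds a pandas.DataFrame; a list stands in here.
        return list(self._rows)


class QueryJobStub:
    """Stands in for bigquery.job.QueryJob."""

    def __init__(self, rows):
        self._rows = rows

    def result(self, max_results=None):
        return RowIteratorStub(self._rows, max_results=max_results)

    def to_dataframe(self):
        # No max_results parameter -- the gap this issue asks to close.
        return self.result().to_dataframe()


job = QueryJobStub([{"n": i} for i in range(10)])
# The workaround callers such as pandas-gbq must use today:
first_three = job.result(max_results=3).to_dataframe()
assert len(first_three) == 3
```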

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Oct 5, 2020
@meredithslota meredithslota added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Oct 5, 2020
@tswast tswast changed the title Add max_results option to to_dataframe methods Add max_results option to to_dataframe and to_arrow methods Nov 10, 2020
@plamut plamut self-assigned this Jun 10, 2021
plamut (Contributor) commented Jun 10, 2021

@tswast Isn't max_results already set when a query job is created? If one then writes:

```python
rows_iterator = query_job.result()
rows_iterator.to_dataframe(max_results=42)
```

should the max_results argument override the same setting on the query job? And should iteration stop once that many rows have been fetched and yielded to the user code?

Also, max_results is not compatible with the BQ Storage client, so if that client is in use, should the code just fall back to the REST API with a warning?

tswast (Contributor, author) commented Jun 10, 2021

> The max_results argument should override the same setting on the query job?

I don't see max_results here:

```python
def query(
    self,
    query: str,
    job_config: QueryJobConfig = None,
    job_id: str = None,
    job_id_prefix: str = None,
    location: str = None,
    project: str = None,
    retry: retries.Retry = DEFAULT_RETRY,
    timeout: float = None,
) -> job.QueryJob:
```

> Also, max_results is not compatible with BQ Storage client, thus if that is used, the code should just fall back to the REST API with a warning?

I'm pretty sure that's the current behavior.

> rows_iterator.to_dataframe(max_results=42)

Oh! No, I didn't mean here. I meant on the QueryJob. There's a QueryJob.to_dataframe() method and a QueryJob.to_arrow() method that call result(). I'd like them to be able to pass max_results through to that call.

@tswast tswast changed the title Add max_results option to to_dataframe and to_arrow methods Add max_results option to QueryJob.to_dataframe and QueryJob.to_arrow methods Jun 10, 2021
tswast (Contributor, author) commented Jun 10, 2021

(Looking at this, probably the solution in pandas-gbq is to call .result() with a max_results argument. Not sure why we weren't doing that.)

plamut (Contributor) commented Jun 10, 2021

Ah, sorry, I meant the row iterator, yes.

> (Looking at this, probably the solution in pandas-gbq is to call .result() with a max_results argument. Not sure why we weren't doing that.)

I thought there was a reason for that in pandas-gbq, and that you wanted to pass max_results later, to the row iterator returned by query_job.result() 😆

Looks like we are almost set, then; we just need to make sure that max_results is passed through to query_job.result() even when the latter is called indirectly, e.g. through query_job.to_dataframe().
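A minimal sketch of that forwarding, again with stand-in classes rather than the actual library code: QueryJob.to_dataframe() grows a max_results parameter and simply passes it through to result(), so the row limit applies even on the indirect path.

```python
# Illustrative sketch of the agreed change -- not the real library code.
class RowIterator:
    """Stand-in for bigquery.table.RowIterator."""

    def __init__(self, rows):
        self._rows = rows

    def to_dataframe(self):
        # The real method builds a pandas.DataFrame; a list stands in here.
        return list(self._rows)


class QueryJob:
    """Stand-in for bigquery.job.QueryJob."""

    def __init__(self, rows):
        self._rows = rows

    def result(self, max_results=None):
        rows = self._rows if max_results is None else self._rows[:max_results]
        return RowIterator(rows)

    def to_dataframe(self, max_results=None):
        # Forward max_results to result() -- the change this issue requests.
        return self.result(max_results=max_results).to_dataframe()


job = QueryJob(list(range(100)))
assert len(job.to_dataframe(max_results=42)) == 42
```

The same forwarding would apply to a to_arrow(max_results=None) method, which calls result() in the same way.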
