
BigQuery: use union categorical to concatenate pages in to_dataframe when categorical dtype is requested #8044

Closed
dkapitan opened this issue May 20, 2019 · 4 comments · Fixed by #10180

dkapitan commented May 20, 2019

I am working on an enhancement for pandas-gbq to reduce the memory footprint in pandas through downcasting. See googleapis/python-bigquery-pandas#275 for the full details.

I have made a first working version and am getting erratic behaviour in the conversion of STRING to pandas category. It seems to be related to the size of the query results.

See attached notebook for the code example.
pandas-gbq-bugtracing.ipynb.zip
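
For reference, a minimal sketch of the kind of call involved, assuming google-cloud-bigquery's `to_dataframe` with its `dtypes` argument; the project, dataset, table, and column names are hypothetical:

```python
from google.cloud import bigquery

# Request a categorical dtype for a STRING column via the dtypes argument
# of to_dataframe. (Hypothetical table and column names.)
client = bigquery.Client()
query_job = client.query("SELECT string_col FROM `my_project.my_dataset.my_table`")
df = query_job.result().to_dataframe(dtypes={"string_col": "category"})

# For small results this comes back as category; for large, multi-page
# results the column can come back as object instead.
print(df["string_col"].dtype)
```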

@tseaver tseaver added api: bigquery Issues related to the BigQuery API. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels May 20, 2019
tseaver (Contributor) commented May 20, 2019

@tswast I know you have some work in progress related to the dtype conversions.

@tseaver tseaver changed the title Type conversion to pd.Dataframe with dtypes option is erratic BigQuery: type conversion to pd.Dataframe with dtypes option is erratic May 21, 2019
tswast (Contributor) commented May 23, 2019

If the query results are large, there are multiple pages to parse. Since we construct a dataframe for each page and then concatenate them, the results may not be as expected. See: http://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#concatenation

When categorical series with different category sets are concatenated, the result is converted to the object dtype. It seems this can be avoided with https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.union_categoricals.html
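
A minimal sketch of the difference, using hypothetical two-page data:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two "pages" whose categorical columns end up with different category sets.
page1 = pd.Series(["red", "blue"], dtype="category")
page2 = pd.Series(["green"], dtype="category")

# Plain concat falls back to object dtype because the categories differ.
print(pd.concat([page1, page2]).dtype)  # object

# union_categoricals unions the category sets and keeps the categorical dtype.
combined = pd.Series(union_categoricals([page1, page2]))
print(combined.dtype)  # category
```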

I'd love it if there were a way to support this feature without adding special logic for each dtype.

Related: per googleapis/python-bigquery-pandas#275 (comment) there are more immediate ways that we can decrease peak memory usage. I've filed #8107 to track that issue separately.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 16, 2019
@tswast tswast changed the title BigQuery: type conversion to pd.Dataframe with dtypes option is erratic BigQuery: use union categorical to concatenate pages in to_dataframe when categorical dtype is requested Jan 10, 2020
@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed 🚨 This issue needs some love. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jan 10, 2020
@tswast tswast assigned plamut and unassigned tswast Jan 10, 2020
tswast (Contributor) commented Jan 10, 2020

I updated the issue title to reflect that this issue is specifically for the categorical dtype.

#10027, which refactors to_dataframe to use to_arrow, will affect this issue. Instead of converting the dtype for each page, that PR applies the dtype once, when converting from the Arrow table to a pandas dataframe. This may actually fix the issue with the categorical dtype (though it would not decrease peak memory usage).
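
A minimal sketch of that approach, assuming pyarrow; the column name is hypothetical:

```python
import pyarrow as pa

# Concatenate the Arrow pages first, then convert to pandas and apply the
# requested dtype once, so the categories are built from the full column
# rather than per page.
pages = [
    pa.table({"color": ["red", "blue"]}),
    pa.table({"color": ["green", "red"]}),
]
combined = pa.concat_tables(pages)

df = combined.to_pandas()
df["color"] = df["color"].astype("category")
print(df["color"].dtype)  # category
```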

I suggest we at least add tests with categorical dtypes before closing this issue.

plamut (Contributor) commented Jan 20, 2020

FWIW, peak memory usage can be reduced considerably by using the bqstorage API (benchmark from today).

I will have a look at the remaining scope of this issue (the dtypes).
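
For reference, a hedged sketch of the bqstorage path. It requires the google-cloud-bigquery-storage package, import paths vary across releases, and the query here is just an example against a public dataset:

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage

# Stream results through the BigQuery Storage API instead of paging through
# the REST API, which lowers peak memory during the download.
bq_client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

df = (
    bq_client.query(
        "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013`"
    )
    .result()
    .to_dataframe(bqstorage_client=bqstorage_client)
)
```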

@tswast tswast added the Python 3 Only This work would involve supporting Python 3 only code paths. label Jan 22, 2020