
BigQuery: use union categorical to concatenate pages in to_dataframe when categorical dtype is requested #8044

Closed
dkapitan opened this issue May 20, 2019 · 4 comments · Fixed by #10180

dkapitan commented May 20, 2019

I am working on an enhancement for pandas-gbq to reduce the memory footprint in pandas through downcasting. See googleapis/python-bigquery-pandas#275 for the full details.

I have made a first working version and am getting erratic behaviour in the conversion of STRING to pandas category. It seems to be related to the size of the query results.

See attached notebook for the code example.
pandas-gbq-bugtracing.ipynb.zip
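
For reference, a minimal sketch of the kind of call involved, assuming google-cloud-bigquery's `to_dataframe` with its `dtypes` argument; the project, dataset, table, and column names are hypothetical:

```python
from google.cloud import bigquery

# Request a categorical dtype for a STRING column via the dtypes argument
# of to_dataframe. (Hypothetical table and column names.)
client = bigquery.Client()
query_job = client.query("SELECT string_col FROM `my_project.my_dataset.my_table`")
df = query_job.result().to_dataframe(dtypes={"string_col": "category"})

# For small results this comes back as category; for large, multi-page
# results the column can come back as object instead.
print(df["string_col"].dtype)
```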

@tseaver tseaver added api: bigquery Issues related to the BigQuery API. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels May 20, 2019
tseaver (Contributor) commented May 20, 2019

@tswast I know you have some work in progress related to the dtype conversions.

@tseaver tseaver changed the title Type conversion to pd.Dataframe with dtypes option is erratic BigQuery: type conversion to pd.Dataframe with dtypes option is erratic May 21, 2019
tswast (Contributor) commented May 23, 2019

If the query results are large, there are multiple pages to parse. Since we construct a dataframe for each page and then concatenate them, the results may not be as expected. See: http://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#concatenation

When categorical series with different category sets are concatenated, the result is converted to the object dtype. It seems this can be avoided with https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.union_categoricals.html
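
A minimal sketch of the difference, using hypothetical two-page data:

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two "pages" whose categorical columns end up with different category sets.
page1 = pd.Series(["red", "blue"], dtype="category")
page2 = pd.Series(["green"], dtype="category")

# Plain concat falls back to object dtype because the categories differ.
print(pd.concat([page1, page2]).dtype)  # object

# union_categoricals unions the category sets and keeps the categorical dtype.
combined = pd.Series(union_categoricals([page1, page2]))
print(combined.dtype)  # category
```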

I'd love it if there were a way to support this feature without adding special logic for each dtype.

Related: per googleapis/python-bigquery-pandas#275 (comment) there are more immediate ways that we can decrease peak memory usage. I've filed #8107 to track that issue separately.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 16, 2019
@tswast tswast changed the title BigQuery: type conversion to pd.Dataframe with dtypes option is erratic BigQuery: use union categorical to concatenate pages in to_dataframe when categorical dtype is requested Jan 10, 2020
@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed 🚨 This issue needs some love. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jan 10, 2020
@tswast tswast assigned plamut and unassigned tswast Jan 10, 2020
tswast (Contributor) commented Jan 10, 2020

I updated the issue title to reflect that this issue is specifically for the categorical dtype.

#10027, which refactors to_dataframe to use to_arrow, will affect this issue. Instead of converting the dtype for each page, that PR applies the dtype once, when converting from the Arrow table to a pandas dataframe. This may actually fix the issue with the categorical dtype (though it would not decrease peak memory usage).
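
A minimal sketch of that approach, assuming pyarrow; the column name is hypothetical:

```python
import pyarrow as pa

# Concatenate the Arrow pages first, then convert to pandas and apply the
# requested dtype once, so the categories are built from the full column
# rather than per page.
pages = [
    pa.table({"color": ["red", "blue"]}),
    pa.table({"color": ["green", "red"]}),
]
combined = pa.concat_tables(pages)

df = combined.to_pandas()
df["color"] = df["color"].astype("category")
print(df["color"].dtype)  # category
```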

I suggest we at least add tests with categorical dtypes before closing this issue.

plamut (Contributor) commented Jan 20, 2020

FWIW, peak memory usage can be reduced considerably by using the bqstorage API (benchmark from today).

I will have a look at the remaining scope of this issue (the dtypes).
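
For reference, a hedged sketch of the bqstorage path. It requires the google-cloud-bigquery-storage package, import paths vary across releases, and the query here is just an example against a public dataset:

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage

# Stream results through the BigQuery Storage API instead of paging through
# the REST API, which lowers peak memory during the download.
bq_client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

df = (
    bq_client.query(
        "SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013`"
    )
    .result()
    .to_dataframe(bqstorage_client=bqstorage_client)
)
```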

@tswast tswast added the Python 3 Only This work would involve supporting Python 3 only code paths. label Jan 22, 2020