
BigQuery: Fixed pandas DataFrames being returned with incorrect index. #7953

Merged 2 commits on May 13, 2019
Conversation

@eriknil (Contributor) commented May 13, 2019

When loading large datasets from BigQuery as pandas DataFrames, the returned index sometimes contains duplicates. This happens because the results are collected as multiple DataFrames and then concatenated without resetting the index.

As an example, we get 0, 1 repeated in the index:

In [1]: import pandas as pd
In [2]: x = pd.DataFrame({"a": [1, 2]})
   ...: pd.concat([x, x])
Out[2]:
   a
0  1
1  2
0  1
1  2

which has some unintended consequences when we try to subset:

In [3]: pd.concat([x, x]).loc[0]
Out[3]:
   a
0  1
0  1

Instead, by setting ignore_index=True we get:

In [4]: pd.concat([x, x], ignore_index=True)
Out[4]:
   a
0  1
1  2
2  1
3  2

From the pandas documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html:

ignore_index : boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

In our case the per-page indexes carry no meaningful information, so discarding them is safe.
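For frames that have already been concatenated with duplicated labels, the same repair can be made after the fact with reset_index(drop=True); a minimal sketch:

```python
import pandas as pd

x = pd.DataFrame({"a": [1, 2]})
combined = pd.concat([x, x])  # index is 0, 1, 0, 1

# Relabel to a fresh 0..n-1 RangeIndex and discard the old labels,
# equivalent to having passed ignore_index=True to pd.concat.
fixed = combined.reset_index(drop=True)
print(fixed.index.is_unique)  # True
```

Either approach produces the same unique 0..n-1 labeling; passing ignore_index=True at concatenation time simply avoids the extra pass.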

Reproducible example

We can see this in action by running the following query against the crypto_dash public dataset:

from google.cloud import bigquery
bq_client = bigquery.Client()

query = """
SELECT
  block_timestamp_month
FROM
  `bigquery-public-data.crypto_dash.transactions`
LIMIT
  1000000
"""
data = bq_client.query(query).result().to_dataframe()

and then checking if the indexes are unique:

print(data.index.is_unique)
# False

I have reproduced the problem in two unit tests and fixed it in this PR.
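The shape of the fix can be sketched as follows (pages_to_dataframe is a hypothetical stand-in for the client's internal per-page assembly, not the actual function name in google-cloud-bigquery):

```python
import pandas as pd

def pages_to_dataframe(pages):
    """Concatenate per-page DataFrames into one frame.

    ignore_index=True discards each page's local 0..n-1 index and
    relabels the combined frame 0..N-1, so labels stay unique.
    """
    return pd.concat(pages, ignore_index=True)

# Two "pages" whose local indexes would otherwise collide:
page = pd.DataFrame({"a": [1, 2]})
result = pages_to_dataframe([page, page])
print(result.index.is_unique)  # True
```

With this in place, label-based subsetting such as .loc[0] returns a single row rather than one row per page.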

@eriknil eriknil requested a review from a team May 13, 2019 00:15
@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



ℹ️ Googlers: Go here for more info.

@googlebot googlebot added the cla: no This human has *not* signed the Contributor License Agreement. label May 13, 2019
@eriknil (Contributor, Author) commented May 13, 2019

I signed it!

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels May 13, 2019
@tswast (Contributor) left a comment


Thanks for the contribution, explanation, and the unit tests!

@tswast tswast merged commit 53e492c into googleapis:master May 13, 2019
Labels
cla: yes This human has signed the Contributor License Agreement.
3 participants