Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: default to DATETIME type when loading timezone-naive datetimes from Pandas #1061

Merged
merged 4 commits into from
Nov 16, 2021

Conversation

plamut
Copy link
Contributor

@plamut plamut commented Nov 15, 2021

Closes #985.

This proved to be more tricky than expected, because manual introspection is needed when augmenting the schema - pyarrow attaches the UTC timezone to naive datetimes, making it problematic to distinguish these from timezone-aware datetimes.

PR checklist:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

@plamut plamut requested review from tswast and a team November 15, 2021 10:03
@plamut plamut requested a review from a team as a code owner November 15, 2021 10:03
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Nov 15, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Nov 15, 2021
Comment on lines +524 to +527
if detected_type == "TIMESTAMP":
valid_item = _first_array_valid(dataframe[field.name])
if isinstance(valid_item, datetime) and valid_item.tzinfo is None:
detected_type = "DATETIME"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of doing this check for all detected TIMESTAMP values, but it turned out it's only necessary for datetimes inside an array, because that's when we need to use pyarrow to help.

For datetime values outside of arrays, we can already distinguish between naive and aware ones based on Pandas dtypes, meaning that we do not even enter augment_schema() for them.


# Valid item is None because all items in the "valid" array are invalid. Try
# to find a true valid array manually.
for array in islice(series, first_valid_index + 1, None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure if slicing the series results in an unnecessary copy (Pandas docs say it's context-dependent), thus played it safe and just used islice.

@plamut plamut requested a review from a team as a code owner November 15, 2021 15:03
@plamut plamut requested a review from parthea November 15, 2021 15:03
@plamut
Copy link
Contributor Author

plamut commented Nov 15, 2021

Status checks got stuck...

@plamut plamut added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 15, 2021
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 15, 2021
Copy link
Contributor

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good catch in identifying the additional logic needed for arrays of DATETIME

@tswast tswast merged commit 3cae066 into googleapis:v3 Nov 16, 2021
@plamut plamut deleted the iss-985 branch November 16, 2021 20:46
tswast added a commit that referenced this pull request Mar 29, 2022
deps!: BigQuery Storage and pyarrow are required dependencies (#776)

fix!: use nullable `Int64` and `boolean` dtypes in `to_dataframe` (#786) 

feat!: destination tables are no-longer removed by `create_job` (#891)

feat!: In `to_dataframe`, use `dbdate` and `dbtime` dtypes from db-dtypes package for BigQuery DATE and TIME columns (#972)

fix!: automatically convert out-of-bounds dates in `to_dataframe`, remove `date_as_object` argument (#972)

feat!: mark the package as type-checked (#1058)

feat!: default to DATETIME type when loading timezone-naive datetimes from Pandas (#1061)

feat: add `api_method` parameter to `Client.query` to select `INSERT` or `QUERY` API (#967)

fix: improve type annotations for mypy validation (#1081)

feat: use `StandardSqlField` class for `Model.feature_columns` and `Model.label_columns` (#1117)

docs: Add migration guide from version 2.x to 3.x (#1027)

Release-As: 3.0.0
waltaskew pushed a commit to waltaskew/python-bigquery that referenced this pull request Jul 20, 2022
deps!: BigQuery Storage and pyarrow are required dependencies (googleapis#776)

fix!: use nullable `Int64` and `boolean` dtypes in `to_dataframe` (googleapis#786) 

feat!: destination tables are no-longer removed by `create_job` (googleapis#891)

feat!: In `to_dataframe`, use `dbdate` and `dbtime` dtypes from db-dtypes package for BigQuery DATE and TIME columns (googleapis#972)

fix!: automatically convert out-of-bounds dates in `to_dataframe`, remove `date_as_object` argument (googleapis#972)

feat!: mark the package as type-checked (googleapis#1058)

feat!: default to DATETIME type when loading timezone-naive datetimes from Pandas (googleapis#1061)

feat: add `api_method` parameter to `Client.query` to select `INSERT` or `QUERY` API (googleapis#967)

fix: improve type annotations for mypy validation (googleapis#1081)

feat: use `StandardSqlField` class for `Model.feature_columns` and `Model.label_columns` (googleapis#1117)

docs: Add migration guide from version 2.x to 3.x (googleapis#1027)

Release-As: 3.0.0
abdelmegahedgoogle pushed a commit to abdelmegahedgoogle/python-bigquery that referenced this pull request Apr 17, 2023
deps!: BigQuery Storage and pyarrow are required dependencies (googleapis#776)

fix!: use nullable `Int64` and `boolean` dtypes in `to_dataframe` (googleapis#786) 

feat!: destination tables are no-longer removed by `create_job` (googleapis#891)

feat!: In `to_dataframe`, use `dbdate` and `dbtime` dtypes from db-dtypes package for BigQuery DATE and TIME columns (googleapis#972)

fix!: automatically convert out-of-bounds dates in `to_dataframe`, remove `date_as_object` argument (googleapis#972)

feat!: mark the package as type-checked (googleapis#1058)

feat!: default to DATETIME type when loading timezone-naive datetimes from Pandas (googleapis#1061)

feat: add `api_method` parameter to `Client.query` to select `INSERT` or `QUERY` API (googleapis#967)

fix: improve type annotations for mypy validation (googleapis#1081)

feat: use `StandardSqlField` class for `Model.feature_columns` and `Model.label_columns` (googleapis#1117)

docs: Add migration guide from version 2.x to 3.x (googleapis#1027)

Release-As: 3.0.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants