feat!: Use pandas custom data types for BigQuery DATE and TIME columns, remove date_as_object argument #972

Merged Nov 10, 2021 (87 commits; changes shown from 81 commits)

Commits
849f2c0
feat: add pandas time arrays to hold BigQuery TIME data
jimfulton Sep 9, 2021
99b43b6
remove commented/aborted base class and simplify super call in __init__
jimfulton Sep 9, 2021
f722570
dtypes tests pass with pandas 0.24.2
jimfulton Sep 9, 2021
724f62f
blacken and lint
jimfulton Sep 9, 2021
9b03449
Added DateArray and generalized tests to handle dates and times
jimfulton Sep 9, 2021
6af561d
blacken
jimfulton Sep 9, 2021
7fc299c
handle bad values passed to array construction
jimfulton Sep 9, 2021
869f5f7
Add null/None handling
jimfulton Sep 9, 2021
5505069
Added repeat and copy tests
jimfulton Sep 11, 2021
15ca7e1
don't use extract_array and make _from_sequence_of_strings an alias …
jimfulton Sep 11, 2021
9392cb4
Summary test copying ndarray arg
jimfulton Sep 11, 2021
04c1e6a
expand construction test to test calling class directly and calling _…
jimfulton Sep 11, 2021
f59169c
blacken
jimfulton Sep 11, 2021
c33f608
test size and shape
jimfulton Sep 11, 2021
41bbde6
Enable properties for pandas 1.3 and later
jimfulton Sep 13, 2021
e9ed1c4
Test more pandas versions.
jimfulton Sep 13, 2021
d6d81fe
Updated import_default with an option to force use of the default
jimfulton Sep 13, 2021
a8697f3
simplified version-checking code
jimfulton Sep 13, 2021
8ea43c5
_from_factorized
jimfulton Sep 13, 2021
2f37929
fix small DRY violation for parametrization by date and time dtypes
jimfulton Sep 14, 2021
ad0e3c0
isna
jimfulton Sep 14, 2021
c1ebb5c
take
jimfulton Sep 14, 2021
eaa2e96
_concat_same_type
jimfulton Sep 14, 2021
82a2d84
fixed __getitem__ to handle array indexes
jimfulton Sep 14, 2021
6f4178f
test __getitem__ with array index and dropna
jimfulton Sep 14, 2021
d8818e0
fix assignment with array keys and fillna test
jimfulton Sep 14, 2021
6bfb75b
reminder (for someone :) ) to override some abstract implementations …
jimfulton Sep 14, 2021
3364bdd
unique test
jimfulton Sep 14, 2021
661c6b2
test argsort
jimfulton Sep 14, 2021
ac7330c
fix version in constraint
jimfulton Sep 14, 2021
9b7a72c
blacken/lint
jimfulton Sep 14, 2021
47b0756
stop fighting the framework and store as ns
jimfulton Sep 14, 2021
48f2e11
blacken/lint
jimfulton Sep 14, 2021
b1025b7
Implement astype to fix Python 3.7 failure
jimfulton Sep 15, 2021
7acdb05
test assigning None
jimfulton Sep 15, 2021
731634e
test astype self type w copy
jimfulton Sep 15, 2021
903e23c
test a concatenation dtype sanity check
jimfulton Sep 15, 2021
e52c65d
Added conversion of date to datetime
jimfulton Sep 15, 2021
74ef1b0
convert times to time deltas
jimfulton Sep 15, 2021
91d9e2b
Use new pandas date and time dtypes
jimfulton Sep 15, 2021
517307c
Get rid of date_as_object argument
jimfulton Sep 15, 2021
711cfaf
fixed a comment
jimfulton Sep 15, 2021
7cdea07
added *unit* test for dealing with dates and timestamps that can't f…
jimfulton Sep 15, 2021
a20f67b
Removed brittle hack that enabled series properties.
jimfulton Sep 16, 2021
96ed76a
Add note on possible zero-copy optimization for dates
jimfulton Sep 16, 2021
159c202
Implemented any, all, min, max and median
jimfulton Sep 16, 2021
a81b26e
make pytype happy
jimfulton Sep 16, 2021
98b3603
test (and fix) load from dataframe with date and time columns
jimfulton Sep 16, 2021
4f71c9c
Make sure insert_rows_from_dataframe works
jimfulton Sep 16, 2021
5c25ba4
Renamed date and time dtypes to bqdate and bqtime
jimfulton Sep 16, 2021
c6dabe2
make fallback date and time dtype names strings to make pytype happy
jimfulton Sep 17, 2021
7021601
date and time arrays implement __arrow_array__
jimfulton Sep 17, 2021
2585dbc
Document new dtypes
jimfulton Sep 17, 2021
77c1c9e
blacken/lint
jimfulton Sep 18, 2021
4261b80
Make conversion of date columns from arrow pandas output to pandas ze…
jimfulton Sep 18, 2021
2671718
Added date math support
jimfulton Sep 18, 2021
36eb58c
fix end tag
jimfulton Sep 18, 2021
a8d0cb0
fixed snippet ref
jimfulton Sep 18, 2021
3eabca3
Support date math with DateOffset scalars
jimfulton Sep 18, 2021
b81b996
use db-dtypes
jimfulton Sep 27, 2021
dad0d36
Include db-dtypes in sample requirements
jimfulton Sep 27, 2021
35df752
Updated docs and snippets for db-dtypes
jimfulton Sep 27, 2021
ba776f6
gaaaa, missed a bqdate
jimfulton Sep 27, 2021
10555e1
Merge branch 'v3' into dtypes-v3
parthea Oct 8, 2021
fc451f5
Merge remote-tracking branch 'upstream/v3' into dtypes-v3
tswast Nov 1, 2021
08d1b70
move db-dtypes samples to db-dtypes docs
tswast Nov 1, 2021
d3b57e0
update to work with db-dtypes 0.2.0+
tswast Nov 1, 2021
0442789
update dtype names in system tests
tswast Nov 2, 2021
c7ff18e
comment with link to arrow data types
tswast Nov 2, 2021
b99fa5f
update db-dtypes version
tswast Nov 2, 2021
8253559
experiment with direct pyarrow to extension array conversion
tswast Nov 3, 2021
c8b5d67
Merge branch 'v3' into dtypes-v3
tswast Nov 5, 2021
4a6ec3e
use types mapper for dbdate
tswast Nov 9, 2021
4d5d229
Merge branch 'dtypes-v3' of github.com:jimfulton/python-bigquery into…
tswast Nov 9, 2021
3953ead
fix constraints
tswast Nov 9, 2021
c4a1f2c
use types mapper where possible
tswast Nov 9, 2021
d7e7c5b
always use types mapper
tswast Nov 9, 2021
eec4103
adjust unit tests to use arrow not avro
tswast Nov 9, 2021
69deb6f
avoid "ValueError: need at least one array to concatenate" with empty…
tswast Nov 9, 2021
732fe86
link to pandas issue
tswast Nov 9, 2021
2cd97a1
remove unnecessary variable
tswast Nov 9, 2021
852d5a8
Update tests/unit/job/test_query_pandas.py
tswast Nov 9, 2021
ea5d254
Update tests/unit/job/test_query_pandas.py
tswast Nov 9, 2021
1d8adb1
add missing db-dtypes requirement
tswast Nov 10, 2021
ef8847d
Merge branch 'dtypes-v3' of github.com:jimfulton/python-bigquery into…
tswast Nov 10, 2021
25a8f75
avoid arrow_schema on older versions of bqstorage
tswast Nov 10, 2021
0d81fa1
Merge remote-tracking branch 'upstream/v3' into dtypes-v3
tswast Nov 10, 2021
14 changes: 13 additions & 1 deletion docs/usage/pandas.rst
Original file line number Diff line number Diff line change
@@ -50,13 +50,25 @@ The following data types are used when creating a pandas DataFrame.
-
* - DATETIME
- datetime64[ns], object
- object is used when there are values not representable in pandas
- The object dtype is used when there are values not representable in a
pandas nanosecond-precision timestamp.
* - DATE
- dbdate, object
- The object dtype is used when there are values not representable in a
pandas nanosecond-precision timestamp.

Requires the ``db-dtypes`` package. See the `db-dtypes usage guide
<https://googleapis.dev/python/db-dtypes/latest/usage.html>`_
* - FLOAT64
- float64
-
* - INT64
- Int64
-
* - TIME
- dbtime
- Requires the ``db-dtypes`` package. See the `db-dtypes usage guide
<https://googleapis.dev/python/db-dtypes/latest/usage.html>`_

Retrieve BigQuery GEOGRAPHY data as a GeoPandas GeoDataFrame
------------------------------------------------------------
66 changes: 39 additions & 27 deletions google/cloud/bigquery/_pandas_helpers.py
@@ -18,16 +18,21 @@
import functools
import logging
import queue
from typing import Dict, Sequence
import warnings

try:
import pandas
except ImportError: # pragma: NO COVER
pandas = None
date_dtype_name = time_dtype_name = "" # Use '' rather than None because pytype
else:
import numpy

from db_dtypes import DateDtype, TimeDtype

date_dtype_name = DateDtype.name
time_dtype_name = TimeDtype.name

import pyarrow
import pyarrow.parquet

@@ -77,15 +82,6 @@ def _to_wkb(v):

_MAX_QUEUE_SIZE_DEFAULT = object() # max queue size sentinel for BQ Storage downloads

# If you update the default dtypes, also update the docs at docs/usage/pandas.rst.
_BQ_TO_PANDAS_DTYPE_NULLSAFE = {
"BOOL": "boolean",
"BOOLEAN": "boolean",
"FLOAT": "float64",
"FLOAT64": "float64",
"INT64": "Int64",
"INTEGER": "Int64",
}
_PANDAS_DTYPE_TO_BQ = {
"bool": "BOOLEAN",
"datetime64[ns, UTC]": "TIMESTAMP",
@@ -102,6 +98,8 @@ def _to_wkb(v):
"uint16": "INTEGER",
"uint32": "INTEGER",
"geometry": "GEOGRAPHY",
date_dtype_name: "DATE",
time_dtype_name: "TIME",
}


@@ -267,26 +265,40 @@ def bq_to_arrow_schema(bq_schema):
return pyarrow.schema(arrow_fields)


def bq_schema_to_nullsafe_pandas_dtypes(
bq_schema: Sequence[schema.SchemaField],
) -> Dict[str, str]:
"""Return the default dtypes to use for columns in a BigQuery schema.
def default_types_mapper(date_as_object: bool = False):
"""Create a mapping from pyarrow types to pandas types.

Only returns default dtypes which are safe to have NULL values. This
includes Int64, which has pandas.NA values and does not result in
loss-of-precision.
This overrides the pandas defaults to use null-safe extension types where
available.

Returns:
A mapping from column names to pandas dtypes.
See: https://arrow.apache.org/docs/python/api/datatypes.html for a list of
data types. See:
tests/unit/test__pandas_helpers.py::test_bq_to_arrow_data_type for
BigQuery to Arrow type mapping.

Note to google-cloud-bigquery developers: If you update the default dtypes,
also update the docs at docs/usage/pandas.rst.
"""
dtypes = {}
for bq_field in bq_schema:
if bq_field.mode.upper() not in {"NULLABLE", "REQUIRED"}:
continue
field_type = bq_field.field_type.upper()
if field_type in _BQ_TO_PANDAS_DTYPE_NULLSAFE:
dtypes[bq_field.name] = _BQ_TO_PANDAS_DTYPE_NULLSAFE[field_type]
return dtypes

def types_mapper(arrow_data_type):
if pyarrow.types.is_boolean(arrow_data_type):
return pandas.BooleanDtype()

elif (
# If date_as_object is True, we know some DATE columns are
# out-of-bounds of what is supported by pandas.
not date_as_object
and pyarrow.types.is_date(arrow_data_type)
):
return DateDtype()

elif pyarrow.types.is_integer(arrow_data_type):
return pandas.Int64Dtype()

elif pyarrow.types.is_time(arrow_data_type):
return TimeDtype()

return types_mapper


def bq_to_arrow_array(series, bq_field):
16 changes: 0 additions & 16 deletions google/cloud/bigquery/job/query.py
@@ -1559,7 +1559,6 @@ def to_dataframe(
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
max_results: Optional[int] = None,
geography_as_object: bool = False,
) -> "pandas.DataFrame":
@@ -1602,12 +1601,6 @@

.. versionadded:: 1.24.0

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

.. versionadded:: 1.26.0

max_results (Optional[int]):
Maximum number of rows to include in the result. No limit by default.

@@ -1641,7 +1634,6 @@
dtypes=dtypes,
progress_bar_type=progress_bar_type,
create_bqstorage_client=create_bqstorage_client,
date_as_object=date_as_object,
geography_as_object=geography_as_object,
)

@@ -1654,7 +1646,6 @@ def to_geodataframe(
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
max_results: Optional[int] = None,
geography_column: Optional[str] = None,
) -> "geopandas.GeoDataFrame":
@@ -1697,12 +1688,6 @@

.. versionadded:: 1.24.0

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

.. versionadded:: 1.26.0

max_results (Optional[int]):
Maximum number of rows to include in the result. No limit by default.

@@ -1735,7 +1720,6 @@
dtypes=dtypes,
progress_bar_type=progress_bar_type,
create_bqstorage_client=create_bqstorage_client,
date_as_object=date_as_object,
geography_column=geography_column,
)

88 changes: 43 additions & 45 deletions google/cloud/bigquery/table.py
@@ -28,6 +28,8 @@
import pandas
except ImportError: # pragma: NO COVER
pandas = None
else:
import db_dtypes # noqa

import pyarrow

@@ -1815,7 +1817,6 @@ def to_dataframe(
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
geography_as_object: bool = False,
) -> "pandas.DataFrame":
"""Create a pandas DataFrame by loading all pages of a query.
@@ -1865,12 +1866,6 @@

.. versionadded:: 1.24.0

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

.. versionadded:: 1.26.0

geography_as_object (Optional[bool]):
If ``True``, convert GEOGRAPHY data to :mod:`shapely`
geometry objects. If ``False`` (default), don't cast
@@ -1912,40 +1907,44 @@
bqstorage_client=bqstorage_client,
create_bqstorage_client=create_bqstorage_client,
)
default_dtypes = _pandas_helpers.bq_schema_to_nullsafe_pandas_dtypes(
self.schema
)

# Let the user-defined dtypes override the default ones.
# https://stackoverflow.com/a/26853961/101923
dtypes = {**default_dtypes, **dtypes}

# When converting timestamp values to nanosecond precision, the result
# When converting date or timestamp values to nanosecond precision, the result
# can be out of pyarrow bounds. To avoid the error when converting to
# Pandas, we set the timestamp_as_object parameter to True, if necessary.
types_to_check = {
pyarrow.timestamp("us"),
pyarrow.timestamp("us", tz=datetime.timezone.utc),
}

for column in record_batch:
if column.type in types_to_check:
try:
column.cast("timestamp[ns]")
except pyarrow.lib.ArrowInvalid:
timestamp_as_object = True
break
else:
timestamp_as_object = False

extra_kwargs = {"timestamp_as_object": timestamp_as_object}
# Pandas, we set the date_as_object or timestamp_as_object parameter to True,
# if necessary.
date_as_object = not all(
self.__can_cast_timestamp_ns(col)
for col in record_batch
# Type can be date32 or date64 (plus units).
# See: https://arrow.apache.org/docs/python/api/datatypes.html
if str(col.type).startswith("date")
)

df = record_batch.to_pandas(
date_as_object=date_as_object, integer_object_nulls=True, **extra_kwargs
timestamp_as_object = not all(
self.__can_cast_timestamp_ns(col)
for col in record_batch
# Type can be timestamp (plus units and time zone).
# See: https://arrow.apache.org/docs/python/api/datatypes.html
if str(col.type).startswith("timestamp")
)

if len(record_batch) > 0:
df = record_batch.to_pandas(
date_as_object=date_as_object,
timestamp_as_object=timestamp_as_object,
integer_object_nulls=True,
types_mapper=_pandas_helpers.default_types_mapper(
date_as_object=date_as_object
),
)
else:
# Avoid "ValueError: need at least one array to concatenate" on
# older versions of pandas when converting empty RecordBatch to
# DataFrame. See: https://github.com/pandas-dev/pandas/issues/41241
df = pandas.DataFrame([], columns=record_batch.schema.names)

for column in dtypes:
df[column] = pandas.Series(df[column], dtype=dtypes[column])
df[column] = pandas.Series(df[column], dtype=dtypes[column], copy=False)

if geography_as_object:
for field in self.schema:
@@ -1954,6 +1953,15 @@

return df

@staticmethod
def __can_cast_timestamp_ns(column):
try:
column.cast("timestamp[ns]")
except pyarrow.lib.ArrowInvalid:
return False
else:
return True

# If changing the signature of this method, make sure to apply the same
# changes to job.QueryJob.to_geodataframe()
def to_geodataframe(
@@ -1962,7 +1970,6 @@
dtypes: Dict[str, Any] = None,
progress_bar_type: str = None,
create_bqstorage_client: bool = True,
date_as_object: bool = True,
geography_column: Optional[str] = None,
) -> "geopandas.GeoDataFrame":
"""Create a GeoPandas GeoDataFrame by loading all pages of a query.
@@ -2010,10 +2017,6 @@

This argument does nothing if ``bqstorage_client`` is supplied.

date_as_object (Optional[bool]):
If ``True`` (default), cast dates to objects. If ``False``, convert
to datetime64[ns] dtype.

geography_column (Optional[str]):
If there are more than one GEOGRAPHY column,
identifies which one to use to construct a geopandas
@@ -2069,7 +2072,6 @@
dtypes,
progress_bar_type,
create_bqstorage_client,
date_as_object,
geography_as_object=True,
)

@@ -2126,7 +2128,6 @@ def to_dataframe(
dtypes=None,
progress_bar_type=None,
create_bqstorage_client=True,
date_as_object=True,
geography_as_object=False,
) -> "pandas.DataFrame":
"""Create an empty dataframe.
@@ -2136,7 +2137,6 @@
dtypes (Any): Ignored. Added for compatibility with RowIterator.
progress_bar_type (Any): Ignored. Added for compatibility with RowIterator.
create_bqstorage_client (bool): Ignored. Added for compatibility with RowIterator.
date_as_object (bool): Ignored. Added for compatibility with RowIterator.

Returns:
pandas.DataFrame: An empty :class:`~pandas.DataFrame`.
@@ -2151,7 +2151,6 @@ def to_geodataframe(
dtypes=None,
progress_bar_type=None,
create_bqstorage_client=True,
date_as_object=True,
geography_column: Optional[str] = None,
) -> "pandas.DataFrame":
"""Create an empty dataframe.
@@ -2161,7 +2160,6 @@
dtypes (Any): Ignored. Added for compatibility with RowIterator.
progress_bar_type (Any): Ignored. Added for compatibility with RowIterator.
create_bqstorage_client (bool): Ignored. Added for compatibility with RowIterator.
date_as_object (bool): Ignored. Added for compatibility with RowIterator.

Returns:
pandas.DataFrame: An empty :class:`~pandas.DataFrame`.
1 change: 1 addition & 0 deletions samples/geography/requirements.txt
@@ -7,6 +7,7 @@ click==8.0.1
click-plugins==1.1.1
cligj==0.7.2
dataclasses==0.6; python_version < '3.7'
db-dtypes==0.3.0
Fiona==1.8.20
geojson==2.5.0
geopandas==0.9.0
1 change: 1 addition & 0 deletions samples/snippets/requirements.txt
@@ -1,3 +1,4 @@
db-dtypes==0.3.0
google-cloud-bigquery-storage==2.9.0
google-auth-oauthlib==0.4.6
grpcio==1.41.0
2 changes: 1 addition & 1 deletion setup.py
@@ -50,7 +50,7 @@
# Keep the no-op bqstorage extra for backward compatibility.
# See: https://github.com/googleapis/python-bigquery/issues/757
"bqstorage": [],
"pandas": ["pandas>=1.0.0"],
"pandas": ["pandas>=1.0.0", "db-dtypes>=0.3.0,<2.0.0dev"],
"geopandas": ["geopandas>=0.9.0, <1.0dev", "Shapely>=1.6.0, <2.0dev"],
"tqdm": ["tqdm >= 4.7.4, <5.0.0dev"],
"opentelemetry": [
1 change: 1 addition & 0 deletions testing/constraints-3.6.txt
@@ -5,6 +5,7 @@
#
# e.g., if setup.py has "foo >= 1.14.0, < 2.0.0dev",
# Then this file should have foo==1.14.0
db-dtypes==0.3.0
geopandas==0.9.0
google-api-core==1.29.0
google-cloud-bigquery-storage==2.0.0