
ENH: support for reading and writing datetimes with timezones #253

Merged Oct 20, 2023 · 55 commits (changes shown from 24 commits)

Commits
e075091
minimal working pandas layer without timezones
m-richards May 7, 2023
3df7936
implement datetime_as_string toggle to get numpy layer working
m-richards May 7, 2023
d68b473
make tests pass
m-richards May 8, 2023
9aa5a8c
add tests showing existing behaviour no tz
m-richards May 8, 2023
1a2af4d
working read
m-richards May 8, 2023
fbd2898
commit my test file
m-richards May 9, 2023
127d0a7
actually fix tests with read working
m-richards May 10, 2023
016778a
good enough wip progress for now
m-richards May 21, 2023
faa0631
make these failures easier to read
m-richards May 21, 2023
a8c200e
fix for non tz
m-richards May 21, 2023
6047375
fix some tests
m-richards May 22, 2023
6061563
run pre commit
m-richards May 22, 2023
3ba42cf
maybe old pandas, can't reproduce locally
m-richards May 23, 2023
d983140
try and find something pandas 1.5 also happy with
m-richards May 23, 2023
e9993bd
lint
m-richards May 23, 2023
b6ca5cf
simple answer
m-richards May 23, 2023
05cc1cf
cleanup
m-richards May 25, 2023
a78a76c
wip, use strings to make multi timezones round trip
m-richards Jun 3, 2023
b681656
use tmp path fixture
m-richards Jun 3, 2023
3426fdc
cleanups
m-richards Jun 3, 2023
bb6fd4e
try cleanup datetime parsing
m-richards Jun 3, 2023
87419ac
more cleanup, realise we can get dt resolution
m-richards Jun 3, 2023
fc78bd9
more careful pandas 1.5 compat
m-richards Jun 3, 2023
5fab348
delete line
m-richards Jun 3, 2023
26c403a
replace write support with working datetime object solution
m-richards Aug 8, 2023
ebdb71b
fixes
m-richards Aug 8, 2023
f46e716
rewrite datetime reading to handle mixed offset to utc
m-richards Aug 8, 2023
44686f9
fix nat handling for datetime as string
m-richards Aug 8, 2023
6b946f5
don't expose datetime_as_string in pandas layer
m-richards Aug 8, 2023
ec16ed3
incorrect variable in 1.5.3 compat
m-richards Aug 8, 2023
da0639a
CLN: tidy up pandas 2.0 compat
m-richards Aug 9, 2023
85a67c2
suggested alternative implementation
m-richards Sep 24, 2023
d96d67e
code review suggestion
m-richards Sep 24, 2023
3eb70dc
Update pyogrio/tests/test_geopandas_io.py
m-richards Sep 24, 2023
c37c1ed
Merge remote-tracking branch 'upstream/main' into matt/timezones_redo
m-richards Sep 28, 2023
4064f25
Merge branches 'matt/timezones_redo' and 'matt/timezones_redo' of git…
m-richards Sep 28, 2023
3df12c0
time tests and suggestions
m-richards Sep 28, 2023
8fd30a5
remove breakpoint
m-richards Sep 28, 2023
55293c0
catch warning
m-richards Sep 30, 2023
8040c21
really need to fix my local gdal
m-richards Sep 30, 2023
fccc8fb
fix fix
m-richards Sep 30, 2023
200cc1d
Apply suggestions from code review
m-richards Sep 30, 2023
ebfc01c
add suggested exception handling
m-richards Sep 30, 2023
c8c186a
move pandas compat to _compat
m-richards Oct 7, 2023
95030c0
address review comments
m-richards Oct 7, 2023
c5c272b
Merge remote-tracking branch 'upstream/main' into matt/timezones_redo
m-richards Oct 7, 2023
086e52e
update known issues
m-richards Oct 7, 2023
2b2dd5f
reword
m-richards Oct 7, 2023
2167d0f
move documentation
m-richards Oct 17, 2023
ab0fbf6
rename field as suggested
m-richards Oct 17, 2023
e3f4d6a
Merge remote-tracking branch 'upstream/main' into matt/timezones_redo
m-richards Oct 17, 2023
0f02115
final missing gdal tz offset change
m-richards Oct 17, 2023
52a922d
Update pyogrio/tests/test_geopandas_io.py
m-richards Oct 17, 2023
7c99e51
Apply suggestions from code review
m-richards Oct 17, 2023
a5f5f9d
add changelog entry
brendan-ward Oct 20, 2023
2 changes: 1 addition & 1 deletion .github/workflows/tests-conda.yml
@@ -66,4 +66,4 @@ jobs:

- name: Test
run: |
pytest -v -r s pyogrio/tests
pytest -v --color=yes -r s pyogrio/tests
61 changes: 38 additions & 23 deletions pyogrio/_io.pyx
@@ -599,7 +599,8 @@ cdef process_fields(
object field_data_view,
object field_indexes,
object field_ogr_types,
encoding
encoding,
bint datetime_as_string
):
cdef int j
cdef int success
@@ -657,22 +658,27 @@
data[i] = bin_value[:ret_length]

elif field_type == OFTDateTime or field_type == OFTDate:
success = OGR_F_GetFieldAsDateTimeEx(
ogr_feature, field_index, &year, &month, &day, &hour, &minute, &fsecond, &timezone)

if datetime_as_string:
# defer datetime parsing to user/ pandas layer
data[i] = get_string(OGR_F_GetFieldAsString(ogr_feature, field_index), encoding=encoding)
else:
success = OGR_F_GetFieldAsDateTimeEx(
ogr_feature, field_index, &year, &month, &day, &hour, &minute, &fsecond, &timezone)

ms, ss = math.modf(fsecond)
second = int(ss)
# fsecond has millisecond accuracy
microsecond = round(ms * 1000) * 1000
ms, ss = math.modf(fsecond)
second = int(ss)
# fsecond has millisecond accuracy
microsecond = round(ms * 1000) * 1000

if not success:
data[i] = np.datetime64('NaT')
if not success:
data[i] = np.datetime64('NaT')

elif field_type == OFTDate:
data[i] = datetime.date(year, month, day).isoformat()
elif field_type == OFTDate:
data[i] = datetime.date(year, month, day).isoformat()

elif field_type == OFTDateTime:
data[i] = datetime.datetime(year, month, day, hour, minute, second, microsecond).isoformat()
elif field_type == OFTDateTime:
data[i] = datetime.datetime(year, month, day, hour, minute, second, microsecond).isoformat()
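
The millisecond handling in the non-string branch above can be sketched in plain Python (same arithmetic as the Cython code; the literal values are illustrative):

```python
import datetime
import math

# GDAL reports seconds as a float with millisecond accuracy
fsecond = 3.123
ms, ss = math.modf(fsecond)            # fractional and whole parts: ~0.123, 3.0
second = int(ss)
microsecond = round(ms * 1000) * 1000  # 123 ms -> 123000 microseconds

dt = datetime.datetime(2020, 1, 1, 9, 0, second, microsecond)
print(dt.isoformat())  # 2020-01-01T09:00:03.123000
```

Rounding at millisecond precision before scaling to microseconds avoids float noise from `modf` (e.g. `0.12299999…`) leaking into the result.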


@cython.boundscheck(False) # Deactivate bounds checking
@@ -685,7 +691,8 @@ cdef get_features(
uint8_t force_2d,
int skip_features,
int num_features,
uint8_t return_fids
uint8_t return_fids,
bint datetime_as_string
):

cdef OGRFeatureH ogr_feature = NULL
@@ -718,7 +725,9 @@

field_data = [
np.empty(shape=(num_features, ),
dtype=fields[field_index,3]) for field_index in range(n_fields)
dtype = ("object" if datetime_as_string and
fields[field_index,3].startswith("datetime") else fields[field_index,3])
) for field_index in range(n_fields)
]

field_data_view = [field_data[field_index][:] for field_index in range(n_fields)]
@@ -758,7 +767,7 @@

process_fields(
ogr_feature, i, n_fields, field_data, field_data_view,
field_indexes, field_ogr_types, encoding
field_indexes, field_ogr_types, encoding, datetime_as_string
)
i += 1
finally:
@@ -788,7 +797,8 @@ cdef get_features_by_fid(
object[:,:] fields,
encoding,
uint8_t read_geometry,
uint8_t force_2d
uint8_t force_2d,
bint datetime_as_string
):

cdef OGRFeatureH ogr_feature = NULL
@@ -811,10 +821,11 @@
n_fields = fields.shape[0]
field_indexes = fields[:,0]
field_ogr_types = fields[:,1]

field_data = [
np.empty(shape=(count, ),
dtype=fields[field_index,3]) for field_index in range(n_fields)
dtype=("object" if datetime_as_string and fields[field_index,3].startswith("datetime")
else fields[field_index,3]))
for field_index in range(n_fields)
]

field_data_view = [field_data[field_index][:] for field_index in range(n_fields)]
@@ -837,7 +848,7 @@

process_fields(
ogr_feature, i, n_fields, field_data, field_data_view,
field_indexes, field_ogr_types, encoding
field_indexes, field_ogr_types, encoding, datetime_as_string
)
finally:
if ogr_feature != NULL:
@@ -939,7 +950,9 @@ def ogr_read(
object fids=None,
str sql=None,
str sql_dialect=None,
int return_fids=False):
int return_fids=False,
bint datetime_as_string=False
):

cdef int err = 0
cdef const char *path_c = NULL
@@ -1022,6 +1035,7 @@
encoding,
read_geometry=read_geometry and geometry_type is not None,
force_2d=force_2d,
datetime_as_string=datetime_as_string
)

# bypass reading fids since these should match fids used for read
@@ -1051,13 +1065,15 @@
force_2d=force_2d,
skip_features=skip_features,
num_features=num_features,
return_fids=return_fids
return_fids=return_fids,
datetime_as_string=datetime_as_string
)

meta = {
'crs': crs,
'encoding': encoding,
'fields': fields[:,2], # return only names
'dtypes':fields[:,3],
'geometry_type': geometry_type,
}

@@ -1796,7 +1812,6 @@ def ogr_write(
if np.isnat(field_value):
OGR_F_SetFieldNull(ogr_feature, field_idx)
else:
# TODO: add support for timezones
datetime = field_value.astype("datetime64[ms]").item()
OGR_F_SetFieldDateTimeEx(
ogr_feature,
26 changes: 24 additions & 2 deletions pyogrio/geopandas.py
@@ -2,6 +2,7 @@
from pyogrio.raw import DRIVERS_NO_MIXED_SINGLE_MULTI, DRIVERS_NO_MIXED_DIMENSIONS
from pyogrio.raw import detect_driver, read, read_arrow, write
from pyogrio.errors import DataSourceError
from packaging.version import Version


def _stringify_path(path):
@@ -19,6 +20,18 @@ def _stringify_path(path):
return path


def _try_parse_datetime(ser):
import pandas as pd # only called in a block where pandas is known to be installed

if Version(pd.__version__) >= Version("2.0.0"):
res = pd.to_datetime(ser, format="ISO8601")
if res.dtype != "object":
res = res.dt.as_unit("ms")
(Review comment — jorisvandenbossche, Member:)

We might need to detect whether there are timezones (e.g. check the first few strings for a "+" or "-"), because if so we have to pass utc=True to avoid getting object dtype:

In [20]: pd.to_datetime(["2021/01/02 01:02:03+01:00", "2021/01/02 01:02:03+01:00"], format="ISO8601")
Out[20]: DatetimeIndex(['2021-01-02 01:02:03+01:00', '2021-01-02 01:02:03+01:00'], dtype='datetime64[ns, UTC+01:00]', freq=None)

In [21]: pd.to_datetime(["2021/01/02 01:02:03+01:00", "2021/01/02 01:02:03+02:00"], format="ISO8601")
Out[21]: Index([2021-01-02 01:02:03+01:00, 2021-01-02 01:02:03+02:00], dtype='object')

(With longer time series it is very common to have mixed offsets for the same timezone, i.e. values with and without DST.)

(Reply — m-richards, Member, Author:)

That's a good point. I checked what we did in geopandas for the fiona IO and did the same thing here to be consistent: read without utc=True, and if the result is still object dtype, read again with utc=True and see if that makes a difference.

I could also check for "+" and "-" as you suggest; I'd imagine that's faster, but maybe there are edge cases?

return res
else:
return pd.to_datetime(ser, yearfirst=True)


def read_dataframe(
path_or_buffer,
/,
@@ -146,6 +159,8 @@ def read_dataframe(
path_or_buffer = _stringify_path(path_or_buffer)

read_func = read_arrow if use_arrow else read
if not use_arrow and "datetime_as_string" not in kwargs:
kwargs["datetime_as_string"] = True
result = read_func(
path_or_buffer,
layer=layer,
@@ -182,8 +197,10 @@
index = pd.Index(index, name="fid")
else:
index = None

df = pd.DataFrame(data, columns=columns, index=index)
for dtype, c in zip(meta["dtypes"], df.columns):
if dtype.startswith("datetime"):
df[c] = _try_parse_datetime(df[c])

if geometry is None or not read_geometry:
return df
@@ -327,7 +344,12 @@
field_data = []
field_mask = []
for name in fields:
col = df[name].values
ser = df[name]
col = ser.values
if isinstance(ser.dtype, pd.DatetimeTZDtype):
# Deal with datetimes with timezones as strings
col = ser.astype(str)

if isinstance(col, pd.api.extensions.ExtensionArray):
from pandas.arrays import IntegerArray, FloatingArray, BooleanArray

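The fallback strategy discussed in the review thread above — parse without utc=True, then retry with utc=True when mixed offsets leave object dtype — can be sketched as follows (a sketch assuming pandas >= 2.0 for format="ISO8601"; `try_parse_datetime` is an illustrative name, not the helper this PR adds):

```python
import pandas as pd

def try_parse_datetime(values):
    """Parse ISO 8601 strings, falling back to UTC when offsets are mixed."""
    try:
        res = pd.to_datetime(values, format="ISO8601")
    except (ValueError, TypeError):
        # newer pandas raises for mixed offsets unless utc=True is passed
        return pd.to_datetime(values, format="ISO8601", utc=True)
    if res.dtype == "object":
        # older pandas instead silently falls back to object dtype; retry as UTC
        res = pd.to_datetime(values, format="ISO8601", utc=True)
    return res

# a single fixed offset keeps a tz-aware datetime dtype
single = try_parse_datetime(["2021-01-02 01:02:03+01:00", "2021-01-03 04:05:06+01:00"])
# mixed offsets (e.g. DST transitions) fall back to UTC
mixed = try_parse_datetime(["2021-01-02 01:02:03+01:00", "2021-06-02 01:02:03+02:00"])
```

The try/except makes the sketch robust across pandas versions, since the mixed-offset case changed from returning object dtype to raising.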
7 changes: 7 additions & 0 deletions pyogrio/raw.py
@@ -53,6 +53,7 @@ def read(
sql=None,
sql_dialect=None,
return_fids=False,
datetime_as_string=False,
**kwargs,
):
"""Read OGR data source into numpy arrays.
@@ -108,6 +109,10 @@
number of features using FIDs is also driver specific.
return_fids : bool, optional (default: False)
If True, will return the FIDs of the feature that were read.
datetime_as_string : bool, optional (default: False)
If True, will return datetime dtypes as detected by GDAL as a string
array, instead of a datetime64 array (used to extract timezone info).

**kwargs
Additional driver-specific dataset open options passed to OGR. Invalid
options will trigger a warning.
@@ -150,6 +155,7 @@
sql_dialect=sql_dialect,
return_fids=return_fids,
dataset_kwargs=dataset_kwargs,
datetime_as_string=datetime_as_string,
)
finally:
if buffer is not None:
@@ -387,6 +393,7 @@ def write(
layer_options=None,
**kwargs,
):
kwargs.pop("dtypes", None)
if geometry_type is None:
raise ValueError("geometry_type must be provided")

5 changes: 5 additions & 0 deletions pyogrio/tests/conftest.py
@@ -97,3 +97,8 @@ def test_ogr_types_list():
@pytest.fixture(scope="session")
def test_datetime():
return _data_dir / "test_datetime.geojson"


@pytest.fixture(scope="session")
def test_datetime_tz():
return _data_dir / "test_datetime_tz.geojson"
7 changes: 7 additions & 0 deletions pyogrio/tests/fixtures/test_datetime_tz.geojson
@@ -0,0 +1,7 @@
{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "col": "2020-01-01T09:00:00.123-05:00" }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } },
{ "type": "Feature", "properties": { "col": "2020-01-01T10:00:00-05:00" }, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } }
]
}
51 changes: 49 additions & 2 deletions pyogrio/tests/test_geopandas_io.py
@@ -2,7 +2,6 @@
from datetime import datetime
import os
from packaging.version import Version

import numpy as np
import pytest

@@ -18,7 +17,12 @@

try:
import pandas as pd
from pandas.testing import assert_frame_equal, assert_index_equal
from pandas.testing import (
assert_frame_equal,
assert_index_equal,
assert_series_equal,
)
import pytz

import geopandas as gp
from geopandas.array import from_wkt
@@ -139,6 +143,49 @@ def test_read_datetime(test_fgdb_vsi):
assert df.SURVEY_DAT.dtype.name == "datetime64[ns]"


def test_read_datetime_tz(test_datetime_tz, tmp_path):
df = read_dataframe(test_datetime_tz)
raw_expected = ["2020-01-01T09:00:00.123-05:00", "2020-01-01T10:00:00-05:00"]

if Version(pd.__version__) >= Version("2.0.0"):
expected = pd.to_datetime(raw_expected, format="ISO8601").as_unit("ms")
else:
expected = pd.to_datetime(raw_expected)
expected = pd.Series(expected, name="col")
(Review suggestion — Member:)

Suggested change:
-    expected = pd.Series(expected, name="col")
+    expected = pd.Series(expected, name="datetime_col")


assert_series_equal(df.col, expected)
# test write and read round trips
# TODO gpkg doesn't work here, at least for my local gdal, writes NaT
fpath = tmp_path / "test.geojson"
write_dataframe(df, fpath)
df_read = read_dataframe(fpath)
assert_series_equal(df_read.col, expected)


def test_write_datetime_mixed_offset(tmp_path):
# Summer Time (GMT+11), standard time (GMT+10)
dates = ["2023-01-01 11:00:01.111", "2023-06-01 10:00:01.111"]
tz = pytz.timezone("Australia/Sydney")
ser_naive = pd.Series(pd.to_datetime(dates), name="dates")
ser_localised = ser_naive.dt.tz_localize(tz)
df = gp.GeoDataFrame(
{"dates": ser_localised, "geometry": [Point(1, 1), Point(1, 1)]}
)
fpath = tmp_path / "test.geojson"
write_dataframe(df, fpath)
df_no_tz = read_dataframe(
fpath, datetime_as_string=False
) # TODO this shouldn't be called datetime as string in the pandas layer,
# should it even be accessible?
(Review comment — Member:)

Unless we have some users asking for it, I wouldn't expose it in read_dataframe (so only in raw.read), just to keep read_dataframe simpler.

# datetime_as_string=False ignores tz info, returns datetime objects
expected = ser_naive.astype("datetime64[ms]")
assert_series_equal(expected, df_no_tz["dates"])
# datetime_as_string=True keeps tz info, but pandas can't handle multiple offsets
# unless given a timezone to identify them with -> returned as strings
df_local = read_dataframe(fpath, datetime_as_string=True)
assert_series_equal(ser_localised.astype("object"), df_local["dates"])
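
The mixed offsets this test relies on come from DST: Australia/Sydney is UTC+11 (AEDT) in January and UTC+10 (AEST) in June. A stdlib-only sketch of the same localisation (the test itself uses pytz; `zoneinfo` needs Python >= 3.9 and available tz data):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

sydney = ZoneInfo("Australia/Sydney")
jan = datetime(2023, 1, 1, 11, 0, 1, 111000, tzinfo=sydney)  # summer time (AEDT)
jun = datetime(2023, 6, 1, 10, 0, 1, 111000, tzinfo=sydney)  # standard time (AEST)

print(jan.utcoffset())   # 11:00:00
print(jun.utcoffset())   # 10:00:00
print(jan.isoformat())   # 2023-01-01T11:00:01.111000+11:00
print(jun.isoformat())   # 2023-06-01T10:00:01.111000+10:00
```

Because the two offsets differ, round-tripping through a file stores "+11:00" and "+10:00" strings, which is exactly the case where pandas falls back to object dtype without utc=True.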


def test_read_null_values(test_fgdb_vsi):
df = read_dataframe(test_fgdb_vsi, read_geometry=False)

Expand Down
13 changes: 13 additions & 0 deletions pyogrio/tests/test_raw_io.py
@@ -613,6 +613,19 @@ def test_read_datetime_millisecond(test_datetime):
assert field[1] == np.datetime64("2020-01-01 10:00:00.000")


def test_read_datetime_tz(test_datetime_tz):
field = read(test_datetime_tz)[3][0]
assert field.dtype == "datetime64[ms]"
# timezone is ignored in numpy layer
assert field[0] == np.datetime64("2020-01-01 09:00:00.123")
assert field[1] == np.datetime64("2020-01-01 10:00:00.000")
field = read(test_datetime_tz, datetime_as_string=True)[3][0]
assert field.dtype == "object"
# GDAL doesn't return strings in ISO format (yet)
assert field[0] == "2020/01/01 09:00:00.123-05"
assert field[1] == "2020/01/01 10:00:00-05"
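
As the comment notes, GDAL returns strings like "2020/01/01 09:00:00.123-05" — slashes instead of dashes, and a bare-hour UTC offset. A hypothetical helper (not part of pyogrio's API) that normalises such strings into aware datetime objects:

```python
import re
from datetime import datetime

def parse_gdal_datetime(s):
    """Parse a GDAL-style datetime string such as '2020/01/01 09:00:00.123-05'.

    Illustrative only: normalises slashes to dashes and pads a bare-hour
    offset like '-05' to '-0500' so that strptime's %z accepts it.
    """
    s = s.replace("/", "-")
    if re.search(r"[+-]\d{2}$", s):
        s = s + "00"
    for fmt in ("%Y-%m-%d %H:%M:%S.%f%z", "%Y-%m-%d %H:%M:%S%z",
                "%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised datetime string: {s!r}")

print(parse_gdal_datetime("2020/01/01 09:00:00.123-05").isoformat())
# 2020-01-01T09:00:00.123000-05:00
```

Trying several formats in order covers the combinations GDAL can emit: with or without fractional seconds, and with or without a UTC offset.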


@pytest.mark.parametrize("ext", ["gpkg", "geojson"])
def test_read_write_null_geometry(tmp_path, ext):
# Point(0, 0), null