xr.infer_freq #4033

aulemahal · 2020-05-05T19:39:05Z

Tests added
Passes isort -rc . && black . && mypy . && flake8
Fully documented, including whats-new.rst for all changes and api.rst for new API

This PR adds a xr.infer_freq method to copy pandas infer_freq but on CFTimeIndex objects. I tried to subclass pandas _FrequencyInferer and to only override as little as possible.

Two things are problematic right now and I would like to get feedback on how to implement them if this PR gets the dev's approval.

pd.DatetimeIndex.asi8 returns integers representing nanoseconds since 1970-1-1, while xr.CFTimeIndex.asi8 returns microseconds. In order not to break the API, I patched the _CFTimeFrequencyInferer to store 1000x the values. Not sure if this is the best, but it works.
As of now, xr.infer_freq will fail on weekly indexes. This is because pandas is using datetime.weekday() at some point but cftime objects do not implement that (they use dayofwk instead). I'm not sure what to do? Cftime could implement it to completly mirror python's datetime or pandas could use dayofwk since it's available on the TimeStamp objects.

Another option, cleaner but longer, would be to reimplement _FrequencyInferer from scratch. I may have time for this, cause I really think a xr.infer_freq method would be useful.

…nto xr-infer-freq

spencerkclark

@aulemahal this is super cool -- thanks for starting on it!

To give some feedback on your two points:

Yes, it was an intentional choice to use microseconds as the unit for asi8 on the CFTimeIndex. cftime dates have microsecond precision, and this allows us to represent dates from a much longer time range than pd.Timestamp objects, which have nanosecond precision.
- You noted this as an option toward the end of your post, and indeed I would actually encourage you to vendor/re-write pandas's _FrequencyInferrer class such that it is specifically designed for cftime dates and microsecond precision. This is for a couple reasons:
  - Multiplying asi8 by 1000 can lead to problems due to integer overflow, e.g. for frequencies greater than ~292 years:
```
In [2]: times = xr.cftime_range("2000", periods=10, freq="106945D")

In [3]: xr.infer_freq(times)
Out[3]: '9240048S'
```
  - _FrequencyInferrer is private API in pandas, which means that they could change and break things we might rely on without warning. I've had a look at pandas's implementation and it doesn't seem to depend too much on other internal functionality to pandas -- in fact you've already implemented some of the more significant internal functions it depends on, build_field_sarray and month_position_check -- which means that hopefully it shouldn't be too onerous to copy the rest over.
Given that I am recommending you do not subclass _FrequencyInferrer, I think this is no longer super relevant (i.e. use cftime.datetime.dayofwk in your implementation where needed). I will note that we do not currently implement business-related or weekly cftime offset objects, which will limit the utility of inferring those kinds of frequencies for the time being, but you're definitely not required to add those in this PR!

aulemahal · 2020-05-14T20:02:34Z

@spencerkclark Thanks for the feedback!

I rewrote the _CFTimeFrequencyInferer class independently of pandas private objects. I removed outputs that wouldn't match offsets defined in xarray: weekly frequencies, week-of-month (WOM) and all business-related anchors. It wouldn't be difficult to add them later on if need be and if cftime_offsets.py defines them. Note that infer_freq is quite independent of the rest of xarray, this choice was purely so that the behavior matches the rest of xr. I did leave the sub-seconds frequencies (L for ms and U for µs).

I expect the performance of xr.infer_freq to be slightly worst than pandas' since they use optimized and compiled code whereas I only used common numpy and xarray stuff. However, I assume that this part of the workflow is far from being the bottleneck of most xarray usecases...

Will write up some docs soon.

spencerkclark

Thanks @aulemahal, this is looking really nice. I made a first pass of a review.

I removed outputs that wouldn't match offsets defined in xarray: weekly frequencies, week-of-month (WOM) and all business-related anchors. It wouldn't be difficult to add them later on if need be and if cftime_offsets.py defines them. Note that infer_freq is quite independent of the rest of xarray, this choice was purely so that the behavior matches the rest of xr.

Yeah this is totally fine. No one so far has requested these types of frequencies for cftime objects, and it sort of makes sense. While we can somewhat arbitrarily define them for non-standard calendars, the distinction between weekdays and weekends generally only really matters for the kinds of data that would be indexed with a standard calendar.

I expect the performance of xr.infer_freq to be slightly worst than pandas' since they use optimized and compiled code whereas I only used common numpy and xarray stuff. However, I assume that this part of the workflow is far from being the bottleneck of most xarray usecases...

It's indeed slower for cftime objects, but I think much of that is due to the slowness of CFTimeIndex.asi8 compared to DatetimeIndex.asi8 (not any code you wrote in this PR!). I agree this is likely not a bottleneck for most xarray use-cases.

xarray/tests/test_cftimeindex.py

xarray/coding/frequencies.py

xarray/tests/test_cftimeindex.py

xarray/coding/frequencies.py

Co-authored-by: Spencer Clark <spencerkclark@gmail.com>

spencerkclark

Thanks for all the updates! I think this is almost ready to go from my perspective. I just have a few more very minor requests for tests, based on the latest coverage report:

Test that a non-monotonic index results in None returned as a frequency.
Test that an index with non-unique values results in None returned as a frequency.
Test that an index with more than one monthly delta results in None as a returned frequency, e.g. ["2000-01-01", "2000-02-01", "2000-04-01"].
Test that an index with more than one quarterly delta results in None as a returned frequency, e.g. ["2000-01-01", "2000-04-01", "2000-10-01"].

Finally could you add an issue requesting that microsecond and millisecond cftime offsets be added? Those would be straightforward to add for someone motivated, but are not required as part of this PR.

xarray/coding/frequencies.py

doc/weather-climate.rst

doc/whats-new.rst

doc/weather-climate.rst

Co-authored-by: Spencer Clark <spencerkclark@gmail.com> Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

aulemahal · 2020-05-26T19:01:36Z

@spencerkclark Here you go! Tests added for all cases you mentioned.

spencerkclark

Looks great @aulemahal — I’ll merge this in a few days if there are no further comments.

xarray/coding/frequencies.py

spencerkclark · 2020-05-30T18:11:36Z

Thanks @aulemahal! Nice work.

…o-combine * 'master' of github.com:pydata/xarray: (81 commits) use builtin python types instead of the numpy alias (pydata#4170) Revise pull request template (pydata#4039) pint support for Dataset (pydata#3975) drop eccodes in docs (pydata#4162) Update issue templates inspired/based on dask (pydata#4154) Fix failing upstream-dev build & remove docs build (pydata#4160) Improve typehints of xr.Dataset.__getitem__ (pydata#4144) provide a error summary for assert_allclose (pydata#3847) built-in accessor documentation (pydata#3988) Recommend installing cftime when time decoding fails. (pydata#4134) parameter documentation for DataArray.sel (pydata#4150) speed up map_blocks (pydata#4149) Remove outdated note from datetime accessor docstring (pydata#4148) Fix the upstream-dev pandas build failure (pydata#4138) map_blocks: Allow passing dask-backed objects in args (pydata#3818) keep attrs in reset_index (pydata#4103) Fix open_rasterio() for WarpedVRT with specified src_crs (pydata#4104) Allow non-unique and non-monotonic coordinates in get_clean_interp_index and polyfit (pydata#4099) update numpy's intersphinx url (pydata#4117) xr.infer_freq (pydata#4033) ...

aulemahal added 4 commits May 5, 2020 14:58

xr.infer_freq and related code

c355708

Merge branch 'master' into xr-infer-freq

a37df84

Formatting and comments

1391aa1

Merge branch 'xr-infer-freq' of https://github.com/aulemahal/xarray i…

7671eab

…nto xr-infer-freq

TomNicholas requested a review from spencerkclark May 7, 2020 12:37

spencerkclark reviewed May 9, 2020

View reviewed changes

aulemahal added 4 commits May 14, 2020 12:01

Merge branch 'master' into xr-infer-freq

a9e317f

Rewrite _CFTimeFrequencyInferer independently of pandas

6de204c

Syntax and add frequency.py file

5be8966

Fix tests and month_deltas

204c15f

Require cftime 1.1.0 for the test

4d07e3d

spencerkclark reviewed May 17, 2020

View reviewed changes

aulemahal and others added 8 commits May 25, 2020 15:59

Apply suggestions from code review

8e11f84

Co-authored-by: Spencer Clark <spencerkclark@gmail.com>

Changes following code review

81f5562

Docs

b2005be

Docs

9806e65

Black

7f864ab

Fix tests for requiring cftime 1.1.0

9752d79

Update whats-new

1e23d5d

Merge branch 'master' into xr-infer-freq

3846e1a

aulemahal changed the title ~~[WIP] xr.infer_freq~~ xr.infer_freq May 25, 2020

aulemahal requested a review from spencerkclark May 25, 2020 22:43

spencerkclark reviewed May 26, 2020

View reviewed changes

xarray/coding/frequencies.py Show resolved Hide resolved

doc/weather-climate.rst Outdated Show resolved Hide resolved

doc/whats-new.rst Outdated Show resolved Hide resolved

mathause reviewed May 26, 2020

View reviewed changes

doc/weather-climate.rst Outdated Show resolved Hide resolved

aulemahal and others added 2 commits May 26, 2020 14:46

Apply suggestions from code review

c49bece

Co-authored-by: Spencer Clark <spencerkclark@gmail.com> Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>

Add invalid input tests for better coverage

94f3634

aulemahal mentioned this pull request May 26, 2020

CFTime offsets missing for milli- and micro-seconds #4097

Closed

spencerkclark approved these changes May 27, 2020

View reviewed changes

dcherian mentioned this pull request May 27, 2020

0.16.0 release #4031

Closed

23 tasks

spencerkclark reviewed May 29, 2020

View reviewed changes

xarray/coding/frequencies.py Show resolved Hide resolved

Fix link in whats-new.rst

9a553ed

spencerkclark merged commit fd9e620 into pydata:master May 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xr.infer_freq #4033

xr.infer_freq #4033

aulemahal commented May 5, 2020 •

edited by spencerkclark

Loading

spencerkclark left a comment

aulemahal commented May 14, 2020

spencerkclark left a comment

spencerkclark left a comment

aulemahal commented May 26, 2020

spencerkclark left a comment

spencerkclark commented May 30, 2020

xr.infer_freq #4033

xr.infer_freq #4033

Conversation

aulemahal commented May 5, 2020 • edited by spencerkclark Loading

spencerkclark left a comment

Choose a reason for hiding this comment

aulemahal commented May 14, 2020

spencerkclark left a comment

Choose a reason for hiding this comment

spencerkclark left a comment

Choose a reason for hiding this comment

aulemahal commented May 26, 2020

spencerkclark left a comment

Choose a reason for hiding this comment

spencerkclark commented May 30, 2020

aulemahal commented May 5, 2020 •

edited by spencerkclark

Loading