-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xr.infer_freq #4033
xr.infer_freq #4033
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aulemahal this is super cool -- thanks for starting on it!
To give some feedback on your two points:
-
Yes, it was an intentional choice to use microseconds as the unit for
asi8
on the CFTimeIndex. cftime dates have microsecond precision, and this allows us to represent dates from a much longer time range thanpd.Timestamp
objects, which have nanosecond precision.-
You noted this as an option toward the end of your post, and indeed I would actually encourage you to vendor/re-write pandas's
_FrequencyInferrer
class such that it is specifically designed for cftime dates and microsecond precision. This is for a couple reasons:- Multiplying
asi8
by 1000 can lead to problems due to integer overflow, e.g. for frequencies greater than ~292 years:
In [2]: times = xr.cftime_range("2000", periods=10, freq="106945D") In [3]: xr.infer_freq(times) Out[3]: '9240048S'
_FrequencyInferrer
is private API in pandas, which means that they could change and break things we might rely on without warning. I've had a look at pandas's implementation and it doesn't seem to depend too much on other internal functionality to pandas -- in fact you've already implemented some of the more significant internal functions it depends on,build_field_sarray
andmonth_position_check
-- which means that hopefully it shouldn't be too onerous to copy the rest over.
- Multiplying
-
-
Given that I am recommending you do not subclass
_FrequencyInferrer
, I think this is no longer super relevant (i.e. usecftime.datetime.dayofwk
in your implementation where needed). I will note that we do not currently implement business-related or weekly cftime offset objects, which will limit the utility of inferring those kinds of frequencies for the time being, but you're definitely not required to add those in this PR!
@spencerkclark Thanks for the feedback! I rewrote the I expect the performance of Will write up some docs soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aulemahal, this is looking really nice. I made a first pass of a review.
I removed outputs that wouldn't match offsets defined in xarray: weekly frequencies, week-of-month (WOM) and all business-related anchors. It wouldn't be difficult to add them later on if need be and if
cftime_offsets.py
defines them. Note thatinfer_freq
is quite independent of the rest of xarray, this choice was purely so that the behavior matches the rest of xr.
Yeah this is totally fine. No one so far has requested these types of frequencies for cftime objects, and it sort of makes sense. While we can somewhat arbitrarily define them for non-standard calendars, the distinction between weekdays and weekends generally only really matters for the kinds of data that would be indexed with a standard calendar.
I expect the performance of
xr.infer_freq
to be slightly worst than pandas' since they use optimized and compiled code whereas I only used common numpy and xarray stuff. However, I assume that this part of the workflow is far from being the bottleneck of most xarray usecases...
It's indeed slower for cftime objects, but I think much of that is due to the slowness of CFTimeIndex.asi8
compared to DatetimeIndex.asi8
(not any code you wrote in this PR!). I agree this is likely not a bottleneck for most xarray use-cases.
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the updates! I think this is almost ready to go from my perspective. I just have a few more very minor requests for tests, based on the latest coverage report:
- Test that a non-monotonic index results in
None
returned as a frequency. - Test that an index with non-unique values results in
None
returned as a frequency. - Test that an index with more than one monthly delta results in
None
as a returned frequency, e.g.["2000-01-01", "2000-02-01", "2000-04-01"]
. - Test that an index with more than one quarterly delta results in
None
as a returned frequency, e.g.["2000-01-01", "2000-04-01", "2000-10-01"]
.
Finally could you add an issue requesting that microsecond and millisecond cftime offsets be added? Those would be straightforward to add for someone motivated, but are not required as part of this PR.
Co-authored-by: Spencer Clark <spencerkclark@gmail.com> Co-authored-by: Mathias Hauser <mathause@users.noreply.github.com>
@spencerkclark Here you go! Tests added for all cases you mentioned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @aulemahal — I’ll merge this in a few days if there are no further comments.
Thanks @aulemahal! Nice work. |
…o-combine * 'master' of github.com:pydata/xarray: (81 commits) use builtin python types instead of the numpy alias (pydata#4170) Revise pull request template (pydata#4039) pint support for Dataset (pydata#3975) drop eccodes in docs (pydata#4162) Update issue templates inspired/based on dask (pydata#4154) Fix failing upstream-dev build & remove docs build (pydata#4160) Improve typehints of xr.Dataset.__getitem__ (pydata#4144) provide a error summary for assert_allclose (pydata#3847) built-in accessor documentation (pydata#3988) Recommend installing cftime when time decoding fails. (pydata#4134) parameter documentation for DataArray.sel (pydata#4150) speed up map_blocks (pydata#4149) Remove outdated note from datetime accessor docstring (pydata#4148) Fix the upstream-dev pandas build failure (pydata#4138) map_blocks: Allow passing dask-backed objects in args (pydata#3818) keep attrs in reset_index (pydata#4103) Fix open_rasterio() for WarpedVRT with specified src_crs (pydata#4104) Allow non-unique and non-monotonic coordinates in get_clean_interp_index and polyfit (pydata#4099) update numpy's intersphinx url (pydata#4117) xr.infer_freq (pydata#4033) ...
isort -rc . && black . && mypy . && flake8
whats-new.rst
for all changes andapi.rst
for new APIThis PR adds a
xr.infer_freq
method to copy pandasinfer_freq
but onCFTimeIndex
objects. I tried to subclass pandas_FrequencyInferer
and to only override as little as possible.Two things are problematic right now and I would like to get feedback on how to implement them if this PR gets the dev's approval.
pd.DatetimeIndex.asi8
returns integers representing nanoseconds since 1970-1-1, whilexr.CFTimeIndex.asi8
returns microseconds. In order not to break the API, I patched the_CFTimeFrequencyInferer
to store 1000x the values. Not sure if this is the best, but it works.As of now,
xr.infer_freq
will fail on weekly indexes. This is because pandas is usingdatetime.weekday()
at some point but cftime objects do not implement that (they usedayofwk
instead). I'm not sure what to do? Cftime could implement it to completly mirror python's datetime or pandas could usedayofwk
since it's available on theTimeStamp
objects.Another option, cleaner but longer, would be to reimplement
_FrequencyInferer
from scratch. I may have time for this, cause I really think axr.infer_freq
method would be useful.