Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove datetime workaround logic #273

Merged
merged 17 commits into from
Jun 4, 2018

Conversation

spencerkclark
Copy link
Collaborator

@spencerkclark spencerkclark commented May 19, 2018

Closes #94
Closes #98
Closes #243

Now that Xarray version 0.10.4 has been released, we can begin the process of removing our logic for working around Timestamp limitations in pandas. For the most part this was actually pretty painless.

A few notes:

  • One breaking change is that now instead of using datetime.datetime objects as input for specifying dates or time ranges, we require either np.datetime64 or cftime.datetime objects. This is because some dates, e.g. February 30th in a 360 day calendar, are not representable as datetime.datetime objects. Updated to preserve support for datetime.datetime objects.
  • Eventually it would be good to accept strings as inputs (i.e. for partial datetime string indexing), but I think we should save that for a future PR. Strings are nice, because they are calendar-agnostic. I went ahead and added support for this in this PR.
  • One drawback of adopting this full-on is that resample is currently not possible on data indexed by a CFTimeIndex. We do use resample when computing transient eddy quantities (to downsample to a monthly mean time series). In the meantime though, we might think about creating a one-off function for that purpose (i.e. not worry about supporting general frequency resampling and only handle the monthly case).
  • Since cftime.datetime objects are not exact to microsecond precision, I needed to add a tolerance to assert_has_data_for_time as discussed in Selecting datetimes within a certain range (days) without having to specify the start and end dates precisely (hours or higher) #243.
  • Lastly, I still need to update the documentation to reflect these changes (and add a what's new entry).

@spencerahill does this seem reasonable (in particular the breaking change)? If it does, I'll try to clean things up over the next few days.

@spencerkclark
Copy link
Collaborator Author

@spencerahill given conda-forge/cftime-feedstock#2 (comment) I think it makes sense to switch to testing on the Win-64 platform rather than Win-32; do you agree?

@spencerahill
Copy link
Owner

Awesome! I wanted to give you a quick response on a few things, but then unfortunately this next week is really crazy for me. So it's unlikely that I'll be able to give a proper review until the week of the 27th. Sorry about that.

Don't forget to pin xarray version to >=0.10.4

Win-64 platform rather than Win-32; do you agree?

Yes, agreed.

Eventually it would be good to accept strings as inputs (i.e. for partial datetime string indexing), but I think we should save that for a future PR.

Do you have a sense of how difficult this will be to implement? I don't want to blow up the scope of this PR, which is already necessarily pretty large. But this would be a great feature and would be nice to go straight to there instead of requiring switching from Python's datetime objects to numpy's and/or cftime's.

Or, conversely, would it be possible to still accept datetime.datetime in cases with standard calendars? I haven't kept up with the technical details of the calendars.

We do use resample when computing transient eddy quantities (to downsample to a monthly mean time series). In the meantime though, we might think about creating a one-off function for that purpose (i.e. not worry about supporting general frequency resampling and only handle the monthly case).

Given our intention of generalizing our time reduction functionality soon (#208), how does that modify this picture? Conversely, what are the prospects for getting resample support added to CFTimeIndex?


One last thought: we're probably due for a new release, v0.3, once we get this in...even without it we've already done quite a bit since 0.2. So probably in June or July.

@spencerkclark
Copy link
Collaborator Author

spencerkclark commented May 19, 2018

next week is really crazy for me. So it's unlikely that I'll be able to give a proper review until the week of the 27th. Sorry about that.

Totally fine! Thanks for taking the time to write these initial comments.

Do you have a sense of how difficult this will be to implement? I don't want to blow up the scope of this PR, which is already necessarily pretty large. But this would be a great feature and would be nice to go straight to there instead of requiring switching from Python's datetime objects to numpy's and/or cftime's.

For the most part adding string support hopefully shouldn't be too hard. The one place I'm a little unsure of is how to handle things in assert_has_data_for_time (our check to make sure that data exists for the full range specified by the user). To handle string inputs there I would need something akin to xarray.coding.cftimeindex._parsed_string_to_bounds to convert those partial strings to cftime.datetime or np.datetime64 date bounds. One simple option would be to just disable that check for string inputs.

Or, conversely, would it be possible to still accept datetime.datetime in cases with standard calendars? I haven't kept up with the technical details of the calendars.

I think this is reasonably straightforward. We can support all three types; if a datetime.datetime is provided, we will convert it to the type that the dates from the underlying dataset are converted to. If you need to specify dates that cannot be represented as datetime.datetime objects (which admittedly is rare) you'll just need to use cftime.datetime objects.

Given our intention of generalizing our time reduction functionality soon (#208), how does that modify this picture? Conversely, what are the prospects for getting resample support added to CFTimeIndex?

I think generalizing the time reduction functionality is somewhat orthogonal to this. If we wanted to support eddy calculations we would need support of monthly resampling in some fashion. Getting general resample support added for CFTimeIndex in xarray is doable. Stephan has outlined roughly how this might look in pydata/xarray#1252 (comment).

One last thought: we're probably due for a new release, v0.3, once we get this in...even without it we've already done quite a bit since 0.2. So probably in June or July.

I agree!

aospy will now automatically convert user-input date ranges to the types
required for indexing the data read in.  Accepted types are:
- np.datetime64
- datetime.datetime
- cftime.datetime
@spencerahill
Copy link
Owner

Thanks for going ahead with the datetime.datetime and string implementations! Full review to come; for now a few top-level thoughts:

I think generalizing the time reduction functionality is somewhat orthogonal to this. If we wanted to support eddy calculations we would need support of monthly resampling in some fashion.

OK. In that case I am receptive to your initial idea of "creating a one-off function for that purpose"

In thinking about this, it occurs to me that the eddy functionality is one of the uglier bits of Calc. We should decide what to do with it regardless of this PR. I opened #274 to track that. But we shouldn't let that hold up this PR --- maybe the wisest move is to just replace the resample call there with a NotImplementedError.

Getting general resample support added for CFTimeIndex in xarray is doable. Stephan has outlined roughly how this might look in pydata/xarray#1252 (comment).

OK. I opened pydata/xarray#2191 to keep tabs on that endeavor.

@spencerkclark
Copy link
Collaborator Author

@spencerahill thanks for opening those issues! I'm currently cleaning up the tests here (moving test_calc_basic.py and test_data_loader.py away from unit test classes). Once I'm done with that, I'll ping you for a full review.

@spencerkclark
Copy link
Collaborator Author

@spencerahill I've finished up cleaning up the tests; the main motivation for the refactor was so that I could parametrize the tests using different types of dates.

In thinking about this, it occurs to me that the eddy functionality is one of the uglier bits of Calc. We should decide what to do with it regardless of this PR. I opened #274 to track that. But we shouldn't let that hold up this PR --- maybe the wisest move is to just replace the resample call there with a NotImplementedError.

Sounds good -- I've left things alone with that logic in this PR for now. I was not using that functionality either. Whatever worked with data indexed by a DatetimeIndex will still work here, and it will just error if a CFTimeIndex is used.

Copy link
Owner

@spencerahill spencerahill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good overall. Most of my comments are about docs/comments or are clarifying questions.

I do have one bigger question though: why are strings handled differently throughout than the other time types? In my (quite possibly naive) mind, I was thinking you'd cast string arguments to a datetime-like type (not sure which one) right away, and then all the other functions/methods could count on a datetime-like type. As written, there is if/else logic for strings vs. datetime-like objects in a lot of places.

@@ -180,7 +179,6 @@ def _prep_time_data(ds):
The processed Dataset and minimum and maximum years in the loaded data
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update docstring now that only ds is returned


Parameters
----------
date : str
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str -> datetime-like object or str


Parameters
----------
date : datetime-like object
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datetime-like obj or str


Returns
-------
pd.Timestamp or cftime.datetime
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe the conditions under which either type is returned.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this; this function was was leftover from my earlier commits and is no longer even used! I'll go ahead and delete it.



def maybe_convert_to_index_date_type(index, date):
"""Convert a datetime-like object to the appropriate type
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Here and elsewhere, let's end all the one-line summaries in the docstrings with a period. I think it's a little cleaner, and it's more consistent with other packages docs. I realize we probably haven't been 100% consistent on this in the past.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of "appropriate type" isn't clear in this docstring...one or two sentences expanding this would help in cases like this one where a developer is returning to this after a while or is new to it.

@@ -421,6 +315,94 @@ def test_assert_has_data_for_time():
_assert_has_data_for_time(da, start_date_bad, end_date_bad)


CFTIME_DATE_TYPES = {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see here you test all of the calendar types (c.f. my other comment). If you think that it's sufficient to do so at this lower level, then fine by me.

Copy link
Collaborator Author

@spencerkclark spencerkclark May 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right to strive to be thorough in testing all kinds of cftime.datetime objects at this low level. I could be more thorough in these tests (particularly for infer_year and maybe_convert_to_index_date_type). I'll improve those.

start_date_bad = '1999-12-31'
end_date_bad = '2000-04-01'

# With strings these checks are disabled
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this difference in behavior between strings and datetime-like types?

Copy link
Collaborator Author

@spencerkclark spencerkclark May 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial datetime strings are somewhat complicated. It involves a fair bit of code (and tests) to fully support parsing them and interpreting what they mean as bounds. While much of that code exists in xarray (see here in cftimeindex.py), it is private API. At least to me, this sanity check in aospy is not important enough to merit suggesting that xarray support parse_iso8601, _parse_iso8601_with_reso , and _parsed_string_to_bounds as public API. While they are used by xarray, these functions are outside its scope (i.e. xarray is not a date-parsing library).

If we want to support this check with strings, it then leaves us with two options:

  • Port all of that code (and tests) to aospy.
  • Use those functions from xarray, despite the fact that they are private API.

In your opinion, are either of those worth it?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's helpful. Is this the same reason for the different handling of strings vs. other types elsewhere in the code? Frankly, it does concern me a little that, for a user, behavior could be different whether they supply dates as strings or as a datetime-like type.

Putting the above question another way: let's suppose we went ahead and ported that code or imported it. Would that enable you to, as soon as dates are read in as strings, convert them to a datetime-like type? And, if yes, would that then remove the need for separate logic? Purely from a design perspective, to me the optimal solution would be to create an "adapter" (something that takes strings or any of the supported datetime-like types and outputs a datetime object), apply it right away, and then have the remainder of the pipeline ingest/output a single type. But from the technical/implementation perspective, perhaps there are good reasons that would argue against this that I'm still not appreciating.

Hope that makes sense!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that enable you to, as soon as dates are read in as strings, convert them to a datetime-like type?

I'm pretty reluctant to do that :) The beauty of using strings is that we don't need to modify them at all! We can just pass them directly as slice bounds when we call sel to subset a Dataset in time, regardless of whether the time coordinate is indexed by a DatetimeIndex or CFTimeIndex. For the most part that is why the logic is different for them throughout what I've implemented here. If it weren't for _assert_has_data_for_time, strings would be significantly easier to deal with than date objects (because we don't need to worry about their type).

Let me think about things a little more. There may be a solution that doesn't involve porting significant amounts of existing code or using private API.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that makes sense to me. In that case, what about, just within _assert_has_data_for_time, casting the string to one of the datetime objects? Sorry if I'm still being naive here. If not, at least maybe we should add a note in the docs and/or a logging message that string dates don't undergo this check?

Ultimately I trust your judgment on this one and don't want to bog you down on this point. I don't think at all that the current implementation will incur a lot of technical debt (quite the opposite in fact). So let's go with whatever you decide is best at this point.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this is exactly what I had in mind when discussing the options here: #273 (comment).

Sorry, right of course. So that probably is overkill; I think a warning/docs note would be fine.

Would it help you understand where I'm coming from if I provided an explicit example for why casting string inputs as dates is difficult and something that might best be left for upstream libraries?

An example would definitely help, but that might be overkill just for my behalf. On the other hand, it occurred to me: would this functionality be appropriate in cftime? In which case maybe a clean example would be good as part of an Issue on it in that repo.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example would definitely help

For sure! Let me provide some context here.

Ideally we would like to allow users to specify partial-datetime strings as time bounds for their calculations. Partial-datetime strings are a very convenient shorthand; they allow you to select data between the earliest datetime possible for the lower bound, all the way to the greatest datetime possible on the upper bound.

For instance, if you wanted to compute things for data between year 2000 and 2010 (inclusive), you could just specify a date range of ('2000', '2010'). This is in contrast to using explicit datetimes, which would look like (datetime(2000, 1, 1), datetime(2010, 12, 31, 23, 59, 59, 999999)). You can continue refining your interval by filling in more of the bounds. So for instance if you wanted to select data between February, 2000 and February 2010, you could use ('2000-02', '2010-02') and this would map to (datetime(2000, 2, 1), datetime(2010, 2, 28, 23, 59, 59, 999999)).

For this reason, it is not a simple matter of "casting" a string to a particular datetime. The datetime you transform it to matters based on the context. For instance if '2000-02' is the lower bound, then it maps to datetime(2000, 2, 1), while if it is the upper bound it maps to datetime(2000, 2, 29, 23, 59, 59, 999999). pandas and now xarray have logic for doing this for numpy and cftime dates, so it sort of feels heavy-handed to re-implement it in aospy.

So that probably is overkill; I think a warning/docs note would be fine.

Sounds good. I'll mull this over for the next few days. If I can't think of a simple solution, I'll go forward with the warning/docs note.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that makes total sense. Thanks.

I'll mull this over for the next few days. If I can't think of a simple solution, I'll go forward with the warning/docs note.

Sounds good. My only remaining question is re: if this functionality should/could go in cftime eventually. Perhaps Jeremy's comment on this thread on the xarray mailing list is relevant?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only remaining question is re: if this functionality should/could go in cftime eventually. Perhaps Jeremy's comment on this thread on the xarray mailing list is relevant?

I agree with Jeremy and think a cftime version of datetime.strptime would be great; it could help in this case.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. I opened an issue to start that discussion: Unidata/cftime#51

Who knows if/when that will progress though, so let's not wait up for it for this PR.

@@ -472,6 +397,9 @@ def _assert_has_data_for_time(da, start_date, end_date):
AssertionError
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above in this docstring: netCDF4.netcdftime -> cftime (maybe grep for the former throughout the package)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This turns out to be the only place (and I fixed it).

int
"""
if isinstance(date, str):
pattern = '(?P<year>\d{4})'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment to describe in words what this pattern is looking for, and/or an example?

description='Precipitation generated by convective parameterization',
)
precip_largescale = Var(
name='precip_largescale', # name used by aospy
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to retain the 4 spaces indentation on the continuation lines

@spencerkclark spencerkclark changed the title WIP Remove datetime workaround logic Remove datetime workaround logic May 29, 2018
@spencerkclark
Copy link
Collaborator Author

Thanks for the helpful comments @spencerahill! When you get a chance, this should be ready for another review.

@spencerahill
Copy link
Owner

Thanks @spencerkclark for the new commits...everything looks good, but still do want to make sure I understand the design decisions c.f. my in-line comment above before merging.

@spencerkclark
Copy link
Collaborator Author

@spencerahill thanks for opening the issue in cftime! I couldn't think of a simple solution, so I went ahead with the warning and docs note. This should be ready for a final review.

@spencerahill
Copy link
Owner

Awesome. Thanks much @spencerkclark! Great to finally have this in.

@spencerahill spencerahill merged commit e370fda into spencerahill:develop Jun 4, 2018
@spencerkclark spencerkclark deleted the cftimeindex branch June 4, 2018 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants