Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP stub out pandas roundtrip #289

Merged
merged 4 commits into from
Jan 11, 2023

Conversation

MarcoGorelli
Copy link

@MarcoGorelli MarcoGorelli commented Jan 11, 2023

As discussed with @CommonClimate and @khider

The functions that still need populating are:

  • time_unit_to_datum_exp_dir
  • datum_exp_dir_to_time_unit
  • convert_datetime_index_to_time

I think this should be possible using the conversion rules defined by @CommonClimate :

Conversion rules are given here in pseudocode:
If time_unit contains ‘ky’, ‘kyr’, ‘kyrs’, ‘kiloyear’, or ‘ka’, then:
Exponent = 3, datum = “1950" (a string, to be converted internally to the datetime index corresponding to Jan 1, 1950 AD), direction = retrograde
If time_unit contains ‘y’, ‘yr’, ‘yrs’, ‘year’, ‘year CE’ ‘years CE’ or ‘ year(s) AD’, then:
Exponent = 0, datum = “0” , (or the datetime index corresponding to year zero), direction = prograde.
If time_unit == “BP”’, then:
datum = “1950” , direction = retrograde; ask for user input on exponent
If time_unit contains “yr BP” ,“yrs BP” or “years BP”, then:
datum = “1950" , direction = retrograde; exponent = 0
If time_unit contains ‘Ma’ or ‘My’, direction =” retrograde, datum = “1950”, exponent = 6
If time_unit contains ‘Ga’ or ‘Gy’, direction = retrograde, datum = “1950", exponent = 3
If lower(time_name) contains “age”, direction = retrograde, datum = “1950", ask for user input on exponent unless it can be guessed from time_unit as above (e.g. “kyr” is also in the string).
If lower(time_name) contains “year”
Exponent = 0, datum = 0 , (or the datetime index corresponding to year zero), direction = prograde.

@MarcoGorelli MarcoGorelli changed the title pandas roundtrip stub out pandas roundtrip Jan 11, 2023
@MarcoGorelli MarcoGorelli changed the title stub out pandas roundtrip WIP stub out pandas roundtrip Jan 11, 2023
@MarcoGorelli MarcoGorelli marked this pull request as draft January 11, 2023 11:00
@CommonClimate
Copy link
Collaborator

Hi @MarcoGorelli that was fast! However, could you please make this pull request into the Development branch instead? We won't be tinkering around with this in Master for a while.

@MarcoGorelli MarcoGorelli changed the base branch from master to Development January 11, 2023 20:27
@MarcoGorelli
Copy link
Author

sure, have rebased!

@khider
Copy link
Member

khider commented Jan 11, 2023

Thanks! Waiting for Pytest to complete and I'll merge. Question: are there tests for these new functionalities?

@CommonClimate
Copy link
Collaborator

@MarcoGorelli one issue is that Pyleoclim currently uses pandas 1.5.2, so I'm not sure if the non-ns datetime stuff will work yet. Do you need us to up the required version of pandas?

@MarcoGorelli
Copy link
Author

Might be a bit early to merge, I was hoping to get feedback first on whether this is what you wanted - in particular on the time_unit_to_datum_exp_dir function (where you're the domain experts)

A question I have is that it's currently possible to initialise a pyleoclim.Series with passing time_unit. In this case, what should .datetime_index return - should it raise NotImplementedError?

Regarding tests - I haven't added them just yet, I just wanted to make sure we were aligned on requirements / functionality first

And re-pandas version - yeah you'll require the nightly build for this to work. Or just wait until version 2.0.0 to come out (should be some time in February), and in the meantime this could just live in some separate branch whilst it's being refined

@khider
Copy link
Member

khider commented Jan 11, 2023

Oh, then I needs to go into the new-objects branch, which is where we're working on retooling everything and working on Pandas compatibility. This branch can have changes regarding its requirements (i.e., Pandas) or a new environment file so we get the nightly updates without affecting any of our other in-progress work.

@MarcoGorelli MarcoGorelli changed the base branch from Development to new-objects January 11, 2023 21:05
@MarcoGorelli MarcoGorelli marked this pull request as ready for review January 11, 2023 21:05
@MarcoGorelli
Copy link
Author

ah, nice - sure, done!

@khider khider merged commit bed3d06 into LinkedEarth:new-objects Jan 11, 2023
@MarcoGorelli
Copy link
Author

Hi @khider - just for my understanding, was it your intention to merge this so you can try it out, and then (once you've tried it out) to provide feedback?

@khider
Copy link
Member

khider commented Jan 11, 2023

Yes. I probably will work on this on Friday. I'm traveling tomorrow. But maybe @CommonClimate can give it a try too?

@CommonClimate
Copy link
Collaborator

I won't have time before Friday, but I can play with that after my class is done at 10

@CommonClimate
Copy link
Collaborator

@MarcoGorelli a couple of thoughts occurred to me yesterday:

  • first, @nickmckay pointed out that if time_unit contains ‘Ga’ or ‘Gy’, direction = retrograde, datum = “1950", exponent = 9 (not 3 as I initially wrote ; that was a typo due to mindless cut&paste).
  • along those lines, the generation of the DateTimeIndex needs to be cognizant of the exponent. Right now, I see that PaleoDtype always uses np.dtype("M8[s]"), for which I'm not finding any documentation. It's possible that "s" is the most appropriate interval in 64 bits, but knowledge of the exponent would let you know what you care about. Basically, when you're dealing with a Giga-annum (1 Ga = 10^9 years), you couldn't care less about anything less than a year (arguably, 1000 years). Perhaps exponent could be used in the generation of the DateTimeIndex to reflect this changing resolution? This may be a non-issue, but I thought I would raise it in case there are benefits to changing the time resolution as a function of the units.

@MarcoGorelli
Copy link
Author

MarcoGorelli commented Jan 12, 2023

pandas only supports resolutions 's', 'ms', 'us', and 'ns' - 's' should be enough for any date you need to represent

Regarding getting the datum / exponent / direction from time unit - sure, do you want to open a PR to fix that? It should just be a matter of changing the rules in

def time_unit_to_datum_exp_dir(time_unit):
if time_unit == 'years':
datum = '0'
exponent = 0
direction = 'prograde'
elif time_unit in ('ky', 'kyr', 'kyrs', 'kiloyear', 'ka'):
datum = '1950'
exponent = 3
direction = 'retrograde'
elif time_unit in ('y', 'yr', 'yrs', 'year', 'year CE', 'years CE', 'year(s) AD'):
datum = '0'
exponent = 0
direction = 'prograde'
elif time_unit == 'BP':
datum = '1950'
exponent = 0
direction = 'retrograde'
elif time_unit in ('yr BP', 'yrs BP', 'years BP'):
datum ='1950'
exponent = 0
direction = 'retrograde'
elif time_unit in ('Ma', 'My'):
datum ='1950'
exponent = 6
direction = 'retrograde'
elif time_unit in ('Ga', 'Gy'):
datum ='1950'
exponent = 3
direction = 'retrograde'
else:
raise ValueError(f'Time unit {time_unit} not supported')
return (datum, exponent, direction)

@MarcoGorelli
Copy link
Author

PaleoDtype always uses np.dtype("M8[s]")

Just to precise with terminology here - PaleoDtype was a previous suggestion I'd made for when I thought you wanted an Extension Array. We've gone back on that, in favour of to_pandas / from_pandas / datetime_index / pandas_method on a pyleoclim.Series

@CommonClimate
Copy link
Collaborator

OK. So, to confirm, we're just using plain pandas now, not the paleopandas package you guys put together? Is that on the chopping block now?

Re: Pyleoclim, I'm going through the code now to try and understand what you did. Please bear with me lowly domain scientist with no CS background.

@property
    def datetime_index(self):
        datum, exponent, direction = tsutils.time_unit_to_datum_exp_dir(self.time_unit)
        if direction == 'prograde':
            op = operator.add
        elif direction == 'retrograde':
            op = operator.sub
        else:
            raise ValueError(f'Expected one of {"prograde", "retrograde"}, got {direction}')

        timedelta = self.time * 10**exponent
        years = timedelta.astype('int')
        seconds = ((timedelta - timedelta.astype('int')) * tsutils.SECONDS_PER_YEAR).astype('timedelta64[s]')
        
        np_times = op(op(int(datum), years).astype(str).astype('datetime64[s]'), seconds)
        return pd.DatetimeIndex(np_times, name=self.time_name)

Why define this as a property of the Series class rather than a method? Because it takes no argument? What if it used exponent as discussed above?

Also, you mention datum_exp_dir_to_time_unit in this thread but I don't see it in tsutils.py ; still needs to be built out, correct?

@MarcoGorelli
Copy link
Author

Yes, that's right, if this is more aligned with what you want then let's axe paleopandas

Why define this as a property of the Series class rather than a method? Because it takes no argument? What if it used exponent as discussed above?

It doesn't have to be a property, it would just be analogous to how you call .index on a pandas Series with no arguments.
exponent is calculated in the first line of the function from self.time_unit, so it doesn't need to be an argument

Also, you mention datum_exp_dir_to_time_unit in this thread but I don't see it in tsutils.py ; still needs to be built out, correct?

It's there, and was added in this PR:

def time_unit_to_datum_exp_dir(time_unit):
if time_unit == 'years':
datum = '0'
exponent = 0
direction = 'prograde'
elif time_unit in ('ky', 'kyr', 'kyrs', 'kiloyear', 'ka'):
datum = '1950'
exponent = 3
direction = 'retrograde'
elif time_unit in ('y', 'yr', 'yrs', 'year', 'year CE', 'years CE', 'year(s) AD'):
datum = '0'
exponent = 0
direction = 'prograde'
elif time_unit == 'BP':
datum = '1950'
exponent = 0
direction = 'retrograde'
elif time_unit in ('yr BP', 'yrs BP', 'years BP'):
datum ='1950'
exponent = 0
direction = 'retrograde'
elif time_unit in ('Ma', 'My'):
datum ='1950'
exponent = 6
direction = 'retrograde'
elif time_unit in ('Ga', 'Gy'):
datum ='1950'
exponent = 9
direction = 'retrograde'
else:
raise ValueError(f'Time unit {time_unit} not supported')
return (datum, exponent, direction)

Perhaps you need to do a git pull?

@CommonClimate
Copy link
Collaborator

I'm talking about the reverse function: generating time_unit from the 3 parameters. Though I'm not sure in which use case that would apply.

@MarcoGorelli
Copy link
Author

ah - I haven't written that one, and don't think it would be necessary (indeed, I don't think the mapping would be 1-1?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants