add load_dataset methods #301
Conversation
Since we're enabling exports to .csv (in something compatible with Series) and .json, would it also make sense to have examples of Pyleoclim-ready series in there as well, including a csv/json file that would have the metadata already contained?
Yes, I think so. Can you provide an example file and a code snippet on how it would be loaded?
Hi @kcpevey, thanks for this! I'll look at it later today. The
In any case, let's merge this now and we can always add to it later. |
json has a low-level function that already does that: `Pyleoclim_util/pyleoclim/utils/jsonutils.py`, line 187 (commit 683fee9).
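The pattern being discussed, bundling data and metadata in one JSON document so a load helper can rebuild the series, can be sketched with the stdlib `json` module alone. The field names below are illustrative, not pyleoclim's actual schema:

```python
import json

# Hypothetical layout: metadata and data live side by side in one JSON
# document, so serializing and loading is a lossless round-trip.
series = {
    "metadata": {"time_unit": "years CE", "value_name": "SOI", "value_unit": "mb"},
    "time": [1951, 1952, 1953],
    "value": [1.5, -0.8, 0.2],
}

text = json.dumps(series)      # what an export helper would produce
restored = json.loads(text)    # what a load helper would consume

assert restored == series      # nothing is lost in the round-trip
```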
@kcpevey I'll send a json file example through Slack.
Is the work on this complete? It got merged, but it looks like there is some follow-on work?
Yes, it is incomplete. Sorry, I thought you wanted things merged ASAP.
Can you implement the other datasets on the list, and what is the best way to give you feedback on those? Finally, we need examples of usage somewhere. Will users always have to do something like
I was just hoping for some feedback to see if you were happy with the approach I was taking before I spent too much time implementing the other datasets. I will add the others in a separate PR since this one is merged. A brief example of usage is found in the test suite here. I can add something to your docs as well. I can also look at making a shortcut.
Hi Kim, it would be good to put usage examples in a notebook or the documentation, because the pytest doesn't exactly read like a novel ;-) Re: approach, it is a bit more elaborate than what I was imagining, so I was curious why you chose to specify the metadata in a yml file instead of just defining the Series as in here, for instance. I do like the idea of specifying either the column number or the column name, so that is a nifty feature! Note that, for SOI and NINO3, I had to modify the metadata.yml file to make the unit conversions and display work properly:
Overall, your strategy works well and is very scalable, so I recommend going forward with the other datasets in the list. We might want to add ODP846 and EPICA Dome C delta D. To see how metadata should be specified for the latter, see this notebook. For ODP846, see this notebook.
@CommonClimate I chose the
Let's keep the
I agree, let's keep `metadata.yml`.
Question on this: I'd like to write a unit test for the `to_csv()`/`from_csv()` round-trip that loads available datasets and compares them, to make sure we are getting back what we put in for each of them. Would I be able to do so with a statement like:
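The round-trip check being proposed can be sketched with the stdlib `csv` module. In the real test this would iterate over the available datasets and use the actual `to_csv()`/`from_csv()` methods; here plain CSV writing and reading stands in for the idea:

```python
import csv
import io

# Hypothetical round-trip: write time/value columns to CSV, read them
# back, and check that we get back exactly what we put in.
time = [1.0, 2.0, 3.0]
value = [10.5, 11.0, 9.8]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["time", "value"])
writer.writerows(zip(time, value))

buf.seek(0)
rows = list(csv.DictReader(buf))
time_back = [float(r["time"]) for r in rows]
value_back = [float(r["value"]) for r in rows]

assert time_back == time and value_back == value  # lossless round-trip
```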
By the way, we could re-use
Yes, that will work.
That seems like it might be more efficient. I was just going from an example I found in a notebook and didn't realize it was an option.
No fault of ours! That feature didn't exist until a few commits ago! I would welcome feedback on how
@CommonClimate you can just replace this chunk of code with your
Adds a simplified `load_dataset` method. The example data is small and sits in the repo. A `metadata.yml` file has been added with details on how to load each dataset. To add a new dataset, add its details to the yaml file. Currently the `load_dataset` method only accepts csv, but it's easily extendible.

The yaml file is fairly self-explanatory except for a few details: `paleo_kwargs` are extra kwargs that are passed on to the `pyleoclim.Series` constructor, and `pandas_kwargs` are extra kwargs that are passed on to the `pd.DataFrame` constructor. Lastly, the two variables `time_column` and `value_column` indicate where in the data to find the time and value columns. I've set it up so that you can specify an int (which will pull the column using `.iloc`) or a str (which will pull using `df[str]`).

There are helper functions to load the metadata for individual datasets and to discover which datasets are available.
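The int-or-str column rule described above can be sketched as follows. The `select_column` helper and the metadata dict are illustrative, not the PR's actual code or the repo's actual `metadata.yml` contents:

```python
import pandas as pd

def select_column(df: pd.DataFrame, col):
    """Pull a column by position (int, via .iloc) or by name (str)."""
    if isinstance(col, int):
        return df.iloc[:, col]
    return df[col]

# Hypothetical metadata entry: time by position, value by name.
meta = {"time_column": 0, "value_column": "soi"}

df = pd.DataFrame({"year": [1951, 1952], "soi": [1.5, -0.8]})
time = select_column(df, meta["time_column"])    # positional: "year" column
value = select_column(df, meta["value_column"])  # by name: "soi" column

assert list(time) == [1951, 1952]
assert list(value) == [1.5, -0.8]
```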
TODO:
Resolves: LinkedEarth/paleoPandas#21