
add load_dataset methods #301

Merged (1 commit into LinkedEarth:new-objects) on Jan 26, 2023

Conversation

kcpevey
Contributor

@kcpevey kcpevey commented Jan 26, 2023

Adds a simplified load_dataset method. The example data is small and sits in the repo. A metadata.yml file has been added with details on how to load each dataset; to add a new dataset, add its details to the YAML file. Currently load_dataset only accepts CSV, but it's easily extensible.

The YAML file is fairly self-explanatory except for a few details: pyleo_kwargs are extra kwargs passed on to the pyleoclim.Series constructor, and pandas_kwargs are extra kwargs passed on to the pandas CSV reader. Lastly, time_column and value_column indicate where to find the time and value columns in the data. I've set it up so that you can specify either an int (which pulls the column using .iloc) or a str (which pulls it using df[str]).
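The int-or-str column dispatch can be sketched roughly as follows. This is a minimal stand-alone illustration using the stdlib csv module; `_select_column` and the sample data are hypothetical, not the PR's actual code (the real loader works on a pandas DataFrame, using .iloc for ints and df[str] for names):

```python
import csv
import io

def _select_column(rows, header, column):
    """Pull a column by position (int) or by name (str)."""
    name = header[column] if isinstance(column, int) else column
    return [row[name] for row in rows]

# Tiny made-up CSV standing in for a dataset file in the repo
SAMPLE = "t,nino\n1871.00,-0.25\n1871.08,-0.10\n"
reader = csv.DictReader(io.StringIO(SAMPLE))
header = reader.fieldnames          # ['t', 'nino']
rows = list(reader)

time = _select_column(rows, header, 0)        # int -> positional lookup
value = _select_column(rows, header, "nino")  # str -> lookup by name
```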

There are helper functions to load the metadata for individual datasets and to discover what datasets are available.

TODO:

  • add additional datasets
  • switch over the other tests to use the local data (and this loader) instead of pulling from the internet.

Resolves: LinkedEarth/paleoPandas#21

@khider khider marked this pull request as ready for review January 26, 2023 19:31
@khider
Member

khider commented Jan 26, 2023

Since we're enabling exports to .csv (in something compatible with Series) and .json, would it also make sense to have examples of Pyleoclim ready series in there as well, including a csv/json file that would have the metadata already contained?

@kcpevey
Contributor Author

kcpevey commented Jan 26, 2023

Yes, I think so. Can you provide an example file and a code snippet on how it would be loaded?

@CommonClimate
Collaborator

Hi @kcpevey thanks for this! I'll look at it later today. The to_csv() and to_json() methods are brand new. I'm working on the from_csv() counterpart (stubbed out last night; hope to finish it today). @khider are you planning a from_json() method for symmetry or is it not needed? In any case, we need to provide notebook examples. I'll make a notebook with the EPICA Dome C CO2 record read from and written to csv. Can you make an equivalent notebook for a JSON dataset?

@CommonClimate CommonClimate merged commit 6f1d0f5 into LinkedEarth:new-objects Jan 26, 2023
@CommonClimate
Collaborator

In any case, let's merge this now and we can always add to it later.

@khider
Member

khider commented Jan 26, 2023

JSON has a low-level function that already does that, json_to_PyleoObj(filename, objname), so we can just wrap around it.
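A thin wrapper along those lines might look like this. This is a hedged sketch: the stand-in json_to_PyleoObj below just deserializes the JSON file, whereas the real pyleoclim helper reconstructs a Pyleoclim object of the requested type, and the from_json signature is assumed for illustration:

```python
import json

def json_to_PyleoObj(filename, objname):
    # Stand-in for the existing low-level function; the real one
    # rebuilds a Pyleoclim object of type `objname` from the file.
    with open(filename) as f:
        return json.load(f)

def from_json(filename, objname="Series"):
    """Symmetric counterpart to from_csv(): delegate to the low-level loader."""
    return json_to_PyleoObj(filename, objname)
```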

@kcpevey I'll send a json file example through Slack.

@kcpevey
Contributor Author

kcpevey commented Jan 30, 2023

Is the work on this complete? It got merged but it looks like there is some follow on work?

@CommonClimate
Collaborator

Yes, it is incomplete. Sorry I thought you wanted things merged ASAP.
What's missing:

  • only 2 datasets currently load (NINO3, SOI)
  • some of the metadata need to be adjusted (e.g. for SOI, time_name should be 'time' and time_unit should be 'Year CE')

Can you implement the other datasets on the list, and what is the best way to give you feedback on those?

Finally, we need examples of usage somewhere. Will users always have to do something like ts = pyleo.utils.datasets.load_dataset('dataset_name'), or can we enable a shortcut like ts = pyleo.load_dataset('dataset_name')?
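For what it's worth, the shortcut would amount to a one-line re-export (something like `from .utils.datasets import load_dataset` in pyleoclim's top-level __init__.py). A toy demonstration of the aliasing, with placeholder names standing in for the real modules:

```python
import types

# 'utils_datasets' stands in for pyleo.utils.datasets; the lambda is a
# placeholder for the real load_dataset implemented in this PR.
utils_datasets = types.SimpleNamespace(
    load_dataset=lambda name: f"<Series for {name}>"
)

# Equivalent of `from .utils.datasets import load_dataset` in __init__.py,
# after which users can call the loader from the package top level:
load_dataset = utils_datasets.load_dataset
```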

@kcpevey
Contributor Author

kcpevey commented Jan 30, 2023

I was just hoping for some feedback to see if you were happy with the approach I was taking before I spent too much time implementing the other datasets. I will add the others in a separate PR since this one is merged.

A brief example of usage is found in the test suite here. I can add something to your docs as well.

I can also look at making a shortcut.

@CommonClimate
Collaborator

Hi Kim, it would be good to put usage examples in a notebook or documentation, because the pytest doesn't exactly read like a novel ;-)

Re: approach, it is a bit more elaborate than what I was imagining, so I was curious why you chose to specify the metadata in a yml file instead of just defining the Series as in here, for instance. I do like the idea of specifying either the column number or the column name, so that is a nifty feature! Note that, for SOI and NINO3, I had to modify the metadata.yml file to make the unit conversions and display work properly:

```yaml
soi:
  filename: 'soi_data'
  file_extension: 'csv'
  pandas_kwargs:
    skiprows: 0
    header: 1
  pyleo_kwargs:
    time_name: 'time'
    time_unit: 'year C.E.'
    value_name: 'SOI'
    value_unit: 'mb'
    label: 'Southern Oscillation Index'
  time_column: 1
  value_column: 2

nino3:
  filename: 'wtc_test_data_nino_even'
  file_extension: 'csv'
  pandas_kwargs:
    header: 0
  pyleo_kwargs:
    time_name: 'time'
    time_unit: 'year C.E.'
    value_name: 'NINO3'
    value_unit: '$^{\circ}$C'
  time_column: t
  value_column: nino
```

Overall, your strategy works well and is very scalable, so I recommend going forward with the other datasets on the list. We might want to add ODP846 and EPICA Dome C delta D. To see how metadata should be specified for the latter, see this notebook; for ODP846, see this notebook.

@kcpevey
Contributor Author

kcpevey commented Feb 2, 2023

@CommonClimate I chose the metadata.yml approach because it keeps all the loading information in a separate place (with the data) rather than in the code. I could get rid of the metadata YAML and add an if statement for each dataset within the load_dataset function itself. That approach also has merit, but it could lead to a very bloated load_dataset function if you have a bunch of examples. I'm happy to go either direction, which is why I only put two datasets in and was hoping for feedback before moving forward. Which way would you prefer?

@khider
Member

khider commented Feb 2, 2023

Let's keep the .yml file. It will be easier to re-edit later on.

@CommonClimate
Collaborator

I agree, let's keep metadata.yml.

@CommonClimate
Collaborator

Question on this: I'd like to write a unit test for the to_csv()/from_csv() round-trip that loads available datasets and compares them, to make sure we are getting back what we put in for each of them. Would I be able to do so with a statement like:
@pytest.mark.parametrize('dataset', pyleo.utils.datasets.available_dataset_names())
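Incidentally, parametrize() happily accepts the return value of a function call at collection time, so that pattern should work. A minimal self-contained demonstration, with placeholder names standing in for the pyleoclim API:

```python
import pytest

def available_dataset_names():
    # Placeholder for pyleo.utils.datasets.available_dataset_names()
    return ["soi", "nino3"]

@pytest.mark.parametrize("dataset", available_dataset_names())
def test_csv_roundtrip(dataset):
    # In pyleoclim this body would load the dataset, write it out with
    # to_csv(), read it back with from_csv(), and compare the two Series.
    assert dataset in {"soi", "nino3"}
```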

@CommonClimate
Collaborator

By the way, we could re-use from_csv() within load_dataset(), if it makes sense.
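That refactor could look roughly like this. Everything below is a stand-in sketch: from_csv's real signature is still settling, so the one assumed here is illustrative, and the metadata handling is simplified to a plain dict as parsed from metadata.yml:

```python
def from_csv(path, **pyleo_kwargs):
    # Stand-in for the new Series.from_csv(); returns a plain dict here
    # where the real method would return a pyleoclim Series.
    return {"path": path, **pyleo_kwargs}

def load_dataset(name, metadata):
    """Look up the dataset's metadata.yml entry and delegate to from_csv()."""
    meta = metadata[name]
    path = f"{meta['filename']}.{meta['file_extension']}"
    return from_csv(path, **meta.get("pyleo_kwargs", {}))

# Toy metadata mirroring the structure of metadata.yml above
META = {
    "soi": {
        "filename": "soi_data",
        "file_extension": "csv",
        "pyleo_kwargs": {"value_name": "SOI"},
    }
}
```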

@kcpevey
Contributor Author

kcpevey commented Feb 3, 2023

> Question on this: I'd like to write a unit test for the to_csv()/from_csv() round-trip that loads available datasets and compares them, to make sure we are getting back what we put in for each of them. Would I be able to do so with a statement like:
> @pytest.mark.parametrize('dataset', pyleo.utils.datasets.available_dataset_names())

Yes, that will work.

> By the way, we could re-use from_csv() within load_dataset(), if it makes sense.

That seems like it might be more efficient. I was just going from an example I found in a notebook and didn't realize it was an option.

@CommonClimate
Collaborator

No fault of yours! That feature didn't exist until a few commits ago. I would welcome feedback on how from_csv() is implemented, however, as it is a little kludgy.
