
add load_dataset methods #301

Merged (1 commit into LinkedEarth:new-objects) on Jan 26, 2023

Conversation

kcpevey
Contributor

@kcpevey kcpevey commented Jan 26, 2023

Adds a simplified load_dataset method. The example data is small and sits in the repo. A metadata.yml file has been added with details on how to load each dataset; to add a new dataset, add its details to the YAML file. Currently load_dataset only accepts CSV, but it's easily extensible.

The YAML file is fairly self-explanatory except for a few details: pyleo_kwargs are extra kwargs passed on to the pyleoclim.Series constructor, and pandas_kwargs are extra kwargs passed on to the pandas CSV reader. Lastly, time_column and value_column indicate where to find the time and value columns in the data. I've set it up so that you can specify either an int (which pulls the column using .iloc) or a str (which pulls it using df[str]).
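The int-or-str column dispatch can be sketched roughly as follows. This is a minimal stand-alone illustration using the stdlib csv module; `_select_column` and the sample data are hypothetical, not the PR's actual code (the real loader works on a pandas DataFrame, using .iloc for ints and df[str] for names):

```python
import csv
import io

def _select_column(rows, header, column):
    """Pull a column by position (int) or by name (str)."""
    name = header[column] if isinstance(column, int) else column
    return [row[name] for row in rows]

# Tiny made-up CSV standing in for a dataset file in the repo
SAMPLE = "t,nino\n1871.00,-0.25\n1871.08,-0.10\n"
reader = csv.DictReader(io.StringIO(SAMPLE))
header = reader.fieldnames          # ['t', 'nino']
rows = list(reader)

time = _select_column(rows, header, 0)        # int -> positional lookup
value = _select_column(rows, header, "nino")  # str -> lookup by name
```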

There are helper functions to load the metadata for individual datasets and to discover what datasets are available.

TODO:

  • add additional datasets
  • switch over the other tests to use the local data (and this loader) instead of pulling from the internet.

Resolves: LinkedEarth/paleoPandas#21

@khider khider marked this pull request as ready for review January 26, 2023 19:31
@khider
Member

khider commented Jan 26, 2023

Since we're enabling exports to .csv (in something compatible with Series) and .json, would it also make sense to have examples of Pyleoclim ready series in there as well, including a csv/json file that would have the metadata already contained?

@kcpevey
Contributor Author

kcpevey commented Jan 26, 2023

Yes, I think so. Can you provide an example file and a code snippet on how it would be loaded?

@CommonClimate
Collaborator

Hi @kcpevey thanks for this! I'll look at it later today. The to_csv() and to_json() methods are brand new. I'm working on the from_csv() counterpart (stubbed out last night; hope to finish it today). @khider are you planning a from_json() method for symmetry or is it not needed? In any case, we need to provide notebook examples. I'll make a notebook with the EPICA Dome C CO2 record read from and written to csv. Can you make an equivalent notebook for a JSON dataset?

@CommonClimate CommonClimate merged commit 6f1d0f5 into LinkedEarth:new-objects Jan 26, 2023
@CommonClimate
Collaborator

In any case, let's merge this now and we can always add to it later.

@khider
Member

khider commented Jan 26, 2023

JSON has a low-level function that already does that, json_to_PyleoObj(filename, objname), so we can just wrap around it.
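A thin wrapper along those lines might look like this. This is a hedged sketch: the stand-in json_to_PyleoObj below just deserializes the JSON file, whereas the real pyleoclim helper reconstructs a Pyleoclim object of the requested type, and the from_json signature is assumed for illustration:

```python
import json

def json_to_PyleoObj(filename, objname):
    # Stand-in for the existing low-level function; the real one
    # rebuilds a Pyleoclim object of type `objname` from the file.
    with open(filename) as f:
        return json.load(f)

def from_json(filename, objname="Series"):
    """Symmetric counterpart to from_csv(): delegate to the low-level loader."""
    return json_to_PyleoObj(filename, objname)
```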

@kcpevey I'll send a json file example through Slack.

@kcpevey
Contributor Author

kcpevey commented Jan 30, 2023

Is the work on this complete? It got merged but it looks like there is some follow on work?

@CommonClimate
Collaborator

Yes, it is incomplete. Sorry I thought you wanted things merged ASAP.
What's missing:

  • only 2 datasets currently load (NINO3, SOI)
  • some of the metadata need to be adjusted (e.g. for SOI, time_name should be 'time' and time_unit should be 'Year CE')

Can you implement the other datasets on the list, and what is the best way to give you feedback on those?

Finally, we need examples of usage somewhere. Will users always have to do something like ts = pyleo.utils.datasets.load_dataset('dataset_name'), or can we enable a shortcut like ts = pyleo.load_dataset('dataset_name')?
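For what it's worth, the shortcut would amount to a one-line re-export (something like `from .utils.datasets import load_dataset` in pyleoclim's top-level __init__.py). A toy demonstration of the aliasing, with placeholder names standing in for the real modules:

```python
import types

# 'utils_datasets' stands in for pyleo.utils.datasets; the lambda is a
# placeholder for the real load_dataset implemented in this PR.
utils_datasets = types.SimpleNamespace(
    load_dataset=lambda name: f"<Series for {name}>"
)

# Equivalent of `from .utils.datasets import load_dataset` in __init__.py,
# after which users can call the loader from the package top level:
load_dataset = utils_datasets.load_dataset
```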

@kcpevey
Contributor Author

kcpevey commented Jan 30, 2023

I was just hoping for some feedback to see if you were happy with the approach I was taking before I spent too much time implementing the other datasets. I will add the others in a separate PR since this one is merged.

A brief example of usage is found in the test suite here. I can add something to your docs as well.

I can also look at making a shortcut.

@CommonClimate
Collaborator

Hi Kim, it would be good to put usage examples in a notebook or documentation, because the pytest doesn't exactly read like a novel ;-)

Re: approach, it is a bit more elaborate than what I was imagining, so I was curious why you chose to specify the metadata in a yml file instead of just defining the Series as in here, for instance. I do like the idea of specifying either the column number or the column name, so that is a nifty feature! Note that, for SOI and NINO3, I had to modify the metadata.yml file to make the unit conversions and display work properly:

```yaml
soi:
  filename: 'soi_data'
  file_extension: 'csv'
  pandas_kwargs:
    skiprows: 0
    header: 1
  pyleo_kwargs:
    time_name: 'time'
    time_unit: 'year C.E.'
    value_name: 'SOI'
    value_unit: 'mb'
    label: 'Southern Oscillation Index'
  time_column: 1
  value_column: 2

nino3:
  filename: 'wtc_test_data_nino_even'
  file_extension: 'csv'
  pandas_kwargs:
    header: 0
  pyleo_kwargs:
    time_name: 'time'
    time_unit: 'year C.E.'
    value_name: 'NINO3'
    value_unit: '$^{\circ}$C'
  time_column: t
  value_column: nino
```

Overall, your strategy works well and is very scalable, so I recommend going forward with the other datasets on the list. We might want to add ODP846 and EPICA Dome C delta D. To see how metadata should be specified for the latter, see this notebook; for ODP846, see this notebook.

@kcpevey
Contributor Author

kcpevey commented Feb 2, 2023

@CommonClimate I chose the metadata.yml approach because it keeps all the loading information in a separate place (with the data) rather than in the code. I could get rid of the metadata YAML and add an if statement for each dataset within the load_dataset function itself. That approach also has merit, but it could lead to a very bloated load_dataset function if you have a bunch of examples. I'm happy to go either direction, which is why I only put two datasets in and was hoping for feedback before moving forward. Which way would you prefer?

@khider
Member

khider commented Feb 2, 2023

Let's keep the .yml file. It will be easier to re-edit later on.

@CommonClimate
Collaborator

I agree, let's keep metadata.yml.

@CommonClimate
Collaborator

Question on this: I'd like to write a unit test for the to_csv()/from_csv() round-trip that loads available datasets and compares them, to make sure we are getting back what we put in for each of them. Would I be able to do so with a statement like:
@pytest.mark.parametrize('dataset', pyleo.utils.datasets.available_dataset_names())
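Incidentally, parametrize() happily accepts the return value of a function call at collection time, so that pattern should work. A minimal self-contained demonstration, with placeholder names standing in for the pyleoclim API:

```python
import pytest

def available_dataset_names():
    # Placeholder for pyleo.utils.datasets.available_dataset_names()
    return ["soi", "nino3"]

@pytest.mark.parametrize("dataset", available_dataset_names())
def test_csv_roundtrip(dataset):
    # In pyleoclim this body would load the dataset, write it out with
    # to_csv(), read it back with from_csv(), and compare the two Series.
    assert dataset in {"soi", "nino3"}
```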

@CommonClimate
Collaborator

By the way, we could re-use from_csv() within load_dataset(), if it makes sense.
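That refactor could look roughly like this. Everything below is a stand-in sketch: from_csv's real signature is still settling, so the one assumed here is illustrative, and the metadata handling is simplified to a plain dict as parsed from metadata.yml:

```python
def from_csv(path, **pyleo_kwargs):
    # Stand-in for the new Series.from_csv(); returns a plain dict here
    # where the real method would return a pyleoclim Series.
    return {"path": path, **pyleo_kwargs}

def load_dataset(name, metadata):
    """Look up the dataset's metadata.yml entry and delegate to from_csv()."""
    meta = metadata[name]
    path = f"{meta['filename']}.{meta['file_extension']}"
    return from_csv(path, **meta.get("pyleo_kwargs", {}))

# Toy metadata mirroring the structure of metadata.yml above
META = {
    "soi": {
        "filename": "soi_data",
        "file_extension": "csv",
        "pyleo_kwargs": {"value_name": "SOI"},
    }
}
```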

@kcpevey
Contributor Author

kcpevey commented Feb 3, 2023

> Question on this: I'd like to write a unit test for the to_csv()/from_csv() round-trip that loads available datasets and compares them, to make sure we are getting back what we put in for each of them. Would I be able to do so with a statement like:
> @pytest.mark.parametrize('dataset', pyleo.utils.datasets.available_dataset_names())

Yes, that will work.

> By the way, we could re-use from_csv() within load_dataset(), if it makes sense.

That seems like it might be more efficient. I was just going from an example I found in a notebook and didn't realize it was an option.

@CommonClimate
Collaborator

No fault of yours! That feature didn't exist until a few commits ago. I would welcome feedback on how from_csv() is implemented, however, as it is a little kludgy.
