
load_dataset() method #21

Closed
CommonClimate opened this issue Jan 21, 2023 · 19 comments
Labels
priority_high high priority issue

Comments

@CommonClimate
Contributor

I’d like to have Pyleoclim ship with a few key datasets that we can easily load for testing and docs, the way it’s done in Seaborn:

tips = sns.load_dataset("tips")

The datasets would be a subset of what is currently in the data directory of this repository.
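
For concreteness, the envisioned Pyleoclim equivalent might look like the sketch below (the call path and behavior are assumptions for illustration, not an existing API at this point):

import pyleoclim as pyleo

# hypothetical: fetch a bundled dataset by name, analogous to sns.load_dataset
soi = pyleo.utils.load_dataset('SOI')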

@CommonClimate CommonClimate added the priority_high high priority issue label Jan 21, 2023
@CommonClimate
Contributor Author

Considerations:

  • which datasets to include? SOI, LR04, and ... ?
  • save in original format (csv, lipd) or as JSON objects?

@kcpevey
Collaborator

kcpevey commented Jan 23, 2023

"save in original format (csv, lipd) or as JSON objects?"

Since the files are stored in git, plain-text (ASCII) formats are preferred, because git can produce a sensible diff for them. A minor change to a csv or JSON file shows up as a small diff in the "eyes" of git, whereas any change to a binary file rewrites the entire blob and just adds bulk to the git history.

That said, you probably need to test against each of the file types that the package can read from.

One way that I've done this in the past is to add a data folder in the package and put something like this in the package __init__.py:

from pathlib import Path

PACKAGE_DIR = Path(__file__).parent.resolve()
DATA_DIR = PACKAGE_DIR.joinpath("data").resolve()

Now you can import the path to the data from anywhere like this:

import pandas as pd
from pyleoclim_util import DATA_DIR

df = pd.read_csv(DATA_DIR.joinpath('soi_data.csv'))

It looks like you already have include_package_data=True in your setup.py, so those files will be included by default. There are alternatives as well: you can keep the data folder outside of the package, in which case setup.py would need a data_files or package_data entry to point to them.
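
For illustration, a package_data entry for files shipped inside the package might look like this sketch (the package name and glob patterns are placeholders, not the actual Pyleoclim configuration):

from setuptools import setup, find_packages

setup(
    name='pyleoclim_util',
    packages=find_packages(),
    # bundle the csv/json sample files that live under <package>/data/
    package_data={'pyleoclim_util': ['data/*.csv', 'data/*.json']},
)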

You'll also need to add a helper function to link between dataset name and the path to the data. I assume you'll also be loading the data inside this function.
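
A minimal sketch of such a helper, assuming a hypothetical registry of dataset names (the names and file paths below are illustrative only):

import pandas as pd
from pyleoclim_util import DATA_DIR

# hypothetical mapping from dataset label to bundled file
_DATASETS = {
    'SOI': 'soi_data.csv',
    'LR04': 'lr04.csv',
}

def load_dataset(name):
    # resolve the label to a path inside the package, then load it
    if name not in _DATASETS:
        raise ValueError(f"Unknown dataset {name!r}; options: {sorted(_DATASETS)}")
    return pd.read_csv(DATA_DIR.joinpath(_DATASETS[name]))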

Given that your sample files are only a few tens of KB, I think the above approach is the easiest to build and maintain. You mentioned how seaborn does something similar. They actually keep their sample data in a separate repo and also implement some smart caching. Their load_dataset function is here if you want to take a look. Your method may look similar.
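
For reference, seaborn's cache-on-first-use pattern boils down to roughly this sketch (the base URL and cache directory are placeholders):

from pathlib import Path
from urllib.request import urlretrieve
import pandas as pd

CACHE_DIR = Path.home() / '.cache' / 'sample-data'  # placeholder location

def load_remote_dataset(name, base_url='https://example.org/data'):
    # download the csv on first use, then serve it from the local cache
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / f'{name}.csv'
    if not local.exists():
        urlretrieve(f'{base_url}/{name}.csv', str(local))
    return pd.read_csv(local)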

@khider
Member

khider commented Jan 23, 2023

Soon the package won't be reading from LiPD directly, so I would say let's concentrate on JSON/csv and make it as small as possible.

@kcpevey kcpevey moved this from Todo to High Priority TODO in Pandas integration Jan 23, 2023
@CommonClimate
Contributor Author

Agreed. For docstring examples that use LiPD files, we can leverage the lipdverse via pylipd once that is operational, but for now let's focus on JSON or CSV datasets. @khider, I was thinking of having:

  • LR04
  • SOI
  • NINO3-AIR (for Coherency, Correlations, etc)
  • HadCRUT5 (for trends)

Anything else?

@kcpevey
Collaborator

kcpevey commented Jan 25, 2023

@CommonClimate
Contributor Author

CommonClimate commented Jan 26, 2023

Yes to all the ones you had, and good find on the HadCRUT5 one.

Re: NINO3-AIR, sorry for being so terse; it is in fact this dataset: https://github.com/LinkedEarth/Pyleoclim_util/blob/master/example_data/wtc_test_data_nino_even.csv
I would call it by the label "NINO3-AIR", though.
Let me know if you hit any roadblocks!

@CommonClimate
Contributor Author

Just having those datasets would save us a huge amount of work, and we can always add more when we see how you've done it.

@kcpevey kcpevey self-assigned this Jan 26, 2023
@kcpevey kcpevey moved this from High Priority TODO to In Progress in Pandas integration Jan 26, 2023
@kcpevey
Collaborator

kcpevey commented Feb 2, 2023

TODO:

  • add example notebook
  • add remaining datasets
  • decide on an approach

@khider
Member

khider commented Feb 2, 2023

We need to go through our example data folder and make some decisions on what to keep. I would say the four that @CommonClimate listed are a good start.

@CommonClimate
Contributor Author

Yes, and Deborah and I are discussing a good example of a MultipleSeries object, which will be required to test/demo that part of the package.

@CommonClimate
Contributor Author

re: "decide on an approach", isn't this resolved now? (metadata.yml)

@kcpevey
Collaborator

kcpevey commented Feb 3, 2023

Yes, I just marked it as resolved per the discussion here.

@kcpevey
Collaborator

kcpevey commented Feb 3, 2023

Per the meeting today, I'll un-assign myself to this one and let USC take over :)

@kcpevey kcpevey removed their assignment Feb 3, 2023
@CommonClimate
Contributor Author

@khider we need easily loadable examples of

  • MultipleSeries
  • EnsembleSeries

I see that the Pyleoclim repo's example_data folder currently contains a file called crystalcave_ens.json. Should this be added to our list of loadable datasets? If so, can you do that? For MultipleSeries, do we want to wait for pyLipd to be able to load the Euro2k dataset or can there be a json solution to that?

@CommonClimate
Contributor Author

Closing this as the method now exists and works well; we can always add more datasets later.

@CommonClimate
Contributor Author

This is what we have now:

  • 'SOI'
  • 'NINO3'
  • 'HadCRUT5'
  • 'AIR'
  • 'LR04'
  • 'AACO2'
  • 'nino_json'
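
Loading any of these is presumably a one-liner in the seaborn style (the exact call path is per the merged implementation):

import pyleoclim as pyleo

soi = pyleo.utils.load_dataset('SOI')  # any of the names listed above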

@CommonClimate
Contributor Author

Just one question for @kcpevey: how would users know what datasets are available? Is there an easy way for them to pull info from metadata.yml?

@CommonClimate CommonClimate moved this from In Progress to Done in Pandas integration Feb 13, 2023
@kcpevey
Collaborator

kcpevey commented Feb 15, 2023

I added a helper function to quickly see what datasets are available. I will add that to the documentation.
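
A sketch of what such a helper could look like, assuming it reads the top-level keys of metadata.yml (the function name and paths are assumptions, not necessarily the merged code):

import yaml  # PyYAML
from pyleoclim_util import DATA_DIR

def available_dataset_names():
    # metadata.yml maps each loadable dataset name to its file info,
    # so its top-level keys are exactly the available names
    with open(DATA_DIR.joinpath('metadata.yml')) as f:
        return sorted(yaml.safe_load(f))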

@CommonClimate
Contributor Author

Sounds perfect, thank you!
Speaking of documentation: we've decided to make ours more minimal and pandas-like, so the usual >>> doctest syntax works. Please let us know if you have any questions on this!
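
For example, a pandas-style docstring with a runnable Examples section might look like this (the dataset name is illustrative):

def load_dataset(name):
    """Load one of the bundled example datasets by name.

    Examples
    --------
    >>> import pyleoclim as pyleo
    >>> soi = pyleo.utils.load_dataset('SOI')
    """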
