
load_dataset() method #21

Closed
CommonClimate opened this issue Jan 21, 2023 · 19 comments
Labels
priority_high high priority issue

Comments

@CommonClimate
Contributor

I’d like to have Pyleoclim ship with a few key datasets that we can easily load for testing and docs, the way it’s done in Seaborn:

tips = sns.load_dataset("tips")

The datasets would be a subset of what is currently in the data directory of this repository.
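
For concreteness, the envisioned Pyleoclim equivalent might look like the sketch below (the call path and behavior are assumptions for illustration, not an existing API at this point):

import pyleoclim as pyleo

# hypothetical: fetch a bundled dataset by name, analogous to sns.load_dataset
soi = pyleo.utils.load_dataset('SOI')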

@CommonClimate CommonClimate added the priority_high high priority issue label Jan 21, 2023
@CommonClimate
Contributor Author

Considerations:

  • which datasets to include? SOI, LR04, and ... ?
  • save in original format (csv, lipd) or as JSON objects?

@kcpevey
Collaborator

kcpevey commented Jan 23, 2023

"save in original format (csv, lipd) or as JSON objects?"

Since the files are stored in git, plain-text (ASCII) formats are preferred, because git can produce a sensible diff for them. A minor change to a csv or JSON file shows up as a small diff in the "eyes" of git, whereas any change to a binary file rewrites the entire blob and just adds bulk to the git history.

That said, you probably need to test against each of the file types that the package can read from.

One way that I've done this in the past is to add a data folder in the package and put something like this in the package __init__.py:

from pathlib import Path

PACKAGE_DIR = Path(__file__).parent.resolve()
DATA_DIR = PACKAGE_DIR.joinpath("data").resolve()

Now you can import the path to the data from anywhere like this:

import pandas as pd
from pyleoclim_util import DATA_DIR

df = pd.read_csv(DATA_DIR.joinpath('soi_data.csv'))

It looks like you already have include_package_data=True in your setup.py, so those files will be included by default. There are alternatives as well: you can keep the data folder outside of the package, in which case setup.py would need a data_files or package_data entry to point to them.
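
For illustration, a package_data entry for files shipped inside the package might look like this sketch (the package name and glob patterns are placeholders, not the actual Pyleoclim configuration):

from setuptools import setup, find_packages

setup(
    name='pyleoclim_util',
    packages=find_packages(),
    # bundle the csv/json sample files that live under <package>/data/
    package_data={'pyleoclim_util': ['data/*.csv', 'data/*.json']},
)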

You'll also need to add a helper function to link between dataset name and the path to the data. I assume you'll also be loading the data inside this function.
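
A minimal sketch of such a helper, assuming a hypothetical registry of dataset names (the names and file paths below are illustrative only):

import pandas as pd
from pyleoclim_util import DATA_DIR

# hypothetical mapping from dataset label to bundled file
_DATASETS = {
    'SOI': 'soi_data.csv',
    'LR04': 'lr04.csv',
}

def load_dataset(name):
    # resolve the label to a path inside the package, then load it
    if name not in _DATASETS:
        raise ValueError(f"Unknown dataset {name!r}; options: {sorted(_DATASETS)}")
    return pd.read_csv(DATA_DIR.joinpath(_DATASETS[name]))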

Given that your sample files are only a few tens of KB, I think the above approach is the easiest to build and maintain. You mentioned how seaborn does something similar. They actually keep their sample data in a separate repo and also implement some smart caching. Their load_dataset function is here if you want to take a look. Your method may look similar.
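
For reference, seaborn's cache-on-first-use pattern boils down to roughly this sketch (the base URL and cache directory are placeholders):

from pathlib import Path
from urllib.request import urlretrieve
import pandas as pd

CACHE_DIR = Path.home() / '.cache' / 'sample-data'  # placeholder location

def load_remote_dataset(name, base_url='https://example.org/data'):
    # download the csv on first use, then serve it from the local cache
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / f'{name}.csv'
    if not local.exists():
        urlretrieve(f'{base_url}/{name}.csv', str(local))
    return pd.read_csv(local)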

@khider
Member

khider commented Jan 23, 2023

Soon the package won't be reading from LiPD directly, so I would say let's concentrate on JSON/csv and make it as small as possible.

@kcpevey kcpevey moved this from Todo to High Priority TODO in Pandas integration Jan 23, 2023
@CommonClimate
Contributor Author

Agreed. For docstring examples that use LiPD files, we can leverage the lipdverse via pylipd once that is operational, but for now let's focus on JSON or CSV datasets. @khider, I was thinking of having:

  • LR04
  • SOI
  • NINO3-AIR (for Coherency, Correlations, etc)
  • HadCRUT5 (for trends)

Anything else?

@kcpevey
Collaborator

kcpevey commented Jan 25, 2023

@CommonClimate
Contributor Author

CommonClimate commented Jan 26, 2023

Yes to all the ones you had, and good find on the HadCRUT5 one.

Re: NINO3-AIR, sorry for being so terse; it is in fact this dataset: https://github.com/LinkedEarth/Pyleoclim_util/blob/master/example_data/wtc_test_data_nino_even.csv
I would call it by the label "NINO3-AIR", though.
Let me know if you hit any roadblocks!

@CommonClimate
Contributor Author

Just having those datasets would save us a huge amount of work, and we can always add more when we see how you've done it.

@kcpevey kcpevey self-assigned this Jan 26, 2023
@kcpevey kcpevey moved this from High Priority TODO to In Progress in Pandas integration Jan 26, 2023
@kcpevey
Collaborator

kcpevey commented Feb 2, 2023

TODO:

  • add example notebook
  • add remaining datasets
  • decide on an approach

@khider
Member

khider commented Feb 2, 2023

We need to go through our example data folder and make some decisions on what to keep. I would say the four that @CommonClimate listed are a good start.

@CommonClimate
Contributor Author

Yes, and Deborah and I are discussing a good example of a MultipleSeries object, which will be required to test/demo that part of the package.

@CommonClimate
Contributor Author

re: "decide on an approach", isn't this resolved now? (metadata.yml)

@kcpevey
Collaborator

kcpevey commented Feb 3, 2023

Yes, I just marked it as resolved per the discussion here.

@kcpevey
Collaborator

kcpevey commented Feb 3, 2023

Per the meeting today, I'll un-assign myself to this one and let USC take over :)

@kcpevey kcpevey removed their assignment Feb 3, 2023
@CommonClimate
Contributor Author

@khider we need easily loadable examples of

  • MultipleSeries
  • EnsembleSeries

I see that the Pyleoclim repo's example_data folder currently contains a file called crystalcave_ens.json. Should this be added to our list of loadable datasets? If so, can you do that? For MultipleSeries, do we want to wait for pyLipd to be able to load the Euro2k dataset or can there be a json solution to that?

@CommonClimate
Contributor Author

Closing this as the method now exists and works well; we can always add more datasets later.

@CommonClimate
Contributor Author

This is what we have now:

  • 'SOI'
  • 'NINO3'
  • 'HadCRUT5'
  • 'AIR'
  • 'LR04'
  • 'AACO2'
  • 'nino_json'
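
Loading any of these is presumably a one-liner in the seaborn style (the exact call path is per the merged implementation):

import pyleoclim as pyleo

soi = pyleo.utils.load_dataset('SOI')  # any of the names listed above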

@CommonClimate
Contributor Author

Just one question for @kcpevey: how would users know what datasets are available? Is there an easy way for them to pull info from metadata.yml?

@CommonClimate CommonClimate moved this from In Progress to Done in Pandas integration Feb 13, 2023
@kcpevey
Collaborator

kcpevey commented Feb 15, 2023

I added a helper function to quickly see what datasets are available. I will add that to the documentation.
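
A sketch of what such a helper could look like, assuming it reads the top-level keys of metadata.yml (the function name and paths are assumptions, not necessarily the merged code):

import yaml  # PyYAML
from pyleoclim_util import DATA_DIR

def available_dataset_names():
    # metadata.yml maps each loadable dataset name to its file info,
    # so its top-level keys are exactly the available names
    with open(DATA_DIR.joinpath('metadata.yml')) as f:
        return sorted(yaml.safe_load(f))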

@CommonClimate
Contributor Author

Sounds perfect, thank you!
Speaking of documentation: we've decided to make ours more minimal and pandas-like, so the usual >>> doctest syntax works. Please let us know if you have any questions on this!
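
For example, a pandas-style docstring with a runnable Examples section might look like this (the dataset name is illustrative):

def load_dataset(name):
    """Load one of the bundled example datasets by name.

    Examples
    --------
    >>> import pyleoclim as pyleo
    >>> soi = pyleo.utils.load_dataset('SOI')
    """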
