load_dataset() method #21
Comments
Considerations:
Since the files are being stored in git, ASCII files are preferred, because a sensible git diff can be done on them: a minor change to a CSV or JSON file will cause minimal changes in the "eyes" of git, but a minor change to a binary file always has the effect of completely rewriting the file, which just adds bulk to the git history. That said, you probably need to test against each of the file types that the package can read from. One way that I've done this in the past is to add a data folder in the package and put something like this in it
Now you can import the path to the data from anywhere like this:
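A minimal sketch of this pattern, using hypothetical file and variable names rather than the actual Pyleoclim layout:

```python
# pyleoclim/data/__init__.py -- hypothetical file and location
from pathlib import Path

# Absolute path to the directory that ships with the sample files,
# resolved relative to this module so it works wherever the package
# is installed.
DATA_DIR = Path(__file__).resolve().parent

# Elsewhere in the package (or in tests and docstrings):
#   from pyleoclim.data import DATA_DIR
#   csv_path = DATA_DIR / "wtc_test_data_nino_even.csv"
```

Because the path is computed from `__file__`, it stays correct whether the package is run from a source checkout or an installed wheel.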
It looks like you already have an example_data folder. You'll also need to add a helper function to link a dataset name to the path to its data; I assume you'll also be loading the data inside this function. Given that you have a few tens of KB of sample files, I think the above approach is the easiest to build and maintain. You mentioned how seaborn does something similar: they actually keep their sample data in a separate repo and also implement some smart caching.
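Such a helper might look like the following sketch; the function name, file layout, and CSV/JSON loading are assumptions, not Pyleoclim's actual implementation:

```python
import csv
import json
from pathlib import Path

# Hypothetical location of the packaged sample files.
DATA_DIR = Path(__file__).resolve().parent / "data"

def load_dataset(name):
    """Resolve a dataset name to a packaged CSV or JSON file and load it (sketch)."""
    for ext, loader in ((".csv", _load_csv), (".json", _load_json)):
        path = DATA_DIR / f"{name}{ext}"
        if path.exists():
            return loader(path)
    raise ValueError(f"Unknown dataset {name!r}")

def _load_csv(path):
    # Return a list of dicts, one per row, keyed by the header line.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def _load_json(path):
    with open(path) as f:
        return json.load(f)
```

The name-to-path mapping here is just "same file name, known extensions"; a metadata file could replace that lookup if datasets need descriptions or units attached.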
Soon the package won't be reading from LiPD directly, so I would say let's concentrate on JSON/CSV and make it as small as possible.
Agreed. For docstring examples that use LiPD files, we can leverage the lipdverse.
Anything else?
@CommonClimate can you confirm the source for those files?
Yes to all the ones you had, and good find on the HadCRUT5 one. Re: NINO3-AIR, sorry for being so terse; it is in fact this dataset: https://github.com/LinkedEarth/Pyleoclim_util/blob/master/example_data/wtc_test_data_nino_even.csv
Just having those datasets would save us a huge amount of work, and we can always add more when we see how you've done it.
TODO:
We need to go through our example data folder and make some decisions on what to keep. I would say the 4 that @CommonClimate listed are a good start.
Yes, and Deborah and I are discussing a good example of a MultipleSeries object, which will be required to test/demo that part of the package.
re: "decide on an approach", isn't this resolved now? (metadata.yml)
Yes, just marked it as resolved per the discussion here.
Per the meeting today, I'll un-assign myself to this one and let USC take over :) |
@khider we need easily loadable examples of
I see that the Pyleoclim repo's example_data folder currently contains a file called |
Closing this as the method now exists and works well; we can always add more datasets later.
This is what we have now: |
Just one question for @kcpevey: how would users know what datasets are available? Is there an easy way for them to pull info from metadata.yml?
I added a helper function to quickly see what datasets are available. I will add that to the documentation.
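For illustration, such a helper could list dataset names straight from metadata.yml. This sketch only scans top-level keys so it needs no YAML dependency; the file path and function name are assumptions, not the actual Pyleoclim code:

```python
from pathlib import Path

# Hypothetical location of the metadata file describing the datasets.
METADATA = Path(__file__).resolve().parent / "metadata.yml"

def available_datasets():
    """List dataset names from metadata.yml (sketch).

    Assumes each dataset is a top-level YAML key, i.e. a line with no
    leading whitespace that ends in ':'. Enough for a quick listing;
    a real helper would more likely use a YAML parser.
    """
    names = []
    for line in METADATA.read_text().splitlines():
        stripped = line.rstrip()
        if stripped and not line[0].isspace() and stripped.endswith(":"):
            names.append(stripped.rstrip(":"))
    return sorted(names)
```

Exposing this alongside `load_dataset()` gives users a discoverable pair: one call to see what exists, one to load it.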
Sounds perfect, thank you!
I’d like to have Pyleoclim ship with a few key datasets that we can easily load for testing and docs, the way it’s done in Seaborn:
tips = sns.load_dataset("tips")
The datasets would be a subset of what is currently in the data directory of this repository.