Version datasets individually #18

leouieda · 2022-01-25T11:55:18Z

Instead of using the entire data bundle and versioning all datasets based on the module name, version them individually. To do so, pull from the release and DOIs in https://github.com/fatiando-data instead of https://github.com/fatiando/data This means that functions won't have to be repeated and updating 1 dataset doesn't mean copying all of the others along with it (since the collection would be new). Versions are now specified as a required version argument in all fetch_* functions.

The implementation still bundles information for all v1 datasets in a variable to make it easier to manage. When updating a dataset, entries for the others don't need to be repeated. Also only need 2 environment variables for setting the cache location and the data source (instead of 2 per version). A downside is that we can no longer accept a variable to set custom data source URL since each dataset has a different one. The new environment variable only sets fetching from GitHub or not.

Reminders:

Run make format and make check to make sure the code follows the style guide.
Add tests for new features or tests that would have caught the bug that you're fixing.
Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

leouieda · 2022-01-25T16:31:08Z

@santisoler could you take a quick look at ensaio/_fetchers.py? All the rest is updates to docs and moving things around so not much changed there.

The implementation I went for is to centralize the URLs, hashes, and DOIs in a single REGISTRY dictionary and have a function that generates a pooch.Pooch from it. I find it a bit clunky right now since Pooch is designed to fetch from single source mostly. Not sure if it would be best to put the information inside the fetch_ functions and wrap pooch.retrieve (which would mean reimplementing the cache environment variable code here).

Any thoughts would be welcome. Not crucial for merging this since it's all backend stuff that we can change later.

The registry now groups by data instead of version, which is more intuitive and keeps different versions of the same data close to each other. Had to change the implementation a bit but nothing major.

leouieda · 2022-02-18T16:06:11Z

Merging this in since @santisoler reviewed and gave feedback live in a developer meeting (see the meeting notes).

leouieda added 6 commits January 25, 2022 10:00

Start using fatiando-data as the new source

b9e5b4d

Update all mentions of the data source

22f6b52

Add tests for the new functions

6972111

Add all datasets, remove old code, and update gallery

08d9a92

Update Actions to cache the right datasets

c062401

Pin GMT and PyGMT versions

bf0de03

leouieda marked this pull request as ready for review January 25, 2022 16:24

leouieda requested a review from santisoler January 25, 2022 16:27

leouieda added 3 commits January 26, 2022 06:48

Pin GMT in the environment file as well

fb98d17

Update the README information to the new fatiando-data

0ac906e

Refactor to group by dataset instead of version

4a32ae3

The registry now groups by data instead of version, which is more intuitive and keeps different versions of the same data close to each other. Had to change the implementation a bit but nothing major.

leouieda merged commit 87db21f into main Feb 18, 2022

leouieda deleted the new-source branch February 18, 2022 16:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Version datasets individually #18

Version datasets individually #18

leouieda commented Jan 25, 2022

leouieda commented Jan 25, 2022

leouieda commented Feb 18, 2022

Version datasets individually #18

Version datasets individually #18

Conversation

leouieda commented Jan 25, 2022

leouieda commented Jan 25, 2022

leouieda commented Feb 18, 2022