Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initialize data version control for managing test images #1036

Merged
merged 24 commits into from
Mar 18, 2021

Conversation

weiji14
Copy link
Member

@weiji14 weiji14 commented Mar 11, 2021

Description of proposed changes

Using a data version control package called dvc to manage the PNG test images in our repository!

In a nutshell, we will only store the hash of the PNG on GitHub (in a *.png.dvc file) and the actual PNG will be stored at https://dagshub.com/GenericMappingTools/pygmt (see e.g. https://dagshub.com/GenericMappingTools/pygmt/src/data_version_control/pygmt/tests/baseline/test_logo_on_a_map.png).

image

List of commands ran so far on this 'data_version_control' branch (will update as I test things out):

Background (only needed to be done once for this repo)

Installing DVC for developing PyGMT

cd ~/path/to/pygmt
conda activate pygmt
make install
pip install dvc

DVC init

dvc init # creates .dvcignore file and .dvc/ folder
# remove .dvc/plots folder as won't be used
# git add only the .dvcignore, .dvc/.gitignore and .dvc/config file
git add .dvcignore .dvc/.gitignore .dvc/config
git commit -m "Initialize data version control"

Setup DVC remote

dvc remote add upstream https://dagshub.com/GenericMappingTools/pygmt.dvc # updates .dvc/config file with remote URL
dvc remote default upstream  # set default dvc remote to 'upstream'

When a new test is being written (Everyone will need to run)

DVC remote authentication config (changes .dvc/config.local)

dvc remote modify upstream --local auth basic
dvc remote modify upstream --local user "$DAGSHUB_USER"
dvc remote modify upstream --local password "$DAGSHUB_PASS"  # generate token at https://dagshub.com/user/settings/tokens

Generate new baseline PNG images

pytest --mpl-generate-path=baseline pygmt/tests/test_logo.py  # generate new baseline images
mv baseline/*.png pygmt/tests/baseline/
dvc add pygmt/tests/baseline/test_logo.png pygmt/tests/baseline/test_logo_on_a_map.png
git rm -r --cached 'pygmt/tests/baseline/test_logo.png'
git rm -r --cached 'pygmt/tests/baseline/test_logo_on_a_map.png'
# echo "/test_*.png" > pygmt/tests/baseline/.gitignore

Push images to DVC remote

dvc add pygmt/tests/baseline/test_logo.png pygmt/tests/baseline/test_logo_on_a_map.png
git add pygmt/tests/baseline/.gitignore pygmt/tests/baseline/test_*.png.dvc
git commit -m "Add test_logo_*.png into DVC"

dvc push --remote upstream
git push

Pull PNG images from DVC remote (needed for Github Actions CI)

dvc status  # should report any files 'not_in_cache'
dvc pull  # pull down files from DVC remote cache (fetch + checkout)
pytest pygmt/tests/test_logo.py  # run pytest as per usual

Note: Some of the dvc commands may not be necessary if we dvc install some git hooks so that dvc add/checkout/push actions happen simply with git add/checkout/push, but need to try this out a bit more. One forseeable con with this solution is that dvc may (?) run even for commits where no PNG images change, so an extra layer of slowness.

References:

Addresses #963

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If adding new functionality, add an example to docstrings or tutorials.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash commands are:

  • /format: automatically format and lint the code
  • /test-gmt-dev: run full tests on the latest GMT development version

Adding dvc package to environment.yml and
running `dvc init` to get the barebones
.dvcignore, .dvc/config & .dvc/.gitignore files.
@weiji14 weiji14 added the maintenance Boring but important stuff for the core devs label Mar 11, 2021
@weiji14 weiji14 added this to the 0.4.0 milestone Mar 11, 2021
@weiji14 weiji14 force-pushed the data_version_control branch from ea5444a to 7e0940c Compare March 12, 2021 05:19
@willschlitzer
Copy link
Contributor

Should this update requirements-dev.txt to include dvc?

@weiji14
Copy link
Member Author

weiji14 commented Mar 13, 2021

Should this update requirements-dev.txt to include dvc?

We don't have a requirements-dev.txt file anymore as of #812 😄. I wanted to put it in environment.yml but ran into some CI issues, might need to finish off #1033 first.

Oh, and will need to document how to use dvc somewhere too. I'm still getting my head around how to make it as easy as possible for everyone.

@weiji14 weiji14 changed the title WIP: Initialize data version control Initialize data version control for managing test images Mar 15, 2021
@weiji14
Copy link
Member Author

weiji14 commented Mar 15, 2021

Ok, I've added a "Using data version control (dvc)" subsection to CONTRIBUTING.md at 6bd7ba9. Have a read, see if it makes sense, and maybe try and test this Pull Request out! Feel free to ask any questions too.

environment.yml Outdated Show resolved Hide resolved
@weiji14
Copy link
Member Author

weiji14 commented Mar 17, 2021

Sorry, had to restart the regular GMT6.1.1/Python 3.9 test, dvc push didn't work because I had my credentials on my laptop instead of my uni computer 😅 Edit: Double oops, I forgot to switch back from GMT 6.2.0 to GMT 6.1.1 in my pygmt conda env so the baseline images are wrong, hold on...

I tried triggering the GMT dev version test at #1036 (comment), and looks like test_logo*.py failed (see https://github.com/GenericMappingTools/pygmt/actions/runs/662613609) because dvc pull wasn't run. Actually, the workflow run did not have the dvc pull step at all, and it seems like the changes at f3aa3c5 were not checked out, perhaps a security feature to prevent people changing the Github Actions yaml workflow in a Pull Request from doing unintended things?

@weiji14 weiji14 force-pushed the data_version_control branch from d56a31e to c37bdff Compare March 17, 2021 23:05
@seisman
Copy link
Member

seisman commented Mar 18, 2021

I saw you just migrated test_image.png to dvc. We also need to update the test_image.py script:

  • Use SI units
  • Use long-form aliases, not single-character letters.

@seisman
Copy link
Member

seisman commented Mar 18, 2021

My last concern is, how can reviewers review the image changes?

When someone runs dvc push and opens a PR, can we see the images in DAGsHUB, regardless of the 8-hr mirror interval? Do we have to checkout the branch and run dvc pull to download the images?

@weiji14
Copy link
Member Author

weiji14 commented Mar 18, 2021

My last concern is, how can reviewers review the image changes?

When someone runs dvc push and opens a PR, can we see the images in DAGsHUB, regardless of the 8-hr mirror interval? Do we have to checkout the branch and run dvc pull to download the images?

No need to do a dvc pull locally. You can view the new images directly from the Pull Request branch on DAGsHub. E.g. for this 'data_version_control' branch, go to https://dagshub.com/GenericMappingTools/pygmt/src/data_version_control/pygmt/tests/baseline/test_logo_on_a_map.png and just see the image there:

image

@seisman
Copy link
Member

seisman commented Mar 18, 2021

I just added a new image to my testing branch (commit b8bdb7c in #1071). However, I can't preview the image in DAGsHub: https://dagshub.com/GenericMappingTools/pygmt/src/fix-test-basemap-polar/pygmt/tests/baseline

@seisman
Copy link
Member

seisman commented Mar 18, 2021

We may also need to update the maintenance guides about how to review PRs with dvc images. We can do it after we have more experience with the new workflow.

@weiji14 weiji14 marked this pull request as ready for review March 18, 2021 03:30
@weiji14
Copy link
Member Author

weiji14 commented Mar 18, 2021

Will let the full tests run before merging 🚀, thanks for the review and testing things out thoroughly!

We may also need to update the maintenance guides about how to review PRs with dvc images. We can do it after we have more experience with the new workflow.

Yep, will do that in a follow up PR. I'll post a comment on #963 to summarize the new workflow for everyone else on the team.

@weiji14 weiji14 merged commit 1b74d4d into master Mar 18, 2021
@weiji14 weiji14 deleted the data_version_control branch March 18, 2021 03:57
@core-man
Copy link
Member

@weiji14 Thank you so much. This is a really amazing improvement. I already used it in #1065.

sixy6e pushed a commit to sixy6e/pygmt that referenced this pull request Dec 21, 2022
…ingTools#1036)

Using a data version control package called
[`dvc`](https://github.com/iterative/dvc)
to manage the PNG test images in the PyGMT repo!

In a nutshell, store only the hash of the PNG
on GitHub (in a *.png.dvc file), while having
the actual PNG stored on DAGsHub at
https://dagshub.com/GenericMappingTools/pygmt.

* Initialize data version control

Adding dvc package to environment.yml and
running `dvc init` to get the barebones
.dvcignore, .dvc/config & .dvc/.gitignore files.

* Set dvc remote as https://dagshub.com/GenericMappingTools/pygmt.dvc
* Temporarily installing dvc using pip instead of conda to make CI work
* Refactor test_logo to use mpl_image_compare and track png files in dvc
* Add dvc pull as a step in ci_tests.yaml to pull in data
* List files in pygmt/tests/baseline/ to see what happens after dvc pull
* Do `dvc pull` before `pip install dist/*` otherwise test PNGs aren't there
* First draft of instructions for using dvc to store baseline images

* Instruct to do `git push` first and then `dvc push`

Technically the order shouldn't matter, but most
tutorials seem to use `git push` first so follow that.

* New checklist item for maintainers to get added to DAGsHub dvc remote
* Move pygmt/tests/baseline/.gitignore to top-level
* Clarify that `git rm -r --cached` only needs to run during migration
* Try installing dvc from conda again now that there is a Py3.9 package

* Install dvc and do `dvc pull` on GMT dev tests too
* Refactor test_logo tests to be simpler and more unit-test like
* Mention dvc status command to see which files need staging
* Update test_image to use SI units and long aliases

Co-authored-by: Dongdong Tian <seisman.info@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants