Sync testdata folder to THREDDS testdata/raven #185

Closed
huard opened this issue Dec 6, 2019 · 12 comments

@huard
Contributor

huard commented Dec 6, 2019

So we can run tests entirely on the platform.

@huard
Contributor Author

huard commented Aug 26, 2020

Related: Ouranosinc/xclim#525
We could use the same strategy of creating a stand-alone repo for test data. Might make the sync process cleaner.

@tlvu
Contributor

tlvu commented Sep 21, 2020

I am trying to wrap my head around what's the real root problem here.

Basically we do not want tests to use a relative path to the testdata folder in this repo. How about just turning that relative path into an http path pointing directly to GitHub (ex: https://github.com/Ouranosinc/raven/raw/master/tests/testdata/hydro_simulations/raven-gr4j-cemaneige-sim_gr4jcn-0_Hydrographs.nc)? This way no other server is needed, and it makes the .ipynb file standalone.

As an optimization, we can detect whether the relative path exists (dev full checkout mode) and use it, keeping the http path only as a fallback (tutorial mode).

I'd rather our Thredds server not become a single point of failure for running tests.
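
A minimal sketch of the local-first fallback described above. The function name `resolve_testdata` and its `local_root` parameter are hypothetical; the base URL is taken from the example link, and nothing here is existing Raven code.

```python
from pathlib import Path
from typing import Union

# Base URL for raw files, copied from the example link in this comment.
GITHUB_RAW = "https://github.com/Ouranosinc/raven/raw/master/tests/testdata"


def resolve_testdata(relpath: str, local_root: Path) -> Union[Path, str]:
    """Return the local file if it exists, otherwise an http URL on GitHub.

    `relpath` is relative to the testdata folder and uses forward slashes.
    """
    local = local_root / relpath
    if local.exists():
        # Dev full checkout mode: no network access needed.
        return local
    # Tutorial mode: fall back to the raw file served by GitHub.
    return f"{GITHUB_RAW}/{relpath}"
```

With this shape, tests run from a full checkout never touch the network, and a notebook run without the checkout still gets a usable link.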

@huard
Contributor Author

huard commented Sep 21, 2020

  1. I think we want to split the test data from the code. This would mean creating a raven-testdata repo.
  2. Some of the tests and tutorials should exercise DAP URLs, so for some tests, we'll need a THREDDS server.

One idea is to have a github raven-testdata repo that gets synced on THREDDS when there is a release of raven-testdata. Then, we'll need a client API to fetch data either from THREDDS or from github.

@tlvu
Contributor

tlvu commented Sep 21, 2020

> 2. Some of the tests and tutorials should exercise DAP URLs, so for some tests, we'll need a THREDDS server.

Maybe Raven could spawn its own Thredds server, like Emu does (https://github.com/bird-house/emu/blob/56def2684fc28fee09089382de192075f065f3f2/docker-compose.yml#L12-L23)?

Again, I don't want all the tests to suddenly fail on Travis-CI and on every local dev workstation just because Thredds is down for maintenance. And let's say there is a new or updated dataset that is not yet synced to Thredds: how can a dev continue working?

So my point is: yes, we'll still need to sync all the data to Thredds for tutorials, but the day-to-day dev workflow should not rely on Thredds.

Which test(s) need a DAP link right now? Is some data already on Thredds, put there manually?

@tlvu
Contributor

tlvu commented Sep 21, 2020

Can Intake provide what we need, i.e. "abstract the backend storage (local file, http link, dap link)"?

@huard
Contributor Author

huard commented Sep 21, 2020

I'm wary of making the development environment more complex than it is, but I think you raise valid issues.
I agree on the need to split the test environment (stand-alone) and the tutorial environment (connected to existing data on THREDDS).

Intake: possibly. There could be a field access_type taking values of http or dap that we could filter on. Or two different catalogs.
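
To make the `access_type` idea concrete, here is a plain-Python stand-in. It is not Intake's API: the `Entry` class, the `select` helper, and the `thredds.example.org` host are all illustrative assumptions. Only the field name `access_type` and its values `http` and `dap` come from the comment above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    """One catalog record; `access_type` is the field to filter on."""
    name: str
    access_type: str  # "http" or "dap"
    url: str


# Placeholder host and paths, for illustration only.
CATALOG = [
    Entry("evap", "http",
          "https://thredds.example.org/thredds/fileServer/testdata/raven/evap.nc"),
    Entry("evap", "dap",
          "https://thredds.example.org/thredds/dodsC/testdata/raven/evap.nc"),
]


def select(catalog: List[Entry], access_type: str) -> List[Entry]:
    """Keep only the entries served through the requested access type."""
    return [entry for entry in catalog if entry.access_type == access_type]
```

The alternative mentioned above, two separate catalogs, would amount to storing the two filtered lists under different names.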

@tlvu
Contributor

tlvu commented Oct 15, 2020

> One idea is to have a github raven-testdata repo that gets synced on THREDDS when there is a release of raven-testdata. Then, we'll need a client API to fetch data either from THREDDS or from github.

Analysis:
Ouranosinc/xclim-testdata#1 (comment)

Conclusion
Ouranosinc/xclim-testdata#1 (comment)

@tlvu
Contributor

tlvu commented Oct 15, 2020

@huard Houston, we have a problem.

Looking at

    TESTDATA["raven-mohyse-rv"] = tuple(
        (TD / "raven-mohyse").glob("raven-mohyse-salmon.rv?")
    )

There are many file types other than .nc, so syncing .nc files to Thredds will not solve the entire problem. Unless you tell me Thredds can also handle .rvt, .gml, .zip, .gpkg, .tiff and .csv files.

I see 2 possible solutions to avoid having to clone the entire Raven repo for tutorial notebooks:

1 - sync only the testdata folder together with the tutorial notebooks, so we avoid syncing the entire repo, and in that example_data.py file we add a fallback: "if not available at the usual location, search for a folder raven-testdata in the same folder".

2 - add a fallback to the direct http raw file on GitHub (ex: https://github.com/Ouranosinc/raven/raw/master/tests/testdata/gr4j_cemaneige/evap.nc). This route implies hardcoding each and every testdata file, since the glob trick on the local filesystem does not work anymore. It also means no deleting or modifying existing testdata, else old revisions of the notebooks would break. The upside of this option is that each .ipynb only needs example_data.py next to it, not the entire raven-testdata/ folder. But it is still not 100% standalone. To be 100% standalone, we would need to duplicate the logic of example_data.py inside each .ipynb file. Not sure that's a good idea either, but it's an option if we really want 100% standalone .ipynb files.

I would favor option 1, unless you have a 3rd option to suggest or you prefer option 2 and can live with the limitations.
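
For reference, option 1 could look roughly like the sketch below. The helper names and the "usual location" (`tests/testdata` one level above the notebooks) are assumptions made for illustration; the actual layout used by example_data.py may differ.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_dirs(notebook_dir: Path) -> List[Path]:
    """Folders to try, in order of preference."""
    return [
        # Assumed usual location in a full checkout.
        notebook_dir.parent / "tests" / "testdata",
        # Fallback: folder synced next to the notebooks.
        notebook_dir / "raven-testdata",
    ]


def find_testdata_dir(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that is an existing directory, else None."""
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None


if __name__ == "__main__":
    TD = find_testdata_dir(candidate_dirs(Path.cwd()))
    if TD is not None:
        # The glob trick keeps working because TD is still a local folder.
        print(tuple((TD / "raven-mohyse").glob("raven-mohyse-salmon.rv?")))
```

Because the result is always a local directory, none of the existing glob-based entries would need to be hardcoded.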

@huard
Contributor Author

huard commented Oct 15, 2020

Suggestion:

  • Sync only netCDF files to THREDDS.
  • Create a function raven.tutorial.get_file that knows how to handle different file types. For netCDF files it returns either an http or a dap link from THREDDS, and for other files it returns the raw GitHub link.
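
A rough sketch of what such a function might look like. The signature, the `dap` flag, the `thredds.example.org` host and the `testdata/raven` catalog path are assumptions, not an existing Raven API; `fileServer` and `dodsC` are the usual THREDDS path segments for HTTP download and OPeNDAP access.

```python
from pathlib import PurePosixPath

# Placeholder host; the real THREDDS server address is not given in this thread.
THREDDS = "https://thredds.example.org/thredds"
# Folder name from the issue title; the exact catalog path is an assumption.
THREDDS_ROOT = "testdata/raven"
GITHUB_RAW = "https://github.com/Ouranosinc/raven/raw/master/tests/testdata"


def get_file(name: str, dap: bool = False) -> str:
    """Return a URL for the test data file `name` (relative to the testdata folder).

    netCDF files are served from THREDDS, over OPeNDAP when `dap` is True
    and plain http otherwise. Every other file type gets a raw GitHub link.
    """
    if PurePosixPath(name).suffix == ".nc":
        service = "dodsC" if dap else "fileServer"
        return f"{THREDDS}/{service}/{THREDDS_ROOT}/{name}"
    return f"{GITHUB_RAW}/{name}"
```

For example, `get_file("gr4j_cemaneige/evap.nc", dap=True)` would yield a `dodsC` link, while a `.rvt` or `.zip` file would resolve to GitHub.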

@tlvu
Contributor

tlvu commented Oct 15, 2020

>   • Sync only netCDF files to THREDDS.

Already done.

>   • Create a function raven.tutorial.get_file that knows how to handle different file types. For netCDF files it returns either an http or a dap link from THREDDS, and for other files it returns the raw GitHub link.

Just to be sure, this is a fallback only for when the full checkout is not there. On dev workstations and Travis-CI, the full checkout will be there, so I'd rather not force external dependencies when everything is available locally. I don't want devs to be unable to run tests, or Travis-CI to fail, just because our Thredds is in maintenance mode.

Where would you want this raven.tutorial.get_file function to live? Will it ship as part of Raven? If a new testdata file is added or an existing one renamed, will we need to release a new Raven? Or does it go in the same old example_data.py, with that example_data.py shipped together with all the tutorial notebooks?

@huard
Contributor Author

huard commented Oct 15, 2020

Yes, as part of Raven. I think this matches the philosophy of keeping the notebooks in sync with the code.
Yes. I don't think there is a use case for a new test data file that is not explicitly used by the code.

The example_data model is really not ideal, I believe. I think it's a source of user confusion (I answered a question about it today...).

@tlvu
Contributor

tlvu commented Oct 15, 2020

> Yes, as part of Raven. I think this matches the philosophy of keeping the notebooks in sync with the code.

Perfect, as long as we are ready to release and deploy Raven often. The notebooks are set to auto-deploy every hour. If a notebook requires a new Raven release for new testdata, it will break.

> Yes. I don't think there is a use case for a new test data file that is not explicitly used by the code.

I didn't mean new testdata that is not used. I meant new testdata needed by a notebook but not yet available in the currently deployed Raven (a Raven that is not yet up to date). Not a problem if we release Raven often for deployment.
