Enable download of large (spatial extent) cutouts from ERA5 via cdsapi. #236

Merged
merged 19 commits into master on Apr 5, 2023

Conversation

euronion
Collaborator

@euronion euronion commented May 16, 2022

Closes #221 .

Change proposed in this Pull Request

Split the ERA5 download into monthly requests (previously: one annual request) to prevent requests that are too large for the ERA5 CDS API.
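As a rough, hypothetical sketch of the idea (not the actual atlite code; the real implementation lives in atlite/data.py and atlite/datasets/era5.py), the requested time span is broken into per-month slices and each slice becomes its own CDS API request:

# Hypothetical sketch of the monthly splitting; the real code differs in detail.
import pandas as pd

def monthly_periods(start, end):
    """Yield (year, month) pairs covering the requested time range."""
    for period in pd.period_range(start, end, freq="M"):
        yield period.year, period.month

# A one-year cutout then translates into 12 separate CDS API requests.
for year, month in monthly_periods("2013-01-01", "2013-12-31"):
    print(f"retrieve ERA5 slice {year}-{month:02d}")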

TODO

  • Add month indicator to progress prompts.

Description

Motivation and Context

See #221 .

How Has This Been Tested?

Locally by downloading a large cutout.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • [n/a] Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I tested my contribution locally and it seems to work fine.
  • I locally ran pytest inside the repository and no unexpected problems came up.
  • I have adjusted the docstrings in the code appropriately.
  • I have documented the effects of my code changes in the documentation doc/.
  • [n/a] I have added newly introduced dependencies to environment.yaml file.
  • I have added a note to release notes doc/release_notes.rst.
  • I have used pre-commit run --all to lint/format/check my contribution

@fneum
Member

fneum commented May 17, 2022

How does that interact with queuing at the CDS API? Does that increase the chance of a request getting stuck in the queue around month 9 or so?

@euronion
Collaborator Author

I don't know.

The downloads for the larger cutouts worked relatively smoothly (1-2 hours), but the number of requests is 12x higher for a normal year, so the chances might be higher. On the other hand, since the downloaded slices are smaller I would not expect major performance changes. Probably acceptable, since you're not downloading cutouts on an everyday basis.

I don't know enough about the internals of the ERA5 climate store and I don't think we should optimise our retrieval routines for it as long as we haven't received any complaints for bad performance.

@euronion
Collaborator Author

Alright. I did not encounter any issues downloading large datasets. Seems to work nicely @FabianHofmann .

What would be helpful is a message indicating which month/year combination is currently being downloaded, do you have an idea on how to easily implement this @FabianHofmann ?

Then I'd suggest @davide-f tries to download his cutout as well and if that works without issues then we can merge.

@euronion euronion marked this pull request as ready for review May 31, 2022 12:08
@davide-f
Contributor

@euronion Super! Thank you very much. Currently I am a bit busy with other things and unfortunately cannot keep a machine waiting on Copernicus for a long-running analysis. As soon as I have free resources, I'll test it.
Thank you!

@FabianHofmann
Contributor

Great. For the logging I would suggest going with e.g. "2013-01" instead of "2013" only.
See

yearstr = ", ".join(atleast_1d(request["year"]))

which could be changed into

timestr = f"{request['year']}-{request['month']}"

and replaced accordingly in

varstr = "".join(["\t * " + v + f" ({yearstr})\n" for v in variables])
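A self-contained sketch of how the two snippets could fit together; the request dict and variables list below are made up for illustration, and the actual handling in atlite/datasets/era5.py (e.g. of list-valued entries) may differ:

# Hypothetical illustration of the suggested year-month logging string.
request = {"year": 2013, "month": 1, "variable": ["runoff"]}
variables = request["variable"]

timestr = f"{request['year']}-{request['month']:02d}"
varstr = "".join(f"\t * {v} ({timestr})\n" for v in variables)
print("CDS: Downloading variables\n" + varstr)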

@davide-f
Contributor

As discussed with @euronion, I'll wait for his latest updates by the end of the week (estimate), and I'll run the model for the entire world.

As a comment, the "number of slices", currently one per month, could also be made a parameter.
Anyway, we could keep the current implementation and see if it works for the world, fingers crossed.

@euronion
Collaborator Author

@davide-f You're good to give it a try!

Regarding your comment:
I had a look at the code and, if I understand the intention behind the comment correctly (optimising the retrieval), it might be easier to implement a heuristic that calculates the number of points being retrieved (np.prod([len(v) for k, v in request.items()])) and adjusts the chunking automatically so that requests stay safely below the size at which the CDS API breaks, rather than exposing a parameter for it.

If it works for you @davide-f and the time it takes is acceptable (please report it as well if you can) then I'd stay away from overoptimising this aspect and just keep the monthly retrieval.
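For reference, a hypothetical version of such a heuristic (the point limit below is an assumed placeholder, not a documented CDS API value):

# Hypothetical request-size check; MAX_POINTS is a made-up threshold.
import numpy as np

MAX_POINTS = 120_000

def needs_splitting(request, limit=MAX_POINTS):
    """Estimate the number of points in a CDS request and compare to a limit."""
    npoints = np.prod([len(np.atleast_1d(v)) for v in request.values()])
    return npoints > limit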

@davide-f
Contributor

@euronion the branch is running :) I'll track it and update you as I have news.
Just as a comment, I had to do a few tests that were interrupted; since Copernicus reduces the priority of a user's requests the more that user uses the service, this may lead to a slight overestimation of the total expected time, though I don't think it is an issue.

I totally agree on checking whether the monthly retrieval works fine and what its expected runtime is. I fear it may take a very long time though. I'll notify you as soon as I have news :)

@davide-f
Contributor

I confirm that the first 1-month chunk has been downloaded. I'll be waiting for the entire procedure to end and let you know :)

@davide-f
Contributor

davide-f commented Jun 20, 2022

@euronion The procedure for the world (±180° lat/lon) completed successfully in 5 to 12 hours (I ran it twice) and produced an output file of 380 GB (large, but we are talking about a lot of data); see the settings below.

atlite:
  nprocesses: 4
  cutouts:
    # geographical bounds automatically determined from countries input
    world-2013-era5:
      module: era5
      dx: 0.3  # cutout resolution
      dy: 0.3  # cutout resolution
      # Below customization options are dealt in an automated way depending on
      # the snapshots and the selected countries. See 'build_cutout.py'
      time: ["2013-01-01", "2014-01-01"]  # specify different weather year (~40 years available)
      x: [-180., 180.]  # manual set cutout range
      y: [-180., 180.]    # manual set cutout range

As a recommendation, in case you are interested in silencing it, the following warning was raised:

/home/davidef/miniconda3/envs/pypsa-africa/lib/python3.10/site-packages/xarray/core/indexing.py:1228: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]
To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]
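A hypothetical way to apply that option for a whole session, rather than per with block, is via dask's global configuration:

# Sets the option suggested by the warning above globally; the
# context-manager form shown in the warning limits it to a single block.
import dask

dask.config.set({"array.slicing.split_large_chunks": True})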

The output also makes sense; however, it has some weird white bands, though I don't think this is related to this PR. What do you think?
[image: plot of the generated world cutout output]

@davide-f
Contributor

As discussed, for efficiency purposes it may be interesting to let the user decide the number of chunks into which the download is divided.
Since it worked at world scale, we could specify the number of chunks as a number between 1 and 12 and divide the blocks by months, e.g. 4 chunks: months 1-3, 4-6, 7-9 and 10-12.
For small downloads it may be more efficient to download everything in one go; for Africa or Europe, for example, there is no need to split the data. Yet this is a detail as long as it works.
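A hypothetical sketch of that parameterisation, splitting the 12 months of a year into a user-chosen number of contiguous chunks:

# Hypothetical helper, not part of atlite. n_chunks=12 reproduces the current
# monthly retrieval; n_chunks=1 downloads the whole year in one request.
import numpy as np

def month_chunks(n_chunks):
    """Split months 1..12 into n_chunks contiguous groups."""
    return [[int(m) for m in group]
            for group in np.array_split(np.arange(1, 13), n_chunks)]

print(month_chunks(4))  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]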

@euronion
Collaborator Author

euronion commented Jul 15, 2022

  • Think about heuristic to download in smaller/larger chunks depending on data geographical scope to download
  • Add note to documentation on how to compress cutouts

I attempted to compress cutouts during/after creation, but without much success. Using the zlib integration of xarray, the compressed cutouts unfortunately always increased in size (rather than decreasing). Using native netCDF tools, compression of cutouts to 30-50% of their size is possible without impact on atlite performance. I want to add notes on this to the documentation with this PR, as this allows for larger cutouts.

I would have preferred a solution where compression is done by atlite directly, but it seems like that does not work well using xarray.

@codecov-commenter

codecov-commenter commented Sep 6, 2022

Codecov Report

Patch coverage: 91.66% and project coverage change: -0.09 ⚠️

Comparison is base (f9bd7fd) 72.83% compared to head (d9f3bff) 72.74%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
- Coverage   72.83%   72.74%   -0.09%     
==========================================
  Files          19       19              
  Lines        1590     1596       +6     
  Branches      227      270      +43     
==========================================
+ Hits         1158     1161       +3     
- Misses        362      363       +1     
- Partials       70       72       +2     
Impacted Files Coverage Δ
atlite/datasets/era5.py 88.23% <88.88%> (-1.70%) ⬇️
atlite/data.py 86.36% <100.00%> (+0.31%) ⬆️


☔ View full report in Codecov by Sentry.

@euronion
Collaborator Author

euronion commented Sep 6, 2022

@davide-f If you wish to reduce the file size you can follow the instructions in the updated doc:

https://github.com/PyPSA/atlite/blob/230aa8a5b1b21bff8f03d23631f01e6ebf5d83b3/examples/create_cutout.ipynb

Should save ~50% :)

@euronion
Collaborator Author

euronion commented Sep 6, 2022

The month indicator has been added; e.g. the info prompt during creation now looks like this, indicating the month currently being retrieved:

2022-09-06 14:14:27,779 INFO CDS: Downloading variables
         * runoff (2012-12)

@euronion
Collaborator Author

euronion commented Sep 6, 2022

I suggest we offload the heuristic into a separate issue and tackle it if necessary. ATM I think it would be a nice but unnecessary feature.

@euronion
Collaborator Author

euronion commented Sep 6, 2022

RTR (ready to review) @FabianHofmann, would you?

Member

@fneum fneum left a comment


Tested by @nworbmot

@euronion
Collaborator Author

euronion commented Apr 4, 2023

No idea why the CI keeps failing (no issues locally), or why it is still running the old CI.yaml with Python 3.8 instead of 3.11.

@euronion euronion merged commit 3a6f543 into master Apr 5, 2023
Development

Successfully merging this pull request may close these issues.

Impossible to download Earth-scale data
5 participants