Dramatically different download speeds between versions #371
Comments
More testing: the CDS logs tell me where the main difference lies. My guess is that the new approach is penalized by the CDS scheduler, putting jobs back in the queue instead of just executing them.
One last test with different features. I am not familiar enough with CDS to know if avoiding this one is possible, though.
Great initiative @irm-codebase! So can we narrow it down to the way we are doing the feature requests with the CDS package?
@FabianHofmann thanks! Also, I keep seeing a message to "move to CDS-Beta", so I'd also check whether this behavior applies there too.
Yes, the beta version is an important upcoming step which we have to take in the near future. If you are really motivated, you could also have a look at this and the (hopefully) performance improvements. Also note one more thing: CDS keeps files that were recently downloaded "warm" for fast re-downloading. So, whenever you have already downloaded a feature recently, it will be much faster.
For reference, this is the PR where the monthly chunking was introduced (#236), and I think this is the one where requests were split by features (#86). The CDS is undergoing a significant migration at the moment, which has resulted in throttled requests. The new CDS-BETA has been out for a couple of weeks. I will also do some investigation in the following days.
Thanks for giving this priority! I did test with the same dataset every time. A successful download (with older versions) did not seem to help the newer one, but I did not test this thoroughly.
I benchmarked a similar setup in the new CDS infrastructure (https://cds-beta.climate.copernicus.eu/). The old one will be shut down in September. The major aspect that changed between the two versions is the request chunking, so I compared annual and monthly requests:

```python
import atlite

# annual request
cutout = atlite.Cutout(
    path="cutout-annual.nc",
    module="era5",
    xs=slice(-10, -5),
    ys=slice(35, 40),
    time=slice("2015-01", "2015-12"),
)
cutout.prepare(["wind"], monthly_requests=False)
# completed in 35 minutes

# sequential monthly request
cutout = atlite.Cutout(
    path="cutout-monthly.nc",
    module="era5",
    xs=slice(0, 5),
    ys=slice(40, 45),
    time=slice("2015-01", "2015-12"),
)
cutout.prepare(["wind"], monthly_requests=True)
# completed in 49 minutes
```

It indeed seems that reverting to annual requests was faster in the test cases and may make sense for cutouts with a smaller geographical extent, though the exact trade-off is unclear. One thing I noticed is that the monthly requests are made sequentially rather than in parallel. I don't think this used to be the case, and it could add to the queuing times. I will test next whether we can do parallel requests and whether this achieves a speed-up.

Logs: here are the request logs from https://cds-beta.climate.copernicus.eu/requests?tab=all (the last item is the annual request):
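As an aside, the monthly chunking can be illustrated in isolation. The following is a hypothetical sketch; the function name and return format are illustrative and are not atlite's actual `retrieval_times` API:

```python
# Hypothetical sketch of splitting an annual time range into monthly request
# chunks; names and return format are illustrative, not atlite's actual API.
import calendar

def monthly_chunks(year):
    """Return (start, end) date strings for each month of the given year."""
    chunks = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]
        chunks.append((f"{year}-{month:02d}-01", f"{year}-{month:02d}-{last_day}"))
    return chunks

chunks = monthly_chunks(2015)
print(len(chunks))  # 12 requests instead of a single annual one
print(chunks[0])    # ('2015-01-01', '2015-01-31')
```

Each chunk becomes one CDS request, so the annual cutout turns into 12 separate jobs that each wait in the queue.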
Using dask's `delayed` and `compute`, in atlite/atlite/datasets/era5.py, lines 451 to 453 (commit d04aff0), I changed:

```diff
- datasets = map(
-     retrieve_once, retrieval_times(coords, monthly_requests=monthly_requests)
- )
+ time_chunks = retrieval_times(coords, monthly_requests=monthly_requests)
+ delayed_datasets = [delayed(retrieve_once)(chunk) for chunk in time_chunks]
+ datasets = compute(*delayed_datasets)
```

xref: 1866fc3

I tested similarly to the cases above:

```python
# concurrent monthly request
cutout = atlite.Cutout(
    path="cutout-monthly-parallel.nc",
    module="era5",
    xs=slice(-5, 0),
    ys=slice(45, 50),
    time=slice("2015-01", "2015-12"),
)
cutout.prepare(["wind"], monthly_requests=True)
# completed in 23 minutes
```

With 23 minutes queueing time, this is even faster. However, it should be noted that the overall queue could have had different lengths and, therefore, the numbers can only be indicative. The concurrent time chunk requests should be optional, though, as too many parallel requests may get you throttled (https://confluence.ecmwf.int/display/CEMS/CDS+-+Best+Practices):
This might especially happen if you request 5 different features, which already uses 5 parallel requests.

Logs
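For reference, the `delayed`/`compute` pattern from the patch above can be reproduced in isolation. This is a minimal sketch that assumes dask is installed, with a dummy function standing in for the actual CDS retrieval:

```python
# Minimal sketch of the delayed/compute pattern from the patch above;
# retrieve_once here is a dummy stand-in for a single CDS retrieval.
from dask import compute, delayed

def retrieve_once(chunk):
    # stand-in for a single CDS retrieval call
    return f"dataset for {chunk}"

time_chunks = ["2015-01", "2015-02", "2015-03"]

# build the task graph lazily, then execute all tasks at once
delayed_datasets = [delayed(retrieve_once)(chunk) for chunk in time_chunks]
datasets = compute(*delayed_datasets)

print(datasets)
# ('dataset for 2015-01', 'dataset for 2015-02', 'dataset for 2015-03')
```

Unlike `map`, which here consumed the chunks one by one, `compute` hands the whole batch to dask's scheduler so the (network-bound) retrievals can run concurrently.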
Final follow-up (and with that, I'm quite happy with the solutions proposed in #372). I tried a bulk concurrent submission with 50 simultaneous requests for different features:

```python
cutout = atlite.Cutout(
    path="cutout-annual-parallel-features.nc",
    module="era5",
    xs=slice(0, 5),
    ys=slice(45, 50),
    time=slice("2015-01", "2015-12"),
)
cutout.prepare(
    ["height", "wind", "influx", "temperature", "runoff"],
    monthly_requests=True,
    concurrent_requests=True,
)
```

This one finished in 26 minutes, so it took a comparable time to downloading a single feature concurrently.
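When firing off many requests at once, a bounded worker pool is one generic way to stay under the throttling limits mentioned above. This is a stdlib sketch, not atlite's implementation; `fetch` is a hypothetical stand-in for a CDS retrieval call:

```python
# Generic sketch: cap the number of simultaneous download requests with a
# bounded thread pool. fetch() is a stand-in for an actual CDS API call.
from concurrent.futures import ThreadPoolExecutor

def fetch(month):
    # placeholder for a real CDS retrieval
    return f"downloaded chunk {month}"

months = [f"2015-{m:02d}" for m in range(1, 13)]

# max_workers bounds concurrency: at most 4 requests are in flight at once
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, months))

print(len(results))  # 12
print(results[0])    # downloaded chunk 2015-01
```

`pool.map` preserves input order, so the downloaded chunks come back in the same order they were submitted even though they complete concurrently.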
Thank you for checking it so fast! @fneum, does this mean I won't experience a slowdown if I want to download several individual cutouts in tandem? Some of my scripts follow this approach, and it would be great if I did not have to modify their order (e.g., first download a big dataset, then process it).
I think that's right. |
Further speedups can be achieved by
Version Checks (indicate both or one)

- [x] I have confirmed this bug exists on the latest release of Atlite.
- [x] I have confirmed this bug exists on the current master branch of Atlite.

Issue Description

It seems like older atlite versions achieved faster downloads... somehow? I've been comparing download speeds between atlite=0.2.1 and atlite=0.2.13, and the former consistently beats out the latter when it comes to download speeds (tested around half a dozen times). The difference is quite dramatic: minutes to hours.

Reproducible Example

Expected Behavior

Download speeds should generally improve between versions, or remain unchanged.

Installed Versions

old: 0.2.1
new: 0.2.13