downloading dataset took longer than specified #10
Oh yeah, I mentioned at #3 (comment) that I forgot to update that comment after commit a904131 😅 It took about 793 min, or 13.2 hours, for me to download 18.2GB of data. You're correct that the rechunking makes things slower. It might actually be faster to download the Zarr store first and then rechunk it, but that would require some knowledge of xarray.
If you do want to try though, the steps are:
(cf. foss4g2023oceania/0_weatherbench2zarr.py, lines 41 to 45 at 04e87ab)

```python
import xarray as xr

# Open the downloaded (not yet rechunked) Zarr store lazily
store_name: str = "2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr"
ds: xr.Dataset = xr.open_dataset(
    filename_or_obj=store_name, engine="zarr", chunks="auto", consolidated=True,
)
# Rechunk to size 1 along time, full size along latitude/longitude
# (note: `ds`, not `ds_500hpa_zuv` as in the script, since the store
# opened here is already the 500hPa z/u/v subset)
ds_rechunked: xr.Dataset = ds.chunk(
    time=1,
    latitude=len(ds.latitude),
    longitude=len(ds.longitude),
)
# Write the rechunked dataset out to a new Zarr store
ds_rechunked.to_zarr(
    store="2020-full_37-6h-0p25deg-chunk-1_zuv500_rechunked.zarr",
    consolidated=True,
    zarr_version=2,
)
```
How big is the full .zarr datastore, if I want to save it on disk first?
I got this error after trying to save the dataset first:

(screenshot of the error not preserved in this transcript)
The rechunked Zarr store should be 18.2GB, and the original one should be about the same, I think; I don't think the chunk sizes change the disk usage too much.
Hmm, could you try using:

```python
ds_rechunked: xr.Dataset = ds_500hpa_zuv.unify_chunks()
ds_rechunked.to_zarr(store=store_name, consolidated=True, zarr_version=2)
```
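For context: `unify_chunks()` just makes all the variables in the Dataset share a consistent chunk grid, so if the error was about inconsistent or irregular chunk sizes when writing to Zarr, this should sidestep it without doing the expensive full rechunk.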
I tried running this on Google Colab. I can see it's downloading at 1 Gb/s, full speed, and it's still taking a long time; it has already run for about 1.5 hours, and 18GB shouldn't take this long, so something must be wrong. Is it trying to read through the whole dataset (which might be in the TB range) to get this 18GB subset?

Update: it finally completed after almost 3 hours. Below is an excerpt of the `du -h` output; it's 65G in total:

```
20K    2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/longitude
```

Downloading 65G shouldn't take 3 hours on Colab's fast Google-datacenter pipe. Taking 1 Gb/s as roughly 100 MB/s, 3 hours is 3 × 60 × 60 × 100MB ≈ 1TB. Why are we downloading ~1TB of data to get a 65G sub-dataset?
Glad you got it running! I would be surprised if we did download 1TB of data 😅 To be honest, I was rushing to prepare the presentation and really just needed some sample data to run the benchmarks, so I didn't focus on making that part faster. The download script could definitely be a lot more optimized, I'm sure.
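If anyone wants to experiment with speeding it up, one idea would be to subset the remote store before writing anything, so dask only reads the chunks that overlap the selection. A rough, untested sketch; the bucket path placeholder and the variable names below are my guesses at the WeatherBench2 layout, not necessarily what the script uses:

```python
import xarray as xr  # needs gcsfs installed for gs:// paths

# Open the remote ERA5 store lazily, keeping its native chunking
ds = xr.open_zarr(
    "gs://weatherbench2/datasets/era5/<full-store-name>.zarr",  # placeholder path
    consolidated=True,
)

# Select only z/u/v at 500hPa for 2020 *before* any data is read
subset = ds[["geopotential", "u_component_of_wind", "v_component_of_wind"]].sel(
    level=500, time=slice("2020-01-01", "2020-12-31")
)

# Write with the stored chunking; rechunk locally afterwards if needed
subset.to_zarr("2020_zuv500.zarr", consolidated=True)
```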
Running `python 0_weatherbench2zarr.py` for almost 12 hours, `du -h` on the Zarr directory shows only 2.2GB of data downloaded so far.
I can see the connection is saturated at around 150 Mb/s; it could have downloaded something like 500GB in the time taken.
Is there something wrong with the chunking, that it's using so much bandwidth and taking so long?
I'm not familiar with xarray; is this expected to take this long?
The comment in the Python code says:

```python
# Save to Zarr with chunks of size 1 along time dimension
# Can take about 1 hour to save 10.7GB of data at 40MB/s
```

but it's taken more time than that, and more bandwidth.
Edit: thought this might help; the warnings below were shown while running:

```
/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "level" starting at index 34. This could degrade performance. Instead, consider rechunking after loading.
  warnings.warn(
/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "latitude" starting at index 701. This could degrade performance. Instead, consider rechunking after loading.
  warnings.warn(
/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "longitude" starting at index 1404. This could degrade performance. Instead, consider rechunking after loading.
  warnings.warn(
```
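For what it's worth, those warnings come from xarray itself: passing a `chunks=` at open time that splits the stored chunks forces many small partial reads of each stored chunk. A minimal sketch of the pattern the warning recommends (the store path here is a placeholder):

```python
import xarray as xr

# chunks={} keeps the chunking as stored on disk, so each stored
# chunk is read exactly once
ds = xr.open_zarr("gs://<remote-store>.zarr", chunks={}, consolidated=True)

# Subset first, so chunks outside the selection are never downloaded
ds_500 = ds.sel(level=500)

# Rechunk *after* loading/subsetting, as the warning suggests
ds_500 = ds_500.chunk({"time": 1})
```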