downloading dataset took longer than specified #10

Open
andycjw opened this issue Feb 6, 2024 · 7 comments
Comments


andycjw commented Feb 6, 2024

I've been running 'python 0_weatherbench2zarr.py' for almost 12 hours, and 'du -h' on the Zarr directory shows it has downloaded only 2.2GB of data.

I can see the internet connection is saturated at around 150 Mb/s, so it could have downloaded something like 500GB in the time taken.

Is there something wrong with the chunking that makes it use so much bandwidth and take so long?

I'm not familiar with xarray; is it expected to take this long?

The comment in the Python code says:

# Save to Zarr with chunks of size 1 along time dimension
# Can take about 1 hour to save 10.7GB of data at 40MB/s

but it has taken more time and more bandwidth than that.

Edit: thought this might help; the warnings below were shown while running:

/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "level" starting at index 34. This could degrade performance. Instead, consider rechunking after loading.
warnings.warn(
/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "latitude" starting at index 701. This could degrade performance. Instead, consider rechunking after loading.
warnings.warn(
/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "longitude" starting at index 1404. This could degrade performance. Instead, consider rechunking after loading.
warnings.warn(
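
In case it's useful, here's a sketch of what I could run to inspect the chunk sizes stored in the source Zarr that these warnings say are being split (the store path is a placeholder, not necessarily the exact one 0_weatherbench2zarr.py opens):

import xarray as xr

# Placeholder path; substitute the WeatherBench2 store the script actually opens.
source_store = "gs://weatherbench2/datasets/era5/<...>.zarr"

# chunks={} opens the dataset lazily using the on-disk chunking
ds_src = xr.open_dataset(filename_or_obj=source_store, engine="zarr", chunks={})

# The Zarr chunk shape of each variable is recorded in its encoding
for name, var in ds_src.data_vars.items():
    print(name, var.encoding.get("chunks"))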


weiji14 commented Feb 6, 2024

Oh yeah, I mentioned at #3 (comment) that I forgot to update that comment after commit a904131 😅 It took about 793 min (roughly 13.2 hours) for me to download 18.2GB of data. You're correct that the rechunking makes things slower. It might actually be faster to download the Zarr store first and then rechunk it, but that would require some knowledge of xarray.


weiji14 commented Feb 6, 2024

If you do want to try it though, the steps are:

  1. Disable the rechunking on the fly by commenting out lines L41-45 here:

ds_rechunked: xr.Dataset = ds_500hpa_zuv.chunk(
    time=1,
    latitude=len(ds_500hpa_zuv.latitude),
    longitude=len(ds_500hpa_zuv.longitude),
)

  2. After the file is saved to 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr, you'll need to open it and do the rechunking. Something like so:
import xarray as xr

# Open the Zarr store that was saved without rechunking
store_name: str = "2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr"
ds: xr.Dataset = xr.open_dataset(
    filename_or_obj=store_name, engine="zarr", chunks="auto", consolidated=True,
)

# Rechunk to size 1 along time, keeping full latitude/longitude slices
ds_rechunked: xr.Dataset = ds.chunk(
    time=1,
    latitude=len(ds.latitude),
    longitude=len(ds.longitude),
)

# Save the rechunked copy to a new Zarr store
ds_rechunked.to_zarr(store="2020-full_37-6h-0p25deg-chunk-1_zuv500_rechunked.zarr", consolidated=True, zarr_version=2)
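
To double-check the chunk layout, a quick sanity-check sketch that just prints the dask chunk sizes of the rechunked dataset:

# Expect chunks of size 1 along time, and a single full-length chunk along latitude/longitude
print(ds_rechunked.chunks)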


andycjw commented Feb 7, 2024

How big is the full .zarr datastore, if I save it to disk first?
I suspect it's so slow because it keeps reading the whole Zarr into RAM to take a subset chunk, then re-reads the whole file from the internet into RAM again, in a loop.


andycjw commented Feb 7, 2024

I got this error after trying to save the dataset first:

...., 1, 1), (34, 3), (701, 20), (1404, 36)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using chunk(), deleting or modifying encoding['chunks'], or specify safe_chunks=False.


weiji14 commented Feb 7, 2024

> How big is the full .zarr datastore, if I save it to disk first?

The rechunked Zarr store should be 18.2GB, and the original one should be about the same, I think; I don't think the chunk sizes change the disk usage too much.

> I got this error after trying to save the dataset first:
>
> ...., 1, 1), (34, 3), (701, 20), (1404, 36)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using chunk(), deleting or modifying encoding['chunks'], or specify safe_chunks=False.

Hmm, could you try using unify_chunks()? Something like:

ds_rechunked: xr.Dataset = ds_500hpa_zuv.unify_chunks()
ds_rechunked.to_zarr(store=store_name, consolidated=True, zarr_version=2)
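
If unify_chunks() alone doesn't do it, the error message also suggests deleting encoding['chunks'] (or passing safe_chunks=False); here's a rough, untested sketch of the first option:

# Untested sketch: drop the chunk sizes inherited from the source store's encoding,
# so to_zarr derives the Zarr chunks from the dask chunking instead
for name in ds_rechunked.variables:
    ds_rechunked[name].encoding.pop("chunks", None)
ds_rechunked.to_zarr(store=store_name, consolidated=True, zarr_version=2)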


andycjw commented Feb 7, 2024

I tried running this on Google Colab. I can see it's downloading at full speed, around 1 Gb/s, and it's still taking a long time; it has already run for about an hour and a half. 18GB shouldn't take this long, so something must be wrong.

Is it trying to read through the whole dataset (which might be terabytes) to get this 18GB subset?

Update: it finally completed after almost 3 hours. Below is the 'du -h' output; it's 65G in total:

20K 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/longitude
22G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/v_component_of_wind
16K 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/latitude
22G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/geopotential
22G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/u_component_of_wind
192K 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/time
65G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/

Downloading 65G shouldn't take 3 hours on Google Colab's fast datacenter pipe: 1 Gb/s for 3 hours ≈ 3 * 60 * 60 * 100 MB ≈ 1TB. Why are we downloading about 1TB of data to get a 65G sub-dataset?
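
One way I thought of to sanity-check the logical size of the subset versus the full store (a rough sketch; the store path and the selection below are placeholders for whatever 0_weatherbench2zarr.py actually does):

import xarray as xr

# Placeholder path; substitute the WeatherBench2 store the script opens.
source_store = "gs://weatherbench2/datasets/era5/<...>.zarr"
ds_full = xr.open_dataset(filename_or_obj=source_store, engine="zarr", chunks={})

# Hypothetical selection mirroring the script: z/u/v variables at 500hPa for 2020
ds_subset = ds_full[["geopotential", "u_component_of_wind", "v_component_of_wind"]].sel(
    level=500, time=slice("2020-01-01", "2020-12-31")
)

# Logical (uncompressed) sizes in GB, computed lazily without downloading anything;
# a big gap between bytes transferred and ds_subset.nbytes would confirm the suspicion
print(ds_full.nbytes / 1e9, ds_subset.nbytes / 1e9)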


weiji14 commented Feb 12, 2024

Glad you got it running!

I would be surprised if we did download 1TB of data 😅 To be honest, I was rushing to prepare the presentation and really just needed some sample data to run the benchmarks, so I didn't focus on making that part faster. The download script could definitely be a lot more optimized, I'm sure.
