downloading dataset took longer than specified #10

Open
andycjw opened this issue Feb 6, 2024 · 7 comments
Comments


andycjw commented Feb 6, 2024

I've been running 'python 0_weatherbench2zarr.py' for almost 12 hours, and 'du -h' on the Zarr directory shows it has downloaded only 2.2GB of data.

I can see the internet connection is saturated at around 150 Mb/s, so it could have downloaded something like 500GB in the time taken.

Is there something wrong with the chunking that makes it use so much bandwidth and take so long?

I'm not familiar with xarray; is it expected to take this long?

The comment in the Python code says:

# Save to Zarr with chunks of size 1 along time dimension
# Can take about 1 hour to save 10.7GB of data at 40MB/s

but it has taken more time and more bandwidth than that.

Edit: thought this might help; the warnings below were shown while running:

/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "level" starting at index 34. This could degrade performance. Instead, consider rechunking after loading.
warnings.warn(
/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "latitude" starting at index 701. This could degrade performance. Instead, consider rechunking after loading.
warnings.warn(
/envs/foss4g2023oceania/lib/python3.10/site-packages/xarray/core/dataset.py:270: UserWarning: The specified chunks separate the stored chunks along dimension "longitude" starting at index 1404. This could degrade performance. Instead, consider rechunking after loading.
warnings.warn(
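
In case it's useful, here's a sketch of what I could run to inspect the chunk sizes stored in the source Zarr that these warnings say are being split (the store path is a placeholder, not necessarily the exact one 0_weatherbench2zarr.py opens):

import xarray as xr

# Placeholder path; substitute the WeatherBench2 store the script actually opens.
source_store = "gs://weatherbench2/datasets/era5/<...>.zarr"

# chunks={} opens the dataset lazily using the on-disk chunking
ds_src = xr.open_dataset(filename_or_obj=source_store, engine="zarr", chunks={})

# The Zarr chunk shape of each variable is recorded in its encoding
for name, var in ds_src.data_vars.items():
    print(name, var.encoding.get("chunks"))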


weiji14 commented Feb 6, 2024

Oh yeah, I mentioned at #3 (comment) that I forgot to update that comment after commit a904131 😅 It took about 793 min (roughly 13.2 hours) for me to download 18.2GB of data. You're correct that the rechunking makes things slower. It might actually be faster to download the Zarr store first and then rechunk it, but that would require some knowledge of xarray.


weiji14 commented Feb 6, 2024

If you do want to try it though, the steps are:

  1. Disable the rechunking on the fly by commenting out lines L41-45 here:

ds_rechunked: xr.Dataset = ds_500hpa_zuv.chunk(
    time=1,
    latitude=len(ds_500hpa_zuv.latitude),
    longitude=len(ds_500hpa_zuv.longitude),
)

  2. After the file is saved to 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr, you'll need to open it and do the rechunking. Something like so:
import xarray as xr

# Open the Zarr store that was saved without rechunking
store_name: str = "2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr"
ds: xr.Dataset = xr.open_dataset(
    filename_or_obj=store_name, engine="zarr", chunks="auto", consolidated=True,
)

# Rechunk to size 1 along time, keeping full latitude/longitude slices
ds_rechunked: xr.Dataset = ds.chunk(
    time=1,
    latitude=len(ds.latitude),
    longitude=len(ds.longitude),
)

# Save the rechunked copy to a new Zarr store
ds_rechunked.to_zarr(store="2020-full_37-6h-0p25deg-chunk-1_zuv500_rechunked.zarr", consolidated=True, zarr_version=2)
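
To double-check the chunk layout, a quick sanity-check sketch that just prints the dask chunk sizes of the rechunked dataset:

# Expect chunks of size 1 along time, and a single full-length chunk along latitude/longitude
print(ds_rechunked.chunks)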


andycjw commented Feb 7, 2024

How big is the full .zarr datastore, if I save it to disk first?
I suspect it's so slow because it keeps reading the whole Zarr into RAM to take a subset chunk, then re-reads the whole file from the internet into RAM again, in a loop.


andycjw commented Feb 7, 2024

I got this error after trying to save the dataset first:

...., 1, 1), (34, 3), (701, 20), (1404, 36)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using chunk(), deleting or modifying encoding['chunks'], or specify safe_chunks=False.


weiji14 commented Feb 7, 2024

> How big is the full .zarr datastore, if I save it to disk first?

The rechunked Zarr store should be 18.2GB, and the original one should be about the same, I think; I don't think the chunk sizes change the disk usage too much.

> I got this error after trying to save the dataset first:
>
> ...., 1, 1), (34, 3), (701, 20), (1404, 36)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using chunk(), deleting or modifying encoding['chunks'], or specify safe_chunks=False.

Hmm, could you try using unify_chunks()? Something like:

ds_rechunked: xr.Dataset = ds_500hpa_zuv.unify_chunks()
ds_rechunked.to_zarr(store=store_name, consolidated=True, zarr_version=2)
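
If unify_chunks() alone doesn't do it, the error message also suggests deleting encoding['chunks'] (or passing safe_chunks=False); here's a rough, untested sketch of the first option:

# Untested sketch: drop the chunk sizes inherited from the source store's encoding,
# so to_zarr derives the Zarr chunks from the dask chunking instead
for name in ds_rechunked.variables:
    ds_rechunked[name].encoding.pop("chunks", None)
ds_rechunked.to_zarr(store=store_name, consolidated=True, zarr_version=2)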


andycjw commented Feb 7, 2024

I tried running this on Google Colab. I can see it's downloading at full speed, around 1 Gb/s, and it's still taking a long time; it has already run for about an hour and a half. 18GB shouldn't take this long, so something must be wrong.

Is it trying to read through the whole dataset (which might be terabytes) to get this 18GB subset?

Update: it finally completed after almost 3 hours. Below is the 'du -h' output; it's 65G in total:

20K 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/longitude
22G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/v_component_of_wind
16K 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/latitude
22G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/geopotential
22G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/u_component_of_wind
192K 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/time
65G 2020-full_37-6h-0p25deg-chunk-1_zuv500.zarr/

Downloading 65G shouldn't take 3 hours on Google Colab's fast datacenter pipe: 1 Gb/s for 3 hours ≈ 3 * 60 * 60 * 100 MB ≈ 1TB. Why are we downloading about 1TB of data to get a 65G sub-dataset?
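
One way I thought of to sanity-check the logical size of the subset versus the full store (a rough sketch; the store path and the selection below are placeholders for whatever 0_weatherbench2zarr.py actually does):

import xarray as xr

# Placeholder path; substitute the WeatherBench2 store the script opens.
source_store = "gs://weatherbench2/datasets/era5/<...>.zarr"
ds_full = xr.open_dataset(filename_or_obj=source_store, engine="zarr", chunks={})

# Hypothetical selection mirroring the script: z/u/v variables at 500hPa for 2020
ds_subset = ds_full[["geopotential", "u_component_of_wind", "v_component_of_wind"]].sel(
    level=500, time=slice("2020-01-01", "2020-12-31")
)

# Logical (uncompressed) sizes in GB, computed lazily without downloading anything;
# a big gap between bytes transferred and ds_subset.nbytes would confirm the suspicion
print(ds_full.nbytes / 1e9, ds_subset.nbytes / 1e9)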


weiji14 commented Feb 12, 2024

Glad you got it running!

I would be surprised if we did download 1TB of data 😅 To be honest, I was rushing to prepare the presentation and really just needed some sample data to run the benchmarks, so I didn't focus on making that part faster. The download script could definitely be a lot more optimized, I'm sure.
