I have a multifile dataset made up of month-long, 8-hourly netcdf files spanning nearly 30 years. The files are available from ftp://ftp.ifremer.fr/ifremer/ww3/HINDCAST/GLOBAL/, and I'm specifically looking at e.g. 1990_CFSR/hs/ww3.199001_hs.nc for each year and month. Each file is about 45 MB, for about 15 GB total.
I want to calculate some lognormal distribution parameters of the Hs variable at each grid point (actually, only at a smallish subset of points, using a mask). However, if I load the data with open_mfdataset and try to read a single lat/lon grid cell, my computer tanks and python gets killed for running out of memory (I have 16 GB, but even if I only open one year of data, ~500 MB, python ends up using 27% of my memory).
Is there a way in xarray/dask to force dask to only read single sub-arrays at a time? I have tried using lat/lon chunking along the lines of the snippet below, but that doesn't seem to improve things.
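(The original snippet didn't survive the copy; this is a minimal sketch of the kind of call I mean. The dimension names `latitude`/`longitude` and the selected indices are assumptions, not taken from the original report.)

```python
import xarray as xr

# Sketch only: chunk along lat/lon so reading one grid cell should, in
# principle, only touch one chunk per file. Dimension names are assumed.
ds = xr.open_mfdataset("ww3.19????_hs.nc",
                       chunks={"latitude": 1, "longitude": 1})

# Placeholder indices for a single grid cell.
hs_point = ds["hs"].isel(latitude=100, longitude=200)
```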
Is there any way around this problem? I guess I could try using preprocess= to sub-select grid cells (something like the sketch below) and loop over that, but that seems like it would require opening and reading each file 317*720 times, which sounds like a recipe for a long wait.
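For reference, a rough sketch of what I mean by the preprocess= approach; the indices are placeholders, and in practice this would have to be repeated for every masked cell:

```python
import xarray as xr

def select_cell(ds, ilat=100, ilon=200):
    # Placeholder indices; these would be looped over the masked cells.
    return ds.isel(latitude=ilat, longitude=ilon)

# Each call opens and reads every file just to extract one grid cell.
ds_cell = xr.open_mfdataset("ww3.*_hs.nc", preprocess=select_cell)
```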
Have you seen #1823? It sounds like you might be having the same issue: xarray loads coordinate information into memory to check that alignment is correct, but for many datasets with large coordinate arrays this can be prohibitive.
You know your variables are aligned, so you could try the workaround suggested in that thread: pass the coordinate names to drop_variables, then update them from a single master dataset (because presumably your latitude and longitude don't depend on time!).
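An untested sketch of that workaround; the coordinate variable names are a guess for these WW3 files:

```python
import xarray as xr
from glob import glob

files = sorted(glob("ww3.*_hs.nc"))

# Read the static lat/lon coordinates once, from a single "master" file.
master = xr.open_dataset(files[0])

# Skip reading/aligning those coordinates in every file...
ds = xr.open_mfdataset(files, drop_variables=["latitude", "longitude"])

# ...then put them back from the master dataset.
ds = ds.assign_coords(latitude=master["latitude"],
                      longitude=master["longitude"])
```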