Reading single grid cells from a multi-file netcdf dataset? #2979

Open
naught101 opened this issue May 22, 2019 · 1 comment

naught101 commented May 22, 2019

I have a multi-file dataset made up of month-long, 8-hourly netCDF files spanning nearly 30 years. The files are available from ftp://ftp.ifremer.fr/ifremer/ww3/HINDCAST/GLOBAL/, and I'm specifically looking at e.g. 1990_CFSR/hs/ww3.199001_hs.nc for each year and month. Each file is about 45 MB, for about 15 GB in total.

I want to calculate some lognormal distribution parameters of the Hs variable at each grid point (actually, only a smallish subset of points, selected with a mask). However, if I load the data with open_mfdataset and try to read a single lat/lon grid cell, my computer grinds to a halt and Python gets killed for running out of memory. I have 16 GB of RAM, but even if I only open 1 year of data (~500 MB), Python ends up using 27% of my memory.

Is there a way in xarray/dask to force dask to only read single sub-arrays at a time? I have tried using lat/lon chunking, e.g.

import xarray as xr

# Chunk so that each dask block holds a single lat/lon grid cell
mfdata_glob = '/home/nedcr/cr/data/wave/*1990*.nc'
global_ds = xr.open_mfdataset(
    mfdata_glob,
    chunks={'latitude': 1, 'longitude': 1})

but that doesn't seem to improve things.

Is there any way around this problem? I guess I could try using preprocess= to sub-select grid cells, and loop over that, but that seems like it would require opening and reading each file 317*720 times, which sounds like a recipe for a long wait.
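
The preprocess= idea does not necessarily need one pass per grid cell. As a rough sketch, and not something tested against the actual WW3 files, a single preprocess callback could keep all of the wanted cells at once using xarray's vectorized (pointwise) indexing, so each file is opened only once. The helper names `make_points_selector` and `open_points`, the new `points` dimension, and the `time` dimension name are assumptions made for illustration; `latitude` and `longitude` are taken from the chunking call above.

```python
import xarray as xr


def make_points_selector(lats, lons):
    """Build a preprocess callback that keeps only the requested grid cells.

    lats and lons are equal-length sequences; each (lat, lon) pair is
    matched to its nearest grid cell and stacked along a new "points" dim.
    """
    lat_idx = xr.DataArray(list(lats), dims="points")
    lon_idx = xr.DataArray(list(lons), dims="points")

    def select_points(ds):
        # Vectorized (pointwise) selection: one cell per (lat, lon) pair
        return ds.sel(latitude=lat_idx, longitude=lon_idx, method="nearest")

    return select_points


def open_points(file_glob, lats, lons):
    """Open every file once, trimming each to the chosen cells before concatenating."""
    return xr.open_mfdataset(
        file_glob,
        preprocess=make_points_selector(lats, lons),
        combine="nested",
        concat_dim="time",  # assumed name of the time dimension
    )
```

Something like `open_points(mfdata_glob, masked_lats, masked_lons)` would then return a dataset with dimensions time and points, where `masked_lats` and `masked_lons` are hypothetical arrays of the coordinates picked out by the mask. Whether this actually keeps memory use down for these files is untested here.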

naught101 changed the title from "Reading single grid cells from a? multi-file netcdf dataset" to "Reading single grid cells from a multi-file netcdf dataset?" on May 22, 2019

TomNicholas (Member) commented

Have you seen #1823? It sounds like you might be having the same issue: xarray loads coordinate information into memory to check alignment is correct, but for many datasets with large coordinate arrays this could be prohibitive.

You know your variables are aligned so you could try the workaround suggested in that thread: give the coordinates to drop_variables, then update them from a single master dataset (because presumably your latitude and longitude don't depend on time!)
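
A minimal sketch of that workaround might look like the following. It is an illustration under assumptions, not the exact code from #1823: the helper names `restore_coords` and `open_without_coord_checks` are hypothetical, the coordinate names `latitude` and `longitude` come from the chunking call in the question, and `time` is assumed to be the name of the concatenation dimension. The function takes an explicit, sorted list of file paths rather than a glob string.

```python
import xarray as xr

COORD_NAMES = ("latitude", "longitude")  # assumed coordinate names


def restore_coords(ds, master, coord_names=COORD_NAMES):
    """Attach coordinates taken from a single master dataset to ds."""
    return ds.assign_coords({name: master[name] for name in coord_names})


def open_without_coord_checks(paths, coord_names=COORD_NAMES):
    """Open many files while skipping the per-file coordinate variables."""
    # Read the coordinates once, from the first file only
    with xr.open_dataset(paths[0]) as first:
        master = {name: first[name].load() for name in coord_names}
    ds = xr.open_mfdataset(
        paths,
        drop_variables=list(coord_names),
        combine="nested",
        concat_dim="time",  # assumed name of the time dimension
    )
    return restore_coords(ds, master, coord_names)
```

This relies on the latitude and longitude being identical in every file, since nothing checks them once they are dropped.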
