
to_netcdf() fails to append to an existing file #1215

Closed
fmaussion opened this issue Jan 17, 2017 · 14 comments

@fmaussion
Member

The following code used to work well in v0.8.2:

import os
import xarray as xr

path = 'test.nc'
if os.path.exists(path):
    os.remove(path)
    
ds = xr.Dataset()
ds['dim'] = ('dim', [0, 1, 2])
ds['var1'] = ('dim', [10, 11, 12])
ds.to_netcdf(path)

ds = xr.Dataset()
ds['dim'] = ('dim', [0, 1, 2])
ds['var2'] = ('dim', [10, 11, 12])
ds.to_netcdf(path, 'a')

On master, it fails with:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-fce5f5e876aa> in <module>()
     14 ds['dim'] = ('dim', [0, 1, 2])
     15 ds['var2'] = ('dim', [10, 11, 12])
---> 16 ds.to_netcdf(path, 'a')

/home/mowglie/Documents/git/xarray/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding)
    927         from ..backends.api import to_netcdf
    928         return to_netcdf(self, path, mode, format=format, group=group,
--> 929                          engine=engine, encoding=encoding)
    930 
    931     def __unicode__(self):

/home/mowglie/Documents/git/xarray/xarray/backends/api.py in to_netcdf(dataset, path, mode, format, group, engine, writer, encoding)
    563     store = store_cls(path, mode, format, group, writer)
    564     try:
--> 565         dataset.dump_to_store(store, sync=sync, encoding=encoding)
    566         if isinstance(path, BytesIO):
    567             return path.getvalue()

/home/mowglie/Documents/git/xarray/xarray/core/dataset.py in dump_to_store(self, store, encoder, sync, encoding)
    873             variables, attrs = encoder(variables, attrs)
    874 
--> 875         store.store(variables, attrs, check_encoding)
    876         if sync:
    877             store.sync()

/home/mowglie/Documents/git/xarray/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set)
    219         cf_variables, cf_attrs = cf_encoder(variables, attributes)
    220         AbstractWritableDataStore.store(self, cf_variables, cf_attrs,
--> 221                                         check_encoding_set)
    222 
    223 

/home/mowglie/Documents/git/xarray/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set)
    194     def store(self, variables, attributes, check_encoding_set=frozenset()):
    195         self.set_attributes(attributes)
--> 196         self.set_variables(variables, check_encoding_set)
    197 
    198     def set_attributes(self, attributes):

/home/mowglie/Documents/git/xarray/xarray/backends/common.py in set_variables(self, variables, check_encoding_set)
    204             name = _encode_variable_name(vn)
    205             check = vn in check_encoding_set
--> 206             target, source = self.prepare_variable(name, v, check)
    207             self.writer.add(source, target)
    208 

/home/mowglie/Documents/git/xarray/xarray/backends/netCDF4_.py in prepare_variable(self, name, variable, check_encoding)
    293             endian='native',
    294             least_significant_digit=encoding.get('least_significant_digit'),
--> 295             fill_value=fill_value)
    296         nc4_var.set_auto_maskandscale(False)
    297 

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createVariable (netCDF4/_netCDF4.c:18740)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__init__ (netCDF4/_netCDF4.c:30713)()

RuntimeError: NetCDF: String match to name in use
@fmaussion
Member Author

An even simpler example:

import os
import xarray as xr

path = 'test.nc'
if os.path.exists(path):
    os.remove(path)
    
ds = xr.Dataset()
ds['dim'] = ('dim', [0, 1, 2])
ds['var1'] = ('dim', [10, 11, 12])
ds['var2'] = ('dim', [13, 14, 15])

ds[['var1']].to_netcdf(path)
ds[['var2']].to_netcdf(path, 'a')

@fmaussion
Member Author

Note that the problem occurs because the backend wants to write the 'dim' coordinate each time. At the second call, the coordinate variable already exists, and this raises the error.
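
For reference, the failure can be reproduced with netCDF4 alone. This is a minimal sketch, independent of xarray's internals, showing that createVariable raises once the name is taken:

import netCDF4

with netCDF4.Dataset('test.nc', 'w') as nc:
    nc.createDimension('dim', 3)
    nc.createVariable('dim', 'i8', ('dim',))  # coordinate variable

with netCDF4.Dataset('test.nc', 'a') as nc:
    # 'dim' is already in use, so this raises
    # RuntimeError: NetCDF: String match to name in use
    nc.createVariable('dim', 'i8', ('dim',))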

@shoyer shoyer added the bug label Jan 19, 2017
@shoyer shoyer added this to the 0.9.0 milestone Jan 19, 2017
@shoyer
Member

shoyer commented Jan 19, 2017

Good catch! Marking this as a bug.

@fmaussion
Member Author

I did a few tests: the regression happened in #1017.

Something about the way coordinate variables have changed means that the writing happens differently now. The question is whether this should be handled downstream (in the netCDF backend) or upstream (at the dataset level)?

@shoyer
Member

shoyer commented Jan 22, 2017

OK, I understand what's going on now.

Previously, we had a hack that skipped writing coordinate variables of the form [0, 1, ..., n-1] to disk, because these corresponded to default coordinates and would get created automatically. We disabled this hack as part of #1017 because it was no longer necessary.
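
The check behind that hack looked roughly like the sketch below; this is illustrative, not the exact pre-#1017 code:

import numpy as np

def looks_like_default_index(values):
    # A 1-D coordinate holding exactly 0, 1, ..., n-1 is indistinguishable
    # from the default index xarray used to create for dimensions without
    # coordinates, so the old hack skipped writing it to disk.
    values = np.asarray(values)
    return values.ndim == 1 and np.array_equal(values, np.arange(values.size))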

So although your example worked in v0.8.2, this small variation did not, because we would call netCDF4.Dataset.createVariable twice with the name 'dim':

ds = xr.Dataset()
ds['dim'] = ('dim', [1, 2, 3])
ds['var1'] = ('dim', [10, 11, 12])
ds.to_netcdf(path)

ds = xr.Dataset()
ds['dim'] = ('dim', [1, 2, 3])
ds['var2'] = ('dim', [10, 11, 12])
ds.to_netcdf(path, 'a')

I find it reassuring that this only worked in limited cases before, so it is unlikely that many users are depending on this functionality. It would be nice if mode='a' worked to append new variables to an existing netCDF file in the case of overlapping variables, but perhaps we don't need to fix this for v0.9.

My main concern with squeezing this in is that the proper behavior is not entirely clear and will need to go through some review:

  • Do we load existing variable values to check them for equality with the new values, or alternatively always skip or override them?
  • How do we handle cases where dims, attrs or encoding differ from the existing variable? Do we attempt to delete and replace the existing variable, update it in place, or raise an error?

@fmaussion
Member Author

fmaussion commented Jan 22, 2017

I see.

but perhaps we don't need to fix this for v0.9.

Agreed, but it would be good to get this working some day. For now I can see an easy workaround for my purposes.
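
A sketch of one such workaround (not necessarily the one meant above) is to merge in memory and rewrite the whole file; here `new_ds` is assumed to hold the variables to add:

import xarray as xr

with xr.open_dataset(path) as on_disk:
    # load into memory before the source file is closed
    merged = on_disk.merge(new_ds).load()
merged.to_netcdf(path, mode='w')  # rewrite the whole file instead of appending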

Another possibility would be to give the user control over whether existing variables should be ignored, overwritten, or cause an error when appending to a file.
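
A hypothetical API sketch of that idea; the on_conflict keyword below is illustrative only and does not exist in xarray:

ds.to_netcdf(path, mode='a', on_conflict='ignore')     # skip variables already on disk
ds.to_netcdf(path, mode='a', on_conflict='overwrite')  # replace the on-disk values
ds.to_netcdf(path, mode='a', on_conflict='error')      # raise, the current behavior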

@fmaussion fmaussion removed this from the 0.9.0 milestone Jan 22, 2017
@shoyer shoyer changed the title v0.9.0: to_netcdf() fails to append to an existing file to_netcdf() fails to append to an existing file Jan 25, 2017
@shoyer shoyer removed the bug label Jan 25, 2017
@jhamman
Member

jhamman commented Oct 4, 2017

@fmaussion and @shoyer - I have a use case that could use this. I'm wondering if either of you has looked at this any further since January?

If not, I'll propose a path forward that fits my use case and we can iterate on the details until we're satisfied:

Do we load existing variable values to check them for equality with the new values, or alternatively always skip or override them?

I don't think loading variables already written to disk is practical. My preference would be to only append missing variables/coordinates.

How do we handle cases where dims, attrs or encoding differ from the existing variable? Do we attempt to delete and replace the existing variable, update it in place, or raise an error?

differing dims: raise an error

I'd like to implement this, but keep it as simple as possible. A trivial use case like this should work:

import numpy as np
import pandas as pd
import xarray as xr

fname = 'out.nc'
dates = pd.date_range('2016-01-01', freq='1D', periods=45)
ds = xr.Dataset()
for var in ['A', 'B', 'C']:
    ds[var] = xr.DataArray(np.random.random((len(dates), 4, 5)),
                           dims=('time', 'x', 'y'), coords={'time': dates})

for var in ds.data_vars:
    ds[[var]].to_netcdf(fname, mode='a')

@fmaussion
Member Author

@jhamman no, I haven't looked into this any further (and I have also forgotten what my workaround at the time actually was).

I also think your example should work, and that we should never check for values on disk: if the dims and coordinate names match, write the variable and assume the coordinates are OK.

If the variable already exists in the file, match the behavior of netCDF4 (I actually don't know what netCDF4 does in that case).

@shoyer
Member

shoyer commented Oct 4, 2017 via email

@TWellman

Is it now possible to append to a netCDF file using xarray? I have some tabular data that is read into a dataframe in chunks from a large file; the goal is to write it in chunks to netCDF. If so, could someone please provide a simple code example? I am receiving a RuntimeError 'NetCDF: String match to name in use' as well. Thank you.

@jhamman
Member

jhamman commented Oct 12, 2017

@TWellman - not yet, see #1215.

@shoyer
Member

shoyer commented Oct 12, 2017

Is it now possible to append to a netCDF file using xarray?

No, it is not. This issue is about appending new variables to an existing netCDF file.

I think what you are looking for is to append along existing dimensions to a netCDF file. This is possible in the netCDF data model, but not yet supported by xarray. See #1398 for some discussion.

For these types of use cases, I would generally recommend writing a new netCDF file, and then loading everything afterwards using xarray.open_mfdataset.
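
A sketch of that pattern, assuming `chunks` is an iterable of Dataset objects and the 'chunk-*.nc' naming scheme is hypothetical:

import xarray as xr

# write each piece to its own file instead of appending
for i, chunk in enumerate(chunks):
    chunk.to_netcdf('chunk-{:03d}.nc'.format(i))

# then open everything together; the result is lazy and dask-backed
combined = xr.open_mfdataset('chunk-*.nc')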

@TWellman

Thank you! I will give xarray.open_mfdataset a shot. Just one question: is this approach memory-conservative? My reason for chunking in the first place is the large file size.

@shoyer
Member

shoyer commented Oct 12, 2017

I will give xarray.open_mfdataset a shot. Just one question: is this approach memory-conservative? My reason for chunking in the first place is the large file size.

Yes, open_mfdataset uses dask, which allows for streaming computation.
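
For example (continuing the hypothetical `combined` dataset from the sketch above, and assuming it has a 'time' dimension), a reduction is evaluated chunk by chunk rather than loading the full arrays:

# peak memory stays roughly bounded by the chunk size, not the file size
result = combined.mean(dim='time').compute()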
