
to_netcdf() fails to append to an existing file #1215

Closed
fmaussion opened this issue Jan 17, 2017 · 14 comments

@fmaussion
Member

The following code used to work well in v0.8.2:

import os
import xarray as xr

path = 'test.nc'
if os.path.exists(path):
    os.remove(path)
    
ds = xr.Dataset()
ds['dim'] = ('dim', [0, 1, 2])
ds['var1'] = ('dim', [10, 11, 12])
ds.to_netcdf(path)

ds = xr.Dataset()
ds['dim'] = ('dim', [0, 1, 2])
ds['var2'] = ('dim', [10, 11, 12])
ds.to_netcdf(path, 'a')

On master, it fails with:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-fce5f5e876aa> in <module>()
     14 ds['dim'] = ('dim', [0, 1, 2])
     15 ds['var2'] = ('dim', [10, 11, 12])
---> 16 ds.to_netcdf(path, 'a')

/home/mowglie/Documents/git/xarray/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding)
    927         from ..backends.api import to_netcdf
    928         return to_netcdf(self, path, mode, format=format, group=group,
--> 929                          engine=engine, encoding=encoding)
    930 
    931     def __unicode__(self):

/home/mowglie/Documents/git/xarray/xarray/backends/api.py in to_netcdf(dataset, path, mode, format, group, engine, writer, encoding)
    563     store = store_cls(path, mode, format, group, writer)
    564     try:
--> 565         dataset.dump_to_store(store, sync=sync, encoding=encoding)
    566         if isinstance(path, BytesIO):
    567             return path.getvalue()

/home/mowglie/Documents/git/xarray/xarray/core/dataset.py in dump_to_store(self, store, encoder, sync, encoding)
    873             variables, attrs = encoder(variables, attrs)
    874 
--> 875         store.store(variables, attrs, check_encoding)
    876         if sync:
    877             store.sync()

/home/mowglie/Documents/git/xarray/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set)
    219         cf_variables, cf_attrs = cf_encoder(variables, attributes)
    220         AbstractWritableDataStore.store(self, cf_variables, cf_attrs,
--> 221                                         check_encoding_set)
    222 
    223 

/home/mowglie/Documents/git/xarray/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set)
    194     def store(self, variables, attributes, check_encoding_set=frozenset()):
    195         self.set_attributes(attributes)
--> 196         self.set_variables(variables, check_encoding_set)
    197 
    198     def set_attributes(self, attributes):

/home/mowglie/Documents/git/xarray/xarray/backends/common.py in set_variables(self, variables, check_encoding_set)
    204             name = _encode_variable_name(vn)
    205             check = vn in check_encoding_set
--> 206             target, source = self.prepare_variable(name, v, check)
    207             self.writer.add(source, target)
    208 

/home/mowglie/Documents/git/xarray/xarray/backends/netCDF4_.py in prepare_variable(self, name, variable, check_encoding)
    293             endian='native',
    294             least_significant_digit=encoding.get('least_significant_digit'),
--> 295             fill_value=fill_value)
    296         nc4_var.set_auto_maskandscale(False)
    297 

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createVariable (netCDF4/_netCDF4.c:18740)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__init__ (netCDF4/_netCDF4.c:30713)()

RuntimeError: NetCDF: String match to name in use
@fmaussion
Member Author

An even simpler example:

import os
import xarray as xr

path = 'test.nc'
if os.path.exists(path):
    os.remove(path)
    
ds = xr.Dataset()
ds['dim'] = ('dim', [0, 1, 2])
ds['var1'] = ('dim', [10, 11, 12])
ds['var2'] = ('dim', [13, 14, 15])

ds[['var1']].to_netcdf(path)
ds[['var2']].to_netcdf(path, 'a')

@fmaussion
Member Author

Note that the problem occurs because the backend wants to write the 'dim' coordinate each time. At the second call, the coordinate variable already exists, and this raises the error.
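
For reference, the failure can be reproduced with netCDF4 alone. This is a minimal sketch, independent of xarray's internals, showing that createVariable raises once the name is taken:

import netCDF4

with netCDF4.Dataset('test.nc', 'w') as nc:
    nc.createDimension('dim', 3)
    nc.createVariable('dim', 'i8', ('dim',))  # coordinate variable

with netCDF4.Dataset('test.nc', 'a') as nc:
    # 'dim' is already in use, so this raises
    # RuntimeError: NetCDF: String match to name in use
    nc.createVariable('dim', 'i8', ('dim',))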

@shoyer shoyer added the bug label Jan 19, 2017
@shoyer shoyer added this to the 0.9.0 milestone Jan 19, 2017
@shoyer
Member

shoyer commented Jan 19, 2017

Good catch! Marking this as a bug.

@fmaussion
Member Author

I did a few tests: the regression happened in #1017.

Something about the way coordinate variables have changed means that the writing happens differently now. The question is whether this should be handled downstream (in the netCDF backend) or upstream (at the dataset level)?

@shoyer
Member

shoyer commented Jan 22, 2017

OK, I understand what's going on now.

Previously, we had a hack that skipped writing coordinate variables of the form [0, 1, ..., n-1] to disk, because these corresponded to default coordinates and would get created automatically. We disabled this hack as part of #1017 because it was no longer necessary.
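
The check behind that hack looked roughly like the sketch below; this is illustrative, not the exact pre-#1017 code:

import numpy as np

def looks_like_default_index(values):
    # A 1-D coordinate holding exactly 0, 1, ..., n-1 is indistinguishable
    # from the default index xarray used to create for dimensions without
    # coordinates, so the old hack skipped writing it to disk.
    values = np.asarray(values)
    return values.ndim == 1 and np.array_equal(values, np.arange(values.size))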

So although your example worked in v0.8.2, this small variation did not, because we would call netCDF4.Dataset.createVariable twice with the name 'dim':

ds = xr.Dataset()
ds['dim'] = ('dim', [1, 2, 3])
ds['var1'] = ('dim', [10, 11, 12])
ds.to_netcdf(path)

ds = xr.Dataset()
ds['dim'] = ('dim', [1, 2, 3])
ds['var2'] = ('dim', [10, 11, 12])
ds.to_netcdf(path, 'a')

I find it reassuring that this only worked in limited cases before, so it is unlikely that many users are depending on this functionality. It would be nice if mode='a' worked to append new variables to an existing netCDF file in the case of overlapping variables, but perhaps we don't need to fix this for v0.9.

My main concern with squeezing this in is that the proper behavior is not entirely clear and will need to go through some review:

  • Do we load existing variable values to check them for equality with the new values, or alternatively always skip or override them?
  • How do we handle cases where dims, attrs or encoding differ from the existing variable? Do we attempt to delete and replace the existing variable, update it in place, or raise an error?

@fmaussion
Member Author

fmaussion commented Jan 22, 2017

I see.

but perhaps we don't need to fix this for v0.9.

Agreed, but it would be good to get this working some day. For now I can see an easy workaround for my purposes.
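
A sketch of one such workaround (not necessarily the one meant above) is to merge in memory and rewrite the whole file; here `new_ds` is assumed to hold the variables to add:

import xarray as xr

with xr.open_dataset(path) as on_disk:
    # load into memory before the source file is closed
    merged = on_disk.merge(new_ds).load()
merged.to_netcdf(path, mode='w')  # rewrite the whole file instead of appending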

Another possibility would be to give the user control over whether existing variables should be ignored, overwritten, or cause an error when appending to a file.
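
A hypothetical API sketch of that idea; the on_conflict keyword below is illustrative only and does not exist in xarray:

ds.to_netcdf(path, mode='a', on_conflict='ignore')     # skip variables already on disk
ds.to_netcdf(path, mode='a', on_conflict='overwrite')  # replace the on-disk values
ds.to_netcdf(path, mode='a', on_conflict='error')      # raise, the current behavior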

@fmaussion fmaussion removed this from the 0.9.0 milestone Jan 22, 2017
@shoyer shoyer changed the title v0.9.0: to_netcdf() fails to append to an existing file to_netcdf() fails to append to an existing file Jan 25, 2017
@shoyer shoyer removed the bug label Jan 25, 2017
@jhamman
Member

jhamman commented Oct 4, 2017

@fmaussion and @shoyer - I have a use case that could use this. I'm wondering if either of you has looked at this any further since January?

If not, I'll propose a path forward that fits my use case and we can iterate on the details until we're satisfied:

Do we load existing variable values to check them for equality with the new values, or alternatively always skip or override them?

I don't think loading variables already written to disk is practical. My preference would be to only append missing variables/coordinates.

How do we handle cases where dims, attrs or encoding differ from the existing variable? Do we attempt to delete and replace the existing variable, update it in place, or raise an error?

differing dims: raise an error

I'd like to implement this, but keep it as simple as possible. A trivial use case like this should work:

import numpy as np
import pandas as pd
import xarray as xr

fname = 'out.nc'
dates = pd.date_range('2016-01-01', freq='1D', periods=45)
ds = xr.Dataset()
for var in ['A', 'B', 'C']:
    ds[var] = xr.DataArray(np.random.random((len(dates), 4, 5)),
                           dims=('time', 'x', 'y'), coords={'time': dates})

for var in ds.data_vars:
    ds[[var]].to_netcdf(fname, mode='a')

@fmaussion
Member Author

@jhamman no, I haven't looked into this any further (and I have also forgotten what my workaround at the time actually was).

I also think your example should work, and that we should never check for values on disk: if the dims and coordinate names match, write the variable and assume the coordinates are OK.

If the variable already exists in the file, match the behavior of netCDF4 (I actually don't know what netCDF4 does in that case).

@shoyer
Member

shoyer commented Oct 4, 2017 via email

@TWellman

Is it now possible to append to a netCDF file using xarray? I have some tabular data that is read into a dataframe in chunks from a large file; the goal is to write it in chunks to netCDF. If so, could someone please provide a simple code example? I am receiving a RuntimeError 'NetCDF: String match to name in use' as well. Thank you.

@jhamman
Member

jhamman commented Oct 12, 2017

@TWellman - not yet, see #1215.

@shoyer
Member

shoyer commented Oct 12, 2017

Is it now possible to append to a netCDF file using xarray?

No, it is not. This issue is about appending new variables to an existing netCDF file.

I think what you are looking for is to append along existing dimensions to a netCDF file. This is possible in the netCDF data model, but not yet supported by xarray. See #1398 for some discussion.

For these types of use cases, I would generally recommend writing a new netCDF file, and then loading everything afterwards using xarray.open_mfdataset.
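
A sketch of that pattern, assuming `chunks` is an iterable of Dataset objects and the 'chunk-*.nc' naming scheme is hypothetical:

import xarray as xr

# write each piece to its own file instead of appending
for i, chunk in enumerate(chunks):
    chunk.to_netcdf('chunk-{:03d}.nc'.format(i))

# then open everything together; the result is lazy and dask-backed
combined = xr.open_mfdataset('chunk-*.nc')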

@TWellman

Thank you! I will give xarray.open_mfdataset a shot. Just one question: is this approach memory-conservative? My reason for chunking in the first place is the large file size.

@shoyer
Member

shoyer commented Oct 12, 2017

I will give xarray.open_mfdataset a shot. Just one question: is this approach memory-conservative? My reason for chunking in the first place is the large file size.

Yes, open_mfdataset uses dask, which allows for streaming computation.
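
For example (continuing the hypothetical `combined` dataset from the sketch above, and assuming it has a 'time' dimension), a reduction is evaluated chunk by chunk rather than loading the full arrays:

# peak memory stays roughly bounded by the chunk size, not the file size
result = combined.mean(dim='time').compute()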
