Filters for object dtype #208

Closed
alimanfoo opened this issue Nov 28, 2017 · 18 comments · Fixed by #212

@alimanfoo
Member

Currently, using an object dtype can be unsafe: without a filter to encode the objects (currently either Pickle or MsgPack), you can get segfaults.

One option would be to check for an object dtype with no user-provided filters, and in that case insert either MsgPack or Pickle into the filters. The question is what to do if the user also provides some filters.
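
A minimal sketch of what that check could look like, assuming the numcodecs MsgPack and Pickle codecs; normalize_filters is a hypothetical helper used for illustration, not zarr's actual API:

import warnings
import numpy as np
from numcodecs import MsgPack, Pickle

def normalize_filters(dtype, filters):
    # hypothetical helper: ensure object arrays always get an object codec
    if np.dtype(dtype) == object:
        if not filters:
            # no filters given: fall back to a default object codec
            filters = [MsgPack()]
        elif not any(isinstance(f, (MsgPack, Pickle)) for f in filters):
            # filters given, but none is a recognised object codec
            warnings.warn('filters do not appear to include an object codec; '
                          'storing an object array may fail or segfault')
    return filters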

@alimanfoo alimanfoo added this to the v2.2 milestone Nov 28, 2017
@jakirkham
Member

I guess raising an error could already be an improvement over a segfault.

@alimanfoo
Member Author

Yes, although what do you check for to raise the error? No filters provided by the user?

@jakirkham
Member

jakirkham commented Nov 28, 2017

I guess check they are using MsgPack or Pickle at least. Though IDK if there are other ways they could avoid the issue.

@alimanfoo
Member Author

alimanfoo commented Nov 28, 2017 via email

@jakirkham
Member

Maybe we could soften it to a warning? Guessing the segmentation fault occurs when storing data.

@shoyer
Contributor

shoyer commented Nov 28, 2017

I'd just like to stress that pickle should never be used without an explicit opt-in, because it is a serious security vulnerability: unpickling can execute arbitrary code. Untrusted data is somewhat rare in scientific computing, but I would still prefer not to build in these sorts of vulnerabilities.

Message pack is more palatable as a default, but it's not part of the zarr spec, so it would still be surprising to me to see data encoded that way.

I agree that segfaults are bad, but the preferred alternative is probably to raise an error in Python, not to assume any particular codec (all of which have tradeoffs).
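
For comparison, explicit opt-in looks something like the following under the pre-2.2 API, where an object codec is supplied via filters (a sketch; MsgPack shown, with Pickle reserved for data you explicitly trust):

import zarr
from numcodecs import MsgPack  # or Pickle, only for trusted data

# explicitly opt in to an object codec rather than relying on any default
z = zarr.zeros(5, chunks=2, dtype=object, filters=[MsgPack()])
z[0] = 'foo'
z[1] = 'bar'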

@jakirkham
Member

Maybe we could get the best of both worlds. Namely, create a special ZarrObjectWarning (or similar) that we issue if we detect objects might be serialized. We can then use simplefilter to set the default behavior of this warning to "error". Then users get a traceback by default if they don't add MsgPack or Pickle filters (as opposed to a segfault). Users are still free to turn this back into a warning (not an error) if they want to try some custom filter; it will just error by default.
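
A rough sketch of that idea (ZarrObjectWarning and check_object_codec are hypothetical names used for illustration, not part of zarr):

import warnings
import numpy as np

class ZarrObjectWarning(UserWarning):
    """Warning issued when an object array has no recognised object codec."""

# default: turn the warning into an error so users see a traceback
# instead of a segfault
warnings.simplefilter('error', ZarrObjectWarning)

def check_object_codec(dtype, filters):
    # hypothetical check run at array creation time
    if np.dtype(dtype) == object and not filters:
        warnings.warn('no object codec configured for object array; '
                      'add a MsgPack or Pickle filter', ZarrObjectWarning)

# users experimenting with a custom filter can downgrade it back to a warning:
# warnings.simplefilter('always', ZarrObjectWarning)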

@mrocklin
Contributor

I'm in favor of msgpack by default. It's widely used and decently fast.

@mrocklin
Contributor

Or, if we're very conservative, would JSON be a sensible default?

@alimanfoo
Member Author

Found another way that arrays with object dtype are currently broken if no filters are provided:

In [1]: import zarr

In [2]: z = zarr.zeros(5, chunks=2, dtype=object)

In [3]: z[0] = 'foo'

In [5]: z[1] = 'bar'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-98f6ee1706b9> in <module>()
----> 1 z[1] = 'bar'

~/src/github/alimanfoo/zarr/zarr/core.py in __setitem__(self, selection, value)
   1095 
   1096         fields, selection = pop_fields(selection)
-> 1097         self.set_basic_selection(selection, value, fields=fields)
   1098 
   1099     def set_basic_selection(self, selection, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in set_basic_selection(self, selection, value, fields)
   1190             return self._set_basic_selection_zd(selection, value, fields=fields)
   1191         else:
-> 1192             return self._set_basic_selection_nd(selection, value, fields=fields)
   1193 
   1194     def set_orthogonal_selection(self, selection, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in _set_basic_selection_nd(self, selection, value, fields)
   1481         indexer = BasicIndexer(selection, self)
   1482 
-> 1483         self._set_selection(indexer, value, fields=fields)
   1484 
   1485     def _set_selection(self, indexer, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in _set_selection(self, indexer, value, fields)
   1523 
   1524             # put data
-> 1525             self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
   1526 
   1527     def _chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection,

~/src/github/alimanfoo/zarr/zarr/core.py in _chunk_setitem(self, chunk_coords, chunk_selection, value, fields)
   1627         with lock:
   1628             self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
-> 1629                                        fields=fields)
   1630 
   1631     def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields)
   1686 
   1687                 # decode chunk
-> 1688                 chunk = self._decode_chunk(cdata)
   1689                 if not chunk.flags.writeable:
   1690                     chunk = chunk.copy(order='K')

~/src/github/alimanfoo/zarr/zarr/core.py in _decode_chunk(self, cdata)
   1724             chunk = chunk.view(self._dtype)
   1725         else:
-> 1726             chunk = np.frombuffer(chunk, self._dtype)
   1727 
   1728         # reshape

ValueError: cannot create an OBJECT array from memory buffer

@alimanfoo
Member Author

More brokenness:

In [23]: z = zarr.zeros(5, chunks=2, dtype=object)

In [24]: z[0] = b'foo'

In [25]: z[:]
Out[25]: Segmentation fault (core dumped)

@alimanfoo
Member Author

Stumbled on some work in fastparquet; it looks like there are a lot of good ideas there for how to handle object encoding:

https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py
https://github.com/dask/fastparquet/blob/master/fastparquet/speedups.pyx

@mrocklin
Contributor

mrocklin commented Nov 30, 2017 via email

@martindurant
Member

Is there any way I can help?

@alimanfoo
Member Author

alimanfoo commented Nov 30, 2017 via email

@alimanfoo alimanfoo mentioned this issue Nov 30, 2017
@alimanfoo alimanfoo self-assigned this Nov 30, 2017
@alimanfoo alimanfoo added the "in progress" (Someone is currently working on this) label Nov 30, 2017
@alimanfoo
Member Author

OK, I have a proposal in #212 which hopefully addresses this. Comments welcome.

@alimanfoo
Member Author

Just to mention, I ran some benchmarks for codecs that can encode variable-length text strings, comparing MsgPack, JSON, Pickle, and fastparquet's UTF8 encoder. MsgPack comes out well. The fastparquet encoder is fastest for encoding, but not by a huge amount, and decoding speed is similar to MsgPack. I had been considering adding a parquet UTF8 codec to numcodecs based on the fastparquet implementation (xref https://github.com/alimanfoo/numcodecs/issues/50) but given that the performance gain is not massive over MsgPack I think I'll hold off for the moment. Happy to reconsider if there's interest.
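
For anyone wanting to reproduce this kind of comparison, here is a rough sketch using the numcodecs codec API (the data and the timing loop are illustrative, not the benchmark actually used):

import timeit
import numpy as np
from numcodecs import JSON, MsgPack, Pickle

# an object array of variable-length text strings
data = np.array(['foo', 'bar', 'baz' * 10] * 10000, dtype=object)

for codec in (MsgPack(), JSON(), Pickle()):
    enc = codec.encode(data)
    t_enc = timeit.timeit(lambda: codec.encode(data), number=10)
    t_dec = timeit.timeit(lambda: codec.decode(enc), number=10)
    print('%s encode: %.3fs decode: %.3fs' % (type(codec).__name__, t_enc, t_dec))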

@alimanfoo alimanfoo removed the "in progress" (Someone is currently working on this) label Dec 6, 2017
@jakirkham
Member

Adding this for posterity.

Originally the code in this comment passed when it shouldn't have. When running the same code with 2.2.0, we get the following error as expected.

In [1]: import numpy as np
   ...: import zarr
   ...: 
   ...: values = np.array([b'ab', b'cdef', np.nan], dtype=object)
   ...: zgs = zarr.open_group(store='zarr_directory') 
   ...: zgs.create('x', shape=values.shape, dtype=values.dtype)
   ...: zgs.x[:] = values
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-685b49595047> in <module>()
      4 values = np.array([b'ab', b'cdef', np.nan], dtype=object)
      5 zgs = zarr.open_group(store='zarr_directory')
----> 6 zgs.create('x', shape=values.shape, dtype=values.dtype)
      7 zgs.x[:] = values

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in create(self, name, **kwargs)
    849         """Create an array. Keyword arguments as per
    850         :func:`zarr.creation.create`."""
--> 851         return self._write_op(self._create_nosync, name, **kwargs)
    852 
    853     def _create_nosync(self, name, **kwargs):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in _write_op(self, f, *args, **kwargs)
    628 
    629         with lock:
--> 630             return f(*args, **kwargs)
    631 
    632     def create_group(self, name, overwrite=False):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in _create_nosync(self, name, **kwargs)
    856         kwargs.setdefault('cache_attrs', self.attrs.cache)
    857         return create(store=self._store, path=path, chunk_store=self._chunk_store,
--> 858                       **kwargs)
    859 
    860     def empty(self, name, **kwargs):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, **kwargs)
    117     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    118                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
--> 119                chunk_store=chunk_store, filters=filters, object_codec=object_codec)
    120 
    121     # instantiate array

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    311                          order=order, overwrite=overwrite, path=path,
    312                          chunk_store=chunk_store, filters=filters,
--> 313                          object_codec=object_codec)
    314 
    315 

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    365             if not filters:
    366                 # there are no filters so we can be sure there is no object codec
--> 367                 raise ValueError('missing object_codec for object array')
    368             else:
    369                 # one of the filters may be an object codec, issue a warning rather

ValueError: missing object_codec for object array
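
With 2.2.0 the fix is to supply an object codec explicitly, via the object_codec argument shown in the traceback above. For example (a sketch; Pickle is used here because the values mix bytes and NaN, and it is only appropriate for trusted data):

import numpy as np
import zarr
from numcodecs import Pickle  # explicit opt-in; only for trusted data

values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group(store='zarr_directory')
zgs.create('x', shape=values.shape, dtype=values.dtype, object_codec=Pickle())
zgs.x[:] = values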
