Filters for object dtype #208

Closed
alimanfoo opened this issue Nov 28, 2017 · 18 comments · Fixed by #212

@alimanfoo
Member

Currently, using an object dtype can be unsafe: without a filter to encode the objects (currently either Pickle or MsgPack), you can get segfaults.

One option would be to check for an object dtype with no user-provided filters, and in that case insert either MsgPack or Pickle into the filters. The question is what to do if the user also provides some filters.
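
A minimal sketch of what that check could look like, assuming the numcodecs MsgPack and Pickle codecs; normalize_filters is a hypothetical helper used for illustration, not zarr's actual API:

import warnings
import numpy as np
from numcodecs import MsgPack, Pickle

def normalize_filters(dtype, filters):
    # hypothetical helper: ensure object arrays always get an object codec
    if np.dtype(dtype) == object:
        if not filters:
            # no filters given: fall back to a default object codec
            filters = [MsgPack()]
        elif not any(isinstance(f, (MsgPack, Pickle)) for f in filters):
            # filters given, but none is a recognised object codec
            warnings.warn('filters do not appear to include an object codec; '
                          'storing an object array may fail or segfault')
    return filters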

@alimanfoo alimanfoo added this to the v2.2 milestone Nov 28, 2017
@jakirkham
Member

I guess raising an error could already be an improvement over a segfault.

@alimanfoo
Member Author

Yes, although what do you check for to raise the error? No filters provided by the user?

@jakirkham
Member

jakirkham commented Nov 28, 2017

I guess check they are using MsgPack or Pickle at least. Though IDK if there are other ways they could avoid the issue.

@alimanfoo
Member Author

alimanfoo commented Nov 28, 2017 via email

@jakirkham
Member

Maybe we could soften it to a warning? Guessing the segmentation fault occurs when storing data.

@shoyer
Contributor

shoyer commented Nov 28, 2017

I'd just like to stress that pickle should never be used without an explicit opt-in, because it is a serious security vulnerability: unpickling can execute arbitrary code. Untrusted data is somewhat rare in scientific computing, but I would still prefer not to build in these sorts of vulnerabilities.

Message pack is more palatable as a default, but it's not part of the zarr spec, so it would still be surprising to me to see data encoded that way.

I agree that segfaults are bad, but the preferred alternative is probably to raise an error in Python, not to assume any particular codec (all of which have tradeoffs).
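
For comparison, explicit opt-in looks something like the following under the pre-2.2 API, where an object codec is supplied via filters (a sketch; MsgPack shown, with Pickle reserved for data you explicitly trust):

import zarr
from numcodecs import MsgPack  # or Pickle, only for trusted data

# explicitly opt in to an object codec rather than relying on any default
z = zarr.zeros(5, chunks=2, dtype=object, filters=[MsgPack()])
z[0] = 'foo'
z[1] = 'bar'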

@jakirkham
Member

Maybe we could get the best of both worlds. Namely, create a special ZarrObjectWarning (or similar) that we issue if we detect objects might be serialized. We can then use simplefilter to set the default behavior of this warning to "error". Then users get a traceback by default if they don't add MsgPack or Pickle filters (as opposed to a segfault). Users are still free to turn this back into a warning (not an error) if they want to try some custom filter; it will just error by default.
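
A rough sketch of that idea (ZarrObjectWarning and check_object_codec are hypothetical names used for illustration, not part of zarr):

import warnings
import numpy as np

class ZarrObjectWarning(UserWarning):
    """Warning issued when an object array has no recognised object codec."""

# default: turn the warning into an error so users see a traceback
# instead of a segfault
warnings.simplefilter('error', ZarrObjectWarning)

def check_object_codec(dtype, filters):
    # hypothetical check run at array creation time
    if np.dtype(dtype) == object and not filters:
        warnings.warn('no object codec configured for object array; '
                      'add a MsgPack or Pickle filter', ZarrObjectWarning)

# users experimenting with a custom filter can downgrade it back to a warning:
# warnings.simplefilter('always', ZarrObjectWarning)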

@mrocklin
Contributor

I'm in favor of msgpack by default. It's widely used and decently fast.

@mrocklin
Contributor

Or, if we're very conservative, would JSON be a sensible default?

@alimanfoo
Member Author

Found another way that arrays with object dtype are currently broken if no filters are provided:

In [1]: import zarr

In [2]: z = zarr.zeros(5, chunks=2, dtype=object)

In [3]: z[0] = 'foo'

In [5]: z[1] = 'bar'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-98f6ee1706b9> in <module>()
----> 1 z[1] = 'bar'

~/src/github/alimanfoo/zarr/zarr/core.py in __setitem__(self, selection, value)
   1095 
   1096         fields, selection = pop_fields(selection)
-> 1097         self.set_basic_selection(selection, value, fields=fields)
   1098 
   1099     def set_basic_selection(self, selection, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in set_basic_selection(self, selection, value, fields)
   1190             return self._set_basic_selection_zd(selection, value, fields=fields)
   1191         else:
-> 1192             return self._set_basic_selection_nd(selection, value, fields=fields)
   1193 
   1194     def set_orthogonal_selection(self, selection, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in _set_basic_selection_nd(self, selection, value, fields)
   1481         indexer = BasicIndexer(selection, self)
   1482 
-> 1483         self._set_selection(indexer, value, fields=fields)
   1484 
   1485     def _set_selection(self, indexer, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in _set_selection(self, indexer, value, fields)
   1523 
   1524             # put data
-> 1525             self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
   1526 
   1527     def _chunk_getitem(self, chunk_coords, chunk_selection, out, out_selection,

~/src/github/alimanfoo/zarr/zarr/core.py in _chunk_setitem(self, chunk_coords, chunk_selection, value, fields)
   1627         with lock:
   1628             self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
-> 1629                                        fields=fields)
   1630 
   1631     def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=None):

~/src/github/alimanfoo/zarr/zarr/core.py in _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields)
   1686 
   1687                 # decode chunk
-> 1688                 chunk = self._decode_chunk(cdata)
   1689                 if not chunk.flags.writeable:
   1690                     chunk = chunk.copy(order='K')

~/src/github/alimanfoo/zarr/zarr/core.py in _decode_chunk(self, cdata)
   1724             chunk = chunk.view(self._dtype)
   1725         else:
-> 1726             chunk = np.frombuffer(chunk, self._dtype)
   1727 
   1728         # reshape

ValueError: cannot create an OBJECT array from memory buffer

@alimanfoo
Member Author

More brokenness:

In [23]: z = zarr.zeros(5, chunks=2, dtype=object)

In [24]: z[0] = b'foo'

In [25]: z[:]
Out[25]: Segmentation fault (core dumped)

@alimanfoo
Member Author

Stumbled on some work in fastparquet; it looks like there are a lot of good ideas there for how to handle object encoding:

https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py
https://github.com/dask/fastparquet/blob/master/fastparquet/speedups.pyx

@mrocklin
Contributor

mrocklin commented Nov 30, 2017 via email

@martindurant
Member

Is there any way I can help?

@alimanfoo
Member Author

alimanfoo commented Nov 30, 2017 via email

@alimanfoo alimanfoo mentioned this issue Nov 30, 2017
@alimanfoo alimanfoo self-assigned this Nov 30, 2017
@alimanfoo alimanfoo added the "in progress" (Someone is currently working on this) label Nov 30, 2017
@alimanfoo
Member Author

OK, I have a proposal in #212 which hopefully addresses this. Comments welcome.

@alimanfoo
Member Author

Just to mention, I ran some benchmarks for codecs that can encode variable-length text strings, comparing MsgPack, JSON, Pickle, and fastparquet's UTF8 encoder. MsgPack comes out well. The fastparquet encoder is fastest for encoding, but not by a huge amount, and decoding speed is similar to MsgPack. I had been considering adding a parquet UTF8 codec to numcodecs based on the fastparquet implementation (xref https://github.com/alimanfoo/numcodecs/issues/50) but given that the performance gain is not massive over MsgPack I think I'll hold off for the moment. Happy to reconsider if there's interest.
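
For anyone wanting to reproduce this kind of comparison, here is a rough sketch using the numcodecs codec API (the data and the timing loop are illustrative, not the benchmark actually used):

import timeit
import numpy as np
from numcodecs import JSON, MsgPack, Pickle

# an object array of variable-length text strings
data = np.array(['foo', 'bar', 'baz' * 10] * 10000, dtype=object)

for codec in (MsgPack(), JSON(), Pickle()):
    enc = codec.encode(data)
    t_enc = timeit.timeit(lambda: codec.encode(data), number=10)
    t_dec = timeit.timeit(lambda: codec.decode(enc), number=10)
    print('%s encode: %.3fs decode: %.3fs' % (type(codec).__name__, t_enc, t_dec))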

@alimanfoo alimanfoo removed the "in progress" (Someone is currently working on this) label Dec 6, 2017
@jakirkham
Member

Adding this for posterity.

Originally the code in this comment passed when it shouldn't have. When running the same code with 2.2.0, we get the following error as expected.

In [1]: import numpy as np
   ...: import zarr
   ...: 
   ...: values = np.array([b'ab', b'cdef', np.nan], dtype=object)
   ...: zgs = zarr.open_group(store='zarr_directory') 
   ...: zgs.create('x', shape=values.shape, dtype=values.dtype)
   ...: zgs.x[:] = values
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-685b49595047> in <module>()
      4 values = np.array([b'ab', b'cdef', np.nan], dtype=object)
      5 zgs = zarr.open_group(store='zarr_directory')
----> 6 zgs.create('x', shape=values.shape, dtype=values.dtype)
      7 zgs.x[:] = values

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in create(self, name, **kwargs)
    849         """Create an array. Keyword arguments as per
    850         :func:`zarr.creation.create`."""
--> 851         return self._write_op(self._create_nosync, name, **kwargs)
    852 
    853     def _create_nosync(self, name, **kwargs):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in _write_op(self, f, *args, **kwargs)
    628 
    629         with lock:
--> 630             return f(*args, **kwargs)
    631 
    632     def create_group(self, name, overwrite=False):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in _create_nosync(self, name, **kwargs)
    856         kwargs.setdefault('cache_attrs', self.attrs.cache)
    857         return create(store=self._store, path=path, chunk_store=self._chunk_store,
--> 858                       **kwargs)
    859 
    860     def empty(self, name, **kwargs):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, **kwargs)
    117     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    118                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
--> 119                chunk_store=chunk_store, filters=filters, object_codec=object_codec)
    120 
    121     # instantiate array

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    311                          order=order, overwrite=overwrite, path=path,
    312                          chunk_store=chunk_store, filters=filters,
--> 313                          object_codec=object_codec)
    314 
    315 

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    365             if not filters:
    366                 # there are no filters so we can be sure there is no object codec
--> 367                 raise ValueError('missing object_codec for object array')
    368             else:
    369                 # one of the filters may be an object codec, issue a warning rather

ValueError: missing object_codec for object array
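
With 2.2.0 the fix is to supply an object codec explicitly, via the object_codec argument shown in the traceback above. For example (a sketch; Pickle is used here because the values mix bytes and NaN, and it is only appropriate for trusted data):

import numpy as np
import zarr
from numcodecs import Pickle  # explicit opt-in; only for trusted data

values = np.array([b'ab', b'cdef', np.nan], dtype=object)
zgs = zarr.open_group(store='zarr_directory')
zgs.create('x', shape=values.shape, dtype=values.dtype, object_codec=Pickle())
zgs.x[:] = values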
