Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Numpy structured arrays containing an object #422

Closed
wants to merge 1 commit into from
Closed

Support Numpy structured arrays containing an object #422

wants to merge 1 commit into from

Conversation

ombschervister
Copy link

@ombschervister ombschervister commented Mar 25, 2019

partial fix #418

TODO:

  • Add unit tests and/or doctests in docstrings
  • [N/A] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [N/A] New/modified features documented in docs/tutorial.rst
  • [N/A] Changes documented in docs/release.rst
  • [N/A] Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@ombschervister
Copy link
Author

Code below works with numpy 1.14.0, but does not work with 1.13.3:

t = np.dtype([('s', object), ('i', np.int)])
out = np.zeros((), dtype=t)
out[()] = np.array((b'1', 1), dtype=t)

So zarr will require numpy 1.14.0 to support structured arrays containing object:

t = np.dtype([('s', object), ('i', np.int)])
za = zarr.empty(shape=(3,), dtype=t, object_codec=numcodecs.Pickle())
print(za[0])   # here raise ValueError on old numpy

@ombschervister
Copy link
Author

Tests fail because of numpy old version

@pep8speaks
Copy link

pep8speaks commented Mar 25, 2019

Hello @ombschervister! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-09 14:24:39 UTC

@ombschervister
Copy link
Author

ombschervister commented Mar 25, 2019

Structured arrays tests are ignored if numpy version is lower than 1.14.0.

@alimanfoo
Copy link
Member

Thanks @ombschervister, this looks good to me, no objections to adopting this change. Could you resolve conflicts and also add an item to the release notes (docs/release.rst). I'm about to go offline for a couple of weeks but hopefully someone else from @zarr-developers/core-devs can help finalise this.

@ombschervister
Copy link
Author

Thanks @ombschervister, this looks good to me, no objections to adopting this change. Could you resolve conflicts and also add an item to the release notes (docs/release.rst). I'm about to go offline for a couple of weeks but hopefully someone else from @zarr-developers/core-devs can help finalise this.

done

@ivirshup
Copy link

What's the status of this PR? If it's working, I think it might save me some work in fixing this: scverse/scanpy#832

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @ombschervister! When you get a moment, could you please update to resolve the merge conflict (sorry for such a delayed response). Based on #422 (comment) I think this should be good to merge once the conflicts have been resolved.

@ivirshup
Copy link

ivirshup commented Nov 7, 2019

@jrbourbeau If I could help speed this along by resolving the conflict, I'd be happy to give it a shot.

It just looks like conflicts in the imports, so I think it might be a quick fix.

@jrbourbeau
Copy link
Member

jrbourbeau commented Nov 7, 2019

Thanks for volunteering @ivirshup. I think only @ombschervister, or others with write access to zarr-python (e.g. @jakirkham, @alimanfoo), are able to push to this branch. Perhaps we should wait a little longer for @ombschervister or someone with the proper permissions to fix things

@ivirshup
Copy link

ivirshup commented Nov 8, 2019

No worries, I was thinking I'd branch from this, merge master, then open a PR from the new branch. Something like:

Master          + - + - -  ...  - +       +
                     \             \     /
This branch           + - ... - +   \   /
                                 \   \ /
My branch                         + - +

Looks like that might be mostly solved though. Thanks for the update @ombschervister!

Any chance there's an eta on when this might get into a release? I'm trying to figure out how much effort I should put into working around this.

@jakirkham
Copy link
Member

Go for it 👍

@ombschervister
Copy link
Author

@jrbourbeau
conflicts resolved

@ivirshup
Copy link

@ombschervister, just gave this a try. I noticed some errors if tried to create a zarr array using this kind of dtype via create_dataset. I did a bit of digging and think I've got a bit of a handle on what's happening. Basically, zarr.empty passes fill_value=None to create, while Group.create_dataset passes fill_value=0. This leads to problems downstream with encoding and decoding. Here's a proof of concept:

import numpy as np
import zarr

t = [("a", object), ("b", object)]
a = np.array([("lorem", "ipsum"), ("dolor", "sit"), ("amet", "consectetur")], dtype=t)

z = zarr.open({})
z.create("d1", shape=a.shape, dtype=t, fill_value=None, object_codec=zarr.Pickle())
z.create("d2", shape=a.shape, dtype=t, fill_value=0, object_codec=zarr.Pickle())
Traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/github/zarr-python/zarr/meta.py in decode_array_metadata(s)
     41         dtype = decode_dtype(meta['dtype'])
---> 42         fill_value = decode_fill_value(meta['fill_value'], dtype)
     43         meta = dict(

~/github/zarr-python/zarr/meta.py in decode_fill_value(v, dtype)
    154         v = base64.standard_b64decode(v)
--> 155         v = np.array(v, dtype=dtype.str).view(dtype)[()]
    156         return v

/usr/local/lib/python3.7/site-packages/numpy/core/_internal.py in _view_is_safe(oldtype, newtype)
    460     if newtype.hasobject or oldtype.hasobject:
--> 461         raise TypeError("Cannot change data-type for object array.")
    462     return

TypeError: Cannot change data-type for object array.

During handling of the above exception, another exception occurred:

MetadataError                             Traceback (most recent call last)
<ipython-input-2-3debf3edfea5> in <module>
----> 1 z.create("d2", shape=a.shape, dtype=t, fill_value=0, object_codec=zarr.Pickle())

~/github/zarr-python/zarr/hierarchy.py in create(self, name, **kwargs)
    862         """Create an array. Keyword arguments as per
    863         :func:`zarr.creation.create`."""
--> 864         return self._write_op(self._create_nosync, name, **kwargs)
    865 
    866     def _create_nosync(self, name, **kwargs):

~/github/zarr-python/zarr/hierarchy.py in _write_op(self, f, *args, **kwargs)
    641 
    642         with lock:
--> 643             return f(*args, **kwargs)
    644 
    645     def create_group(self, name, overwrite=False):

~/github/zarr-python/zarr/hierarchy.py in _create_nosync(self, name, **kwargs)
    869         kwargs.setdefault('cache_attrs', self.attrs.cache)
    870         return create(store=self._store, path=path, chunk_store=self._chunk_store,
--> 871                       **kwargs)
    872 
    873     def empty(self, name, **kwargs):

~/github/zarr-python/zarr/creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, **kwargs)
    122     # instantiate array
    123     z = Array(store, path=path, chunk_store=chunk_store, synchronizer=synchronizer,
--> 124               cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only)
    125 
    126     return z

~/github/zarr-python/zarr/core.py in __init__(self, store, path, read_only, chunk_store, synchronizer, cache_metadata, cache_attrs)
    122 
    123         # initialize metadata
--> 124         self._load_metadata()
    125 
    126         # initialize attributes

~/github/zarr-python/zarr/core.py in _load_metadata(self)
    139         """(Re)load metadata from store."""
    140         if self._synchronizer is None:
--> 141             self._load_metadata_nosync()
    142         else:
    143             mkey = self._key_prefix + array_meta_key

~/github/zarr-python/zarr/core.py in _load_metadata_nosync(self)
    154 
    155             # decode and store metadata as instance members
--> 156             meta = decode_array_metadata(meta_bytes)
    157             self._meta = meta
    158             self._shape = meta['shape']

~/github/zarr-python/zarr/meta.py in decode_array_metadata(s)
     52         )
     53     except Exception as e:
---> 54         raise MetadataError('error decoding metadata: %s' % e)
     55     else:
     56         return meta

MetadataError: error decoding metadata: Cannot change data-type for object array.

This error would also be thrown if I called: z.create_dataset("d2", shape=a.shape, dtype=t, object_codec=zarr.Pickle()). I think are due to the default fill value. In particular these lines:

zarr-python/zarr/util.py

Lines 238 to 241 in 3964f10

elif fill_value == 0:
# this should be compatible across numpy versions for any array type, including
# structured arrays
fill_value = np.zeros((), dtype=dtype)[()]

zarr-python/zarr/meta.py

Lines 153 to 156 in 58b1786

elif dtype.kind == 'V':
v = base64.standard_b64decode(v)
v = np.array(v, dtype=dtype.str).view(dtype)[()]
return v

In short, you can't call .view on a numpy array whose dtype contains object, which is what decode_fill_value tries to do unless fill_value is None. I'm not sure the value is being encoded/ decoded well either. My impression is either non-None fill values can't be allowed for structured dtypes with objects, or there needs to be a different way to encode and decode fill values for this kind of dtype.

@joshmoore
Copy link
Member

Migrated to #702

@joshmoore joshmoore closed this Feb 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Numpy structured arrays containing an object
7 participants