Zarr structured arrays with strings #249

ivirshup · 2019-11-15T06:21:55Z

AFAIK the last blocking bug for 0.7. Originally reported in scverse/scanpy#832.

Minimal reproducer:

import scanpy as sc
pbmc = sc.datasets.pbmc68k_reduced()
pbmc.write("tmp.h5ad")
fromdisk = sc.read("tmp.h5ad")  # Do we read okay
fromdisk.write("tmp.h5ad")  # Can we round trip

The issue here is with structured numpy arrays and the variety of string types. It didn't get caught earlier because these are a bit of a pain to actually instantiate... A brief summary of the conflict (copied from the earlier issue):

h5py doesn't do fixed length unicode strings
h5py does do variable length unicode strings, pretty much anywhere
zarr doesn't do variable length strings in structured arrays
We probably don't actually want to use fixed length unicode strings much. Bytestrings, more likely.
We can probably just add another element type to allow special handling for these. I think it'd be fine to not do np.str_ type arrays.
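To make the string-type distinction concrete, here is a minimal numpy-only sketch. It builds a structured array with a fixed-length unicode field and converts that field to a bytestring field. The helper name `unicode_fields_to_bytes` and the sample data are hypothetical, made up for illustration; this is not the fix that went into anndata, just one way the "bytestrings, more likely" idea could look.

```python
import numpy as np

# Sample structured array: a fixed-length unicode field ("U5") plus a float field.
# The field names and values are made up for illustration.
rec = np.array(
    [("CD4", 1.5), ("CD8", 2.0)],
    dtype=[("names", "U5"), ("scores", "f4")],
)


def unicode_fields_to_bytes(arr):
    """Return a copy of `arr` with fixed-length unicode fields stored as UTF-8 bytestrings.

    Hypothetical helper, not part of anndata, h5py, or zarr.
    """
    new_descr = []
    for name in arr.dtype.names:
        field_dtype = arr.dtype[name]
        if field_dtype.kind == "U":
            # numpy stores "U" fields with 4 bytes per code point, and UTF-8 needs
            # at most 4 bytes per code point, so the same itemsize is always enough.
            new_descr.append((name, f"S{field_dtype.itemsize}"))
        else:
            new_descr.append((name, field_dtype))

    out = np.empty(arr.shape, dtype=new_descr)
    for name in arr.dtype.names:
        if arr.dtype[name].kind == "U":
            out[name] = np.char.encode(arr[name], "utf-8")
        else:
            out[name] = arr[name]
    return out


converted = unicode_fields_to_bytes(rec)
print(rec.dtype)        # original dtype, with a "U" field
print(converted.dtype)  # same layout, but the string field is now an "S" field
```

The trade-off is that readers get `bytes` back and have to decode them, which is part of why treating all unicode strings as variable length is attractive where the storage backend supports it.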

This is pretty easy to fix for hdf5 if we just say all unicode strings are variable length. Zarr has an open pull request to support this (zarr-developers/zarr-python#422).
The question is whether we wait for a zarr release to keep the formats consistent. Waiting is the simplest solution, and probably what we should go with once the release is available. The problem is that if it's not available yet, we end up with some intermediary solution, which complicates backwards compatibility.

ivirshup mentioned this issue Nov 18, 2019

Fix structured array IO #252

Merged

flying-sheep closed this as completed in #252 Nov 25, 2019
