Can't read or write h5ad files that contain booleans columns with nulls (None) #1258
Comments
Thanks for opening the issue. In the first case it looks like pandas isn't actually inferring the dtype, so this is somewhat expected. In contrast, this works:
import anndata
import numpy as np
import pandas as pd
print(anndata.__version__)
# '0.10.3'
adata = anndata.AnnData(
X=None,
obs=pd.DataFrame({
"test_bool_null": pd.array([True, False, None, False]),
}),
)
adata.write_h5ad("test.h5ad")
anndata.read_h5ad("test.h5ad").obs
I believe this doesn't get inferred with the code you wrote because pandas currently marks its nullable boolean type as experimental (docs). We could probably think about trying to infer this at write time.
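To illustrate the inference gap described above, here is a minimal sketch (assuming a reasonably recent pandas; the column name is only reused from the example for convenience). It shows that a plain list containing `None` ends up as an `object` column, while an explicit cast yields the nullable `boolean` dtype that the working example obtains through `pd.array`:

```python
import pandas as pd

# A plain Python list containing None is stored with dtype "object";
# pandas does not choose the nullable boolean dtype on its own here.
df = pd.DataFrame({"test_bool_null": [True, False, None, False]})
inferred_dtype = str(df["test_bool_null"].dtype)

# An explicit cast gives the nullable "boolean" extension dtype,
# with None converted to a missing value.
converted = df.astype({"test_bool_null": "boolean"})
converted_dtype = str(converted["test_bool_null"].dtype)
n_missing = int(converted["test_bool_null"].isna().sum())

print(inferred_dtype, converted_dtype, n_missing)
```

Casting before constructing the `AnnData` object is a user-side workaround only; it is not a description of what anndata does at write time.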
Which R interface? Since I can read a nullable boolean array written by this library, I can't address this without more information: at the least, which library wrote the file, and ideally a demonstration file. That is a curious error though.
Thanks @ivirshup, here is a minimal example reproducing this issue using anndataR. I just installed the most recent version of anndataR from GitHub, so the version is 0.99.0 for this package. Create an h5ad file from R:
library(anndataR)
ad <- AnnData(
X = matrix(1:15, 3L, 5L),
obs = data.frame(cell = 1:3, bool_null = c(NA, NA, TRUE)),
var = data.frame(gene = 1:5),
obs_names = LETTERS[1:3],
var_names = letters[1:5]
)
write_h5ad(ad, path = "test_bool_R.h5ad")
Trying to read this in Python:
anndata.read_h5ad("test_bool_R.h5ad")
Traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[4], line 1
----> 1 anndata.read_h5ad("test_bool_R.h5ad")
File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:254, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
251 return read_dataframe(elem)
252 return func(elem)
--> 254 adata = read_dispatched(f, callback=callback)
256 # Backwards compat (should figure out which version)
257 if "raw.X" in f:
File ~/.local/lib/python3.9/site-packages/anndata/experimental/_dispatch_io.py:46, in read_dispatched(elem, callback)
42 from anndata._io.specs import _REGISTRY, Reader
44 reader = Reader(_REGISTRY, callback=callback)
---> 46 return reader.read_elem(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
203 break
204 try:
--> 205 return func(*args, **kwargs)
206 except Exception as e:
207 add_key_note(e, elem, elem.name, "read")
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:249, in Reader.read_elem(self, elem, modifiers)
247 read_func = partial(read_func, _reader=self)
248 if self.callback is not None:
--> 249 return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
250 else:
251 return read_func(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:235, in read_h5ad.<locals>.callback(func, elem_name, elem, iospec)
232 def callback(func, elem_name: str, elem, iospec):
233 if iospec.encoding_type == "anndata" or elem_name.endswith("/"):
234 return AnnData(
--> 235 **{
236 # This is covering up backwards compat in the anndata initializer
237 # In most cases we should be able to call `func(elen[k])` instead
238 k: read_dispatched(elem[k], callback)
239 for k in elem.keys()
240 if not k.startswith("raw.")
241 }
242 )
243 elif elem_name.startswith("/raw."):
244 return None
File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:238, in <dictcomp>(.0)
232 def callback(func, elem_name: str, elem, iospec):
233 if iospec.encoding_type == "anndata" or elem_name.endswith("/"):
234 return AnnData(
235 **{
236 # This is covering up backwards compat in the anndata initializer
237 # In most cases we should be able to call `func(elen[k])` instead
--> 238 k: read_dispatched(elem[k], callback)
239 for k in elem.keys()
240 if not k.startswith("raw.")
241 }
242 )
243 elif elem_name.startswith("/raw."):
244 return None
File ~/.local/lib/python3.9/site-packages/anndata/experimental/_dispatch_io.py:46, in read_dispatched(elem, callback)
42 from anndata._io.specs import _REGISTRY, Reader
44 reader = Reader(_REGISTRY, callback=callback)
---> 46 return reader.read_elem(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
203 break
204 try:
--> 205 return func(*args, **kwargs)
206 except Exception as e:
207 add_key_note(e, elem, elem.name, "read")
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:249, in Reader.read_elem(self, elem, modifiers)
247 read_func = partial(read_func, _reader=self)
248 if self.callback is not None:
--> 249 return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
250 else:
251 return read_func(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:251, in read_h5ad.<locals>.callback(func, elem_name, elem, iospec)
248 return _read_raw(f, as_sparse, rdasp)
249 elif elem_name in {"/obs", "/var"}:
250 # Backwards compat
--> 251 return read_dataframe(elem)
252 return func(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:313, in read_dataframe(group)
311 return read_dataframe_legacy(group)
312 else:
--> 313 return read_elem(group)
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:341, in read_elem(elem)
329 def read_elem(elem: StorageType) -> Any:
330 """
331 Read an element from a store.
332
(...)
339 The stored element.
340 """
--> 341 return Reader(_REGISTRY).read_elem(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
203 break
204 try:
--> 205 return func(*args, **kwargs)
206 except Exception as e:
207 add_key_note(e, elem, elem.name, "read")
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:251, in Reader.read_elem(self, elem, modifiers)
249 return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
250 else:
--> 251 return read_func(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:694, in read_dataframe(elem, _reader)
691 columns = list(_read_attr(elem.attrs, "column-order"))
692 idx_key = _read_attr(elem.attrs, "_index")
693 df = pd.DataFrame(
--> 694 {k: _reader.read_elem(elem[k]) for k in columns},
695 index=_reader.read_elem(elem[idx_key]),
696 columns=columns if len(columns) else None,
697 )
698 if idx_key != "_index":
699 df.index.name = idx_key
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:694, in <dictcomp>(.0)
691 columns = list(_read_attr(elem.attrs, "column-order"))
692 idx_key = _read_attr(elem.attrs, "_index")
693 df = pd.DataFrame(
--> 694 {k: _reader.read_elem(elem[k]) for k in columns},
695 index=_reader.read_elem(elem[idx_key]),
696 columns=columns if len(columns) else None,
697 )
698 if idx_key != "_index":
699 df.index.name = idx_key
File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
203 break
204 try:
--> 205 return func(*args, **kwargs)
206 except Exception as e:
207 add_key_note(e, elem, elem.name, "read")
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:251, in Reader.read_elem(self, elem, modifiers)
249 return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
250 else:
--> 251 return read_func(elem)
File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:852, in read_nullable_boolean(elem, _reader)
848 @_REGISTRY.register_read(H5Group, IOSpec("nullable-boolean", "0.1.0"))
849 @_REGISTRY.register_read(ZarrGroup, IOSpec("nullable-boolean", "0.1.0"))
850 def read_nullable_boolean(elem, _reader):
851 if "mask" in elem:
--> 852 return pd.arrays.BooleanArray(
853 _reader.read_elem(elem["values"]), mask=_reader.read_elem(elem["mask"])
854 )
855 else:
856 return pd.array(_reader.read_elem(elem["values"]))
File /apps/user/gpy/envs/dev/GPy39/lib/python3.9/site-packages/pandas/core/arrays/boolean.py:299, in BooleanArray.__init__(self, values, mask, copy)
295 def __init__(
296 self, values: np.ndarray, mask: np.ndarray, copy: bool = False
297 ) -> None:
298 if not (isinstance(values, np.ndarray) and values.dtype == np.bool_):
--> 299 raise TypeError(
300 "values should be boolean numpy array. Use "
301 "the 'pd.array' function instead"
302 )
303 self._dtype = BooleanDtype()
304 super().__init__(values, mask, copy=copy)
TypeError: values should be boolean numpy array. Use the 'pd.array' function instead
Error raised while reading key '/obs/bool_null' of <class 'h5py._hl.group.Group'> to /
I think this may be an issue with anndataR, where it's not writing the array in a way that is read back as a boolean array. @rcannood / @lazappi wdyt? Is this feature meant to work already in anndataR? I believe the issue is that anndata is expecting h5py to recognize an enumerated boolean type, which I thought rhdf5 had implemented.
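As a rough illustration of the enumerated boolean type mentioned above, the following sketch (assuming h5py 3.x; the file and dataset names are made up for the demo) writes the same three values once as NumPy booleans and once as 8-bit integers, then checks which dtype h5py reports for each:

```python
import h5py
import numpy as np

# In-memory HDF5 file; with backing_store=False nothing is written to disk.
with h5py.File("enum_demo.h5", "w", driver="core", backing_store=False) as f:
    # h5py stores NumPy booleans as an HDF5 enum (FALSE=0, TRUE=1)
    # and maps that enum back to a boolean dtype when reading.
    f.create_dataset("as_bool", data=np.array([False, False, True]))
    # Plain 8-bit integers stay integers, even if they only hold 0 and 1.
    f.create_dataset("as_int8", data=np.array([0, 0, 1], dtype=np.int8))

    bool_dtype = f["as_bool"].dtype
    int_dtype = f["as_int8"].dtype

print(bool_dtype, int_dtype)
```

A writer that stores 0/1 integers instead of the enum would therefore hand integer arrays to the reader, which would be consistent with the `TypeError` in the traceback.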
I almost wonder if it has to do with this:
import h5py
import pandas as pd
import numpy as np
file = h5py.File("./test_bool_R.h5ad")
file["obs/bool_null"]["values"] # <HDF5 dataset "values": shape (3,), type "|i1">
file["obs/bool_null"]["values"][:] # array([0, 0, 1], dtype=int8)
vals = np.array(file["obs/bool_null"]["values"][:], dtype=bool)
# array([False, False, True])
file["obs/bool_null"]["mask"] # <HDF5 dataset "mask": shape (3,), type "|i1">
mask=np.array(file["obs/bool_null"]["mask"][:], dtype=bool)
# array([ True, True, False])
pd.arrays.BooleanArray(vals, mask=mask)
# <BooleanArray>
# [<NA>, <NA>, True]
# Length: 3, dtype: boolean
For now, I'm going to cast these vectors to satisfy pandas - jkanche@a16d924
@jkanche @ivirshup boolean enums are not implemented in rhdf5 yet, see grimbough/rhdf5#136. There are quite a few other issues that we're in the process of resolving. Until anndataR is released, I can't recommend using it. If you'd like an anndata-like interface in R, you could use anndata from CRAN for now.
Thanks for the input! I would say the writing from R is an issue in the anndataR package rather than here. Maybe we could broaden what's allowed in the future (e.g. try to interpret integers of any byte width as bool), but the spec does say boolean array right now. I think inferring nullable boolean types on our side is more of a feature request. I will add that to some other planned work about inferring data types better.
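The "broaden what's allowed" idea could look roughly like the sketch below. `to_nullable_boolean` is a hypothetical helper, not part of anndata or pandas; it assumes integer arrays are acceptable only when every entry is 0 or 1, and the example inputs mirror the `values` and `mask` arrays shown in the h5py inspection earlier in the thread:

```python
import numpy as np
import pandas as pd


def to_nullable_boolean(values, mask):
    """Hypothetical lenient constructor (not an anndata or pandas API).

    Accepts boolean arrays, or integer arrays holding only 0 and 1,
    and returns a pandas nullable boolean array.
    """
    values = np.asarray(values)
    mask = np.asarray(mask)
    for name, arr in (("values", values), ("mask", mask)):
        if arr.dtype == np.bool_:
            continue
        if not np.issubdtype(arr.dtype, np.integer):
            raise TypeError(f"{name} must be boolean or integer, got {arr.dtype}")
        if not np.isin(arr, (0, 1)).all():
            raise ValueError(f"{name} contains entries other than 0 and 1")
    # BooleanArray itself insists on boolean NumPy arrays, so cast first.
    return pd.arrays.BooleanArray(values.astype(bool), mask.astype(bool))


arr = to_nullable_boolean(
    np.array([0, 0, 1], dtype=np.int8),  # values as stored in the R-written file
    np.array([1, 1, 0], dtype=np.int8),  # mask: 1 marks a missing entry
)
print(arr)
```

Rejecting integers other than 0 and 1 is a design choice made for this sketch, so that arbitrary integer columns are not silently reinterpreted as booleans.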
The R {anndata} part might also be related to the {rhdf5} version installed. More enum support was only added in the latest release (there is still the issue with attributes linked above though).
Report
Tested this on both anndata version 0.8.0 and the latest 0.10.3 releases.
Code:
Traceback:
Similar issue if the file already contains a boolean array with null values (written through the R interface)
Traceback:
Versions