SparseDataset().append() unexpected behavior #453

Closed

Abdul-Moeed opened this issue Nov 14, 2020 · 9 comments

@Abdul-Moeed

Abdul-Moeed commented Nov 14, 2020

I am trying to append sparse datasets to one (large) dataset. It works as expected for a while (up to ~2147483647 nonzero elements), but then it stops appending nonzero elements without any errors/warnings and fills everything afterwards with 0s.

The problem is even stranger: while investigating append(), I found the following:

In the script:

X = SparseDataset(f["X"])  # f: an open h5py.File holding the growing CSR matrix
print(f"Pre-append nnz count: {X[:, :].nnz}")
X.append(new_x)
print(f"Post-append nnz count: {X[:, :].nnz}")

Output:
Pre-append nnz count: 2147483647
Post-append nnz count: 2147483647

In append():

data = self.group["data"]
...
data[orig_data_size:] = sparse_matrix.data
print(f"append() nnz count: {data.shape[0]}")

Output:
append() nnz count: 3359580869 

So apparently the object X in my script has the same nnz count before and after append(), but inside append() the nnz count is correct, i.e. the new data is appended successfully.

@ivirshup
Member

Not sure what exactly is going on, but it probably has to do with the indices and indptr being stored as 32 bit integers (since 2147483647 is the maximum value for int32s).

A few things to investigate/do:

  • Does storing the indices and indptr as 64-bit integers solve this (will scipy try to downcast these?)
  • We should be bounds-checking this (a sketch of such a check follows below)
    • Should this be an error, or an implicit upcast?
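
A minimal sketch of what such a bounds check could look like (my sketch, not the actual anndata implementation; group stands for the HDF5 group backing the SparseDataset):

import numpy as np

def check_append_bounds(group, new_nnz):
    # raise if appending new_nnz entries would overflow the stored index dtype
    index_dtype = group["indptr"].dtype
    max_nnz = np.iinfo(index_dtype).max
    if group["data"].shape[0] + new_nnz > max_nnz:
        raise OverflowError(
            f"append would exceed {max_nnz} stored entries, "
            f"which {index_dtype} cannot address"
        )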

@Abdul-Moeed
Author

I tried to cast indptr and indices to int64 but can't seem to do so, because SparseDataset does not have these as members.

adata.X = scipy.sparse.csr_matrix(adata.X)
adata.X.indptr = adata.X.indptr.astype(np.int64)

Output: AttributeError: 'SparseDataset' object has no attribute 'indptr'

SparseDataset does have a group attribute, which is how append() gets data, indptr and indices. But when I try to change the data type there, I get this error:

Code:

self.group["indices"] = self.group["indices"].astype(np.int64)

Trace:

Traceback (most recent call last):
  File "d:\box\programs\pythonprojects\anndata\anndata\_core\sparse_dataset.py", line 360, in append
    self.group["indices"] = self.group["indices"].astype(np.int64)
  File "D:\Box\Programs\Anaconda\lib\site-packages\h5py\_hl\group.py", line 387, in __setitem__
    ds = self.create_dataset(None, data=obj, dtype=base.guess_dtype(obj))
  File "D:\Box\Programs\Anaconda\lib\site-packages\h5py\_hl\group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "D:\Box\Programs\Anaconda\lib\site-packages\h5py\_hl\dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py\h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py\h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py\h5t.pyx", line 1711, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
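
Presumably this fails because astype on an h5py Dataset returns a lazy view rather than a numpy array, which group.__setitem__ then tries to store as an object dtype. A workaround sketch (my addition, not from the thread; it reads the whole index array into memory, so it assumes the indices fit in RAM):

import numpy as np

# materialize the int32 indices, drop the old dataset, rewrite it as int64;
# maxshape=(None,) keeps the new dataset resizable for future appends
indices64 = self.group["indices"][...].astype(np.int64)
del self.group["indices"]
self.group.create_dataset("indices", data=indices64, maxshape=(None,))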

@ivirshup
Member

But when I try to change data type there, I get this error:

Hmm, I wasn't expecting that, but I've opened an issue at h5py for it: h5py/h5py#1761


I think a quick solution here would be to make sure the growing SparseDataset is initialized with 64-bit indices and indptr. For example, you could do something like:

import h5py
import numpy as np
from scipy import sparse

from anndata._io.h5ad import write_attribute
from anndata._core.sparse_dataset import SparseDataset

f = h5py.File("test.h5", "w")
s = sparse.random(100, 200, format="csr")

# upcast the index arrays before the first write, so the on-disk
# datasets are created as int64
s.indices = s.indices.astype(np.int64)
s.indptr = s.indptr.astype(np.int64)

write_attribute(f, "base_array", s)
s_dset = SparseDataset(f["base_array"])

# Now do your appending
s_dset.append(sparse.random(200, 200, format="csr"))
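
To confirm the on-disk dtypes after this (a quick check I'm adding, not from the original post):

print(f["base_array/indices"].dtype)  # expect int64
print(f["base_array/indptr"].dtype)   # expect int64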

@Abdul-Moeed
Author

Abdul-Moeed commented Nov 17, 2020

Thanks! Would you know whether there's a way to do this with an AnnData object? I make use of obs later on, and I also load this AnnData as a backed object for downstream tasks.

After switching this dataset to a backed object, the matrix becomes dense. Casting the matrix back to scipy.sparse.csr_matrix converts it to anndata's SparseDataset. I'm not sure where in this procedure I can cast the indices and indptr to int64:

Code:

adata = anndata.AnnData(scipy.sparse.csr_matrix((0, n), dtype=np.float32))  # n: number of variables
print(f"Type 1: {type(adata.X)}")
# switch to backed object (filename: path to the backing .h5ad file)
adata.filename = filename
print(f"Type 2: {type(adata.X)}")
adata.X = scipy.sparse.csr_matrix(adata.X)
print(f"Type 3: {type(adata.X)}")

Output:

Type 1:  <class 'scipy.sparse.csr.csr_matrix'>
Type 2:  <class 'h5py._hl.dataset.Dataset'>
Type 3:  <class 'anndata._core.sparse_dataset.SparseDataset'>

@ivirshup
Member

ivirshup commented Nov 18, 2020

You can set the datatypes of the indices and indptr on the matrix in the anndata object (i.e. adata.X.indices = adata.X.indices.astype(np.int64)), and then write it. So something like:

adata.X.indices = adata.X.indices.astype(np.int64)
adata.X.indptr = adata.X.indptr.astype(np.int64)
adata.write_h5ad("path.h5ad")
backed = ad.read_h5ad("path.h5ad", backed="r")
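
A quick sanity check on the written file (my addition; it assumes the h5ad layout anndata uses for CSR matrices, where X is a group holding data, indices and indptr):

import h5py

with h5py.File("path.h5ad", "r") as f:
    print(f["X/indices"].dtype, f["X/indptr"].dtype)  # expect int64 int64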

A few points of caution:

  • It seems like scipy will cast the indices and indptr back to int32 pretty easily unless you have enough points (e.g. X.copy().indices.dtype can come back as int32). This is weird design, but something to watch out for — a small guard helper is sketched after this list.
  • Backed mode is not as well developed as the rest of the library, and you may run into limitations quite quickly. What are you trying to do?
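
A hypothetical guard for that first point (not from the thread; it just re-upcasts after any scipy operation that may have downcast the index arrays):

import numpy as np

def ensure_int64_indices(m):
    # re-upcast a CSR/CSC matrix's index arrays to int64 in place
    if m.indices.dtype != np.int64:
        m.indices = m.indices.astype(np.int64)
    if m.indptr.dtype != np.int64:
        m.indptr = m.indptr.astype(np.int64)
    return m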

I'm pretty surprised by the results you get from print(f"Type 3: {type(adata.X)}"). That's not what I get, and isn't what I would expect.

@Abdul-Moeed
Author

That's strange, here's a screenshot:
[screenshot: the three type printouts, matching the output shown above]

I'm on version 0.7.5, perhaps you're on a different version.

As you can see from the 3rd print statement, I get an anndata._core.sparse_dataset.SparseDataset object. Casting indices and indptr for this object throws the error I mentioned before:

Output: AttributeError: 'SparseDataset' object has no attribute 'indptr'

I'm basically trying to combine multiple datasets into one large one for training models. Because the combined dataset is so large, I have to store it as a backed file so I don't load it all into memory at once during training.

@ivirshup
Member

Ah, I see what's happening now. This is part of the current backed interface, which I think is a bit confusing. When you call sparse.csr_matrix(adata.X), the entire dense matrix is being read into memory, then converted to an in-memory sparse matrix. When that value is assigned to adata.X it's written to disk, since adata is in backed mode. Here's a demonstration:

import anndata as ad
from scipy import sparse

a = ad.AnnData(sparse.random(50, 20, format="csr", density=0.1))
print(type(a.X))
# <class 'scipy.sparse.csr.csr_matrix'>
a.filename = "test_backed.h5ad"
print(type(a.X))
# <class 'h5py._hl.dataset.Dataset'>
X = sparse.csr_matrix(a.X)
print(type(X))
# <class 'scipy.sparse.csr.csr_matrix'>
a.X = X
print(type(a.X))
# <class 'anndata._core.sparse_dataset.SparseDataset'>

Setting adata.filename will always write X as a dense array, since we can do more kinds of operations on backed dense arrays. The preferred API is to get a backed AnnData by reading the file in as backed from the start (e.g. anndata.read_h5ad(..., backed="r")).


As for your use case, that sounds good. There just aren't many operations defined on SparseDataset objects, and indexing along the non-compressed axis will be very slow.
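
For CSR that means iterating rows in contiguous chunks, which stays on the compressed axis (a sketch, assuming backed was opened with anndata.read_h5ad(..., backed="r") and X is stored as CSR):

import numpy as np

batch_size = 1024
for start in range(0, backed.n_obs, batch_size):
    batch = backed.X[start:start + batch_size]  # row slice: cheap for CSR
    dense = np.asarray(batch.todense())         # densify only the small batch
    # ... feed dense to the model ...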

I think we'll have a more polished API for this use case soon. Would you mind if I pinged you to take a look once it's ready for some more eyes on it?

@Abdul-Moeed
Author

Thanks a lot, it works now! And feel free to reach out to me once the new API is in place.

@ivirshup
Member

Great! Glad to help.

le-ander pushed a commit to theislab/sfaira that referenced this issue Nov 20, 2020
davidsebfischer added a commit to theislab/sfaira that referenced this issue Dec 10, 2020
davidsebfischer added a commit to theislab/sfaira that referenced this issue Dec 10, 2020