Scenario:
1. A user has an AnnData obs with a string-valued column containing a large number of distinct values (i.e. there is no reason for it to be categorical).
2. They call adata.write_h5ad(...) and later adata = anndata.read_h5ad(...).
3. AnnData force-converts this column to a categorical within the read-back adata.obs Pandas DataFrame -- before we ever see it.
4. Then, in tiledbsoma.io.from_anndata, we are too trusting: given a categorical column in AnnData adata.obs, we make a categorical column in TileDB-SOMA.
5. For cases like 4M distinct values in a categorical column, this faithful/trusting approach is a performance negative.
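The waste is easy to see with pandas alone: when every value in a string column is distinct, the categorical encoding's dictionary must store every one of those values anyway, so nothing is deduplicated. A small illustration (the column contents are made up for demonstration):

```python
import numpy as np
import pandas as pd

n = 100_000

# All-distinct strings: the categorical dictionary ends up holding every
# value, plus an integer-codes array on top -- no savings at all.
distinct = pd.Series([f"cell_{i}" for i in range(n)])
as_cat = distinct.astype("category")
assert as_cat.cat.categories.size == n

# By contrast, a low-cardinality column is where categoricals pay off:
# three dictionary entries, and each row is just a small integer code.
low_card = pd.Series(np.random.choice(["a", "b", "c"], size=n)).astype("category")
assert low_card.cat.categories.size == 3
```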
Solution:
The tiledbsoma.io ingest code should inspect each categorical column, and if the number of categories exceeds some threshold (2^12 and 2^15 - 1 have both been proposed), store it as a simple primitive-typed column, i.e., with no enumerations.
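A minimal pandas-only sketch of that inspect-and-decategorize step (the function name and threshold constant are hypothetical, not the actual tiledbsoma.io API):

```python
import pandas as pd

# Hypothetical cutoff; 2**12 is one of the proposed values.
CATEGORY_THRESHOLD = 2**12

def decategorize_large_columns(df: pd.DataFrame,
                               threshold: int = CATEGORY_THRESHOLD) -> pd.DataFrame:
    """Convert any categorical column whose category count exceeds the
    threshold back to its underlying primitive dtype, so it would be
    ingested as a plain column with no enumeration."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if isinstance(dtype, pd.CategoricalDtype) and len(dtype.categories) > threshold:
            # categories.dtype is the primitive dtype (e.g. object for strings).
            out[col] = out[col].astype(dtype.categories.dtype)
    return out
```

Columns at or below the threshold keep their categorical dtype and would still be ingested as enumerations.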
One negative is that outgesting back to AnnData would then produce a plain string column where the input had a categorical one. On discussion we believe:
- Cases of huge enumerations are more likely to be mistakes than intentional.
- If we were to keep a metadata 'outgest hint' on tiledbsoma dataframe string columns, saying "convert back to categorical on outgest" -- this would be prohibitive if the input were ordered. And if we can't do it for ordered categoricals, we shouldn't do it for non-ordered ones (that would be confusing).
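A small pandas example of why the ordered case is the sticking point: once an ordered categorical has been flattened to plain strings, the original category order cannot be recovered from the values alone, so a simple "convert back to categorical" hint would silently lose it:

```python
import pandas as pd

# An unordered categorical round-trips through strings tolerably well:
# re-categorizing recovers the same (sorted) category set.
s = pd.Series(["a", "b", "a"], dtype="category")
back = s.astype(str).astype("category")
assert list(back.cat.categories) == ["a", "b"]

# But for an *ordered* categorical, the category order ("b" < "a" here)
# lives only in the dtype, not in the values -- re-categorizing the
# flattened strings yields sorted, unordered categories instead.
ordered = pd.Series(["b", "a"],
                    dtype=pd.CategoricalDtype(["b", "a"], ordered=True))
round_trip = ordered.astype(str).astype("category")
assert list(round_trip.cat.categories) != list(ordered.cat.categories)
assert not round_trip.cat.ordered
```

Preserving the order would mean storing the full category list in metadata, which is exactly what becomes prohibitive for huge enumerations.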
johnkerl changed the title (Nov 20, 2024):
[python] Large enumerations in tiledbsoma.io.from_anndata (performance issue) →
[python] Performance hit for large enums in tiledbsoma.io.from_anndata →
[python] Performance hit for large enums on ingest from AnnData →
[python] Fix perf. hit for large enums on ingest from AnnData
[sc-59407] [sc-59595]