Scenario:
1. A user has an AnnData obs with a string-valued column containing a large number of distinct values (i.e. there is no reason for it to be categorical).
2. They call adata.write_h5ad(...) and later adata = anndata.read_h5ad(...).
3. AnnData force-converts this column to a categorical within the read-back adata.obs Pandas DataFrame -- before we ever see it.
4. Then, in tiledbsoma.io.from_anndata, we are too trusting: given a categorical column in AnnData adata.obs, we make a categorical column in TileDB-SOMA.
5. For cases like 4M distinct values in a categorical column, this faithful/trusting approach is a performance negative.
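The waste is easy to see with pandas alone: when every value in a string column is distinct, the categorical encoding's dictionary must store every one of those values anyway, so nothing is deduplicated. A small illustration (the column contents are made up for demonstration):

```python
import numpy as np
import pandas as pd

n = 100_000

# All-distinct strings: the categorical dictionary ends up holding every
# value, plus an integer-codes array on top -- no savings at all.
distinct = pd.Series([f"cell_{i}" for i in range(n)])
as_cat = distinct.astype("category")
assert as_cat.cat.categories.size == n

# By contrast, a low-cardinality column is where categoricals pay off:
# three dictionary entries, and each row is just a small integer code.
low_card = pd.Series(np.random.choice(["a", "b", "c"], size=n)).astype("category")
assert low_card.cat.categories.size == 3
```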
Solution:
The tiledbsoma.io ingest code should inspect each categorical column, and if the number of categories exceeds some threshold (2^12 and 2^15 - 1 have both been proposed), store it as a simple primitive-typed column, i.e., with no enumerations.
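A minimal pandas-only sketch of that inspect-and-decategorize step (the function name and threshold constant are hypothetical, not the actual tiledbsoma.io API):

```python
import pandas as pd

# Hypothetical cutoff; 2**12 is one of the proposed values.
CATEGORY_THRESHOLD = 2**12

def decategorize_large_columns(df: pd.DataFrame,
                               threshold: int = CATEGORY_THRESHOLD) -> pd.DataFrame:
    """Convert any categorical column whose category count exceeds the
    threshold back to its underlying primitive dtype, so it would be
    ingested as a plain column with no enumeration."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if isinstance(dtype, pd.CategoricalDtype) and len(dtype.categories) > threshold:
            # categories.dtype is the primitive dtype (e.g. object for strings).
            out[col] = out[col].astype(dtype.categories.dtype)
    return out
```

Columns at or below the threshold keep their categorical dtype and would still be ingested as enumerations.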
One negative is that outgesting back to AnnData would then produce a plain string column where the input had a categorical one. On discussion we believe:
- Cases of huge enumerations are more likely to be mistakes than intentional.
- If we were to keep a metadata 'outgest hint' on tiledbsoma dataframe string columns, saying "convert back to categorical on outgest" -- this would be prohibitive if the input were ordered. And if we can't do it for ordered categoricals, we shouldn't do it for non-ordered ones (that would be confusing).
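A small pandas example of why the ordered case is the sticking point: once an ordered categorical has been flattened to plain strings, the original category order cannot be recovered from the values alone, so a simple "convert back to categorical" hint would silently lose it:

```python
import pandas as pd

# An unordered categorical round-trips through strings tolerably well:
# re-categorizing recovers the same (sorted) category set.
s = pd.Series(["a", "b", "a"], dtype="category")
back = s.astype(str).astype("category")
assert list(back.cat.categories) == ["a", "b"]

# But for an *ordered* categorical, the category order ("b" < "a" here)
# lives only in the dtype, not in the values -- re-categorizing the
# flattened strings yields sorted, unordered categories instead.
ordered = pd.Series(["b", "a"],
                    dtype=pd.CategoricalDtype(["b", "a"], ordered=True))
round_trip = ordered.astype(str).astype("category")
assert list(round_trip.cat.categories) != list(ordered.cat.categories)
assert not round_trip.cat.ordered
```

Preserving the order would mean storing the full category list in metadata, which is exactly what becomes prohibitive for huge enumerations.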
johnkerl changed the title (Nov 20, 2024):
[python] Large enumerations in tiledbsoma.io.from_anndata (performance issue) →
[python] Performance hit for large enums in tiledbsoma.io.from_anndata →
[python] Performance hit for large enums on ingest from AnnData →
[python] Fix perf. hit for large enums on ingest from AnnData
[sc-59407] [sc-59595]