[python] Fix perf. hit for large enums on ingest from AnnData #3353

Open
johnkerl opened this issue Nov 20, 2024 · 0 comments

johnkerl commented Nov 20, 2024

Scenario:

  • User has an AnnData obs with a string-valued column with a large number of distinct values (i.e. there is no reason to have this be categorical)
  • They call adata.write_h5ad(...) and later adata = anndata.read_h5ad(...)
  • AnnData has force-converted this to a categorical within the read-back adata.obs Pandas DataFrame -- before we ever see it
  • Then, at tiledbsoma.io.from_anndata, we are too trusting: given a categorical column in AnnData adata.obs, we make a categorical column in TileDB-SOMA
  • For cases like 4M distinct values in a categorical column, this faithful/trusting approach hurts performance

Solution:

  • The tiledbsoma.io ingest code should inspect each categorical column, and if the number of categories exceeds some threshold (2^12 and 2^15 - 1 have both been proposed), store it as a simple primitive-typed column, i.e., with no enumerations.
  • One negative is that outgesting back to AnnData would produce a string column, while the input had categorical. On discussion we believe:
    • Cases of huge enumerations are more likely to be mistakes than intentional
    • If we were to keep a metadata 'outgest hint' for tiledbsoma dataframe string columns, saying to convert back to categorical on outgest, this would be prohibitive if the input were ordered. And if we can't do it for ordered, we shouldn't do it for non-ordered (that would be confusing).
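
A minimal sketch of the proposed check, using only pandas and not the actual tiledbsoma.io code. The helper name `decategoricalize_large_enums` and the constant `MAX_CATEGORIES` are hypothetical, and the default threshold is just one of the two values proposed above. It assumes the check can run on the obs DataFrame before ingest:

```python
import pandas as pd

# Hypothetical threshold: 2^12 and 2^15 - 1 were both proposed in this issue.
MAX_CATEGORIES = 2**15 - 1


def decategoricalize_large_enums(
    df: pd.DataFrame, max_categories: int = MAX_CATEGORIES
) -> pd.DataFrame:
    """Return a copy of df in which oversized categorical columns are plain columns."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        is_categorical = isinstance(col.dtype, pd.CategoricalDtype)
        if is_categorical and len(col.cat.categories) > max_categories:
            # Cast back to the dtype of the categories (object/str for strings),
            # which drops the enumeration and any ordering.
            out[name] = col.astype(col.cat.categories.dtype)
    return out


# Toy stand-in for adata.obs: one near-unique column, one genuinely categorical.
obs = pd.DataFrame(
    {
        "barcode": pd.Categorical([f"cell-{i}" for i in range(10)]),
        "cell_type": pd.Categorical(["B", "T"] * 5),
    }
)
converted = decategoricalize_large_enums(obs, max_categories=4)
print(converted.dtypes)
```

With the toy threshold of 4, `barcode` (10 categories) comes back as a plain string column while `cell_type` (2 categories) stays categorical. As noted above, the cast is lossy for ordered categoricals, since the ordering is not preserved.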

[sc-59407] [sc-59595]

@johnkerl johnkerl changed the title [python] Large enumerations in tiledbsoma.io.from_anndata (performance issue) [python] Performance hit for large enums in tiledbsoma.io.from_anndata Nov 20, 2024
@johnkerl johnkerl self-assigned this Nov 20, 2024
@johnkerl johnkerl changed the title [python] Performance hit for large enums in tiledbsoma.io.from_anndata [python] Performance hit for large enums on ingest from AnnData Nov 20, 2024
@johnkerl johnkerl changed the title [python] Performance hit for large enums on ingest from AnnData [python] Fix perf. hit for large enums on ingest from AnnData Nov 20, 2024