TypeError when writing table column with mixed string / NaN #399

Open
aeisenbarth opened this issue Nov 9, 2023 · 3 comments
@aeisenbarth
Contributor

When writing a SpatialData object whose table has a column containing both string values and empty values (NaN), a TypeError is raised.

Empty values are a very common use case:

  • When two regions have fundamentally different annotations, there is no meaningful value to fill in for the other region. This is often represented as None or NaN. For example, region "cells" could have a column "cell_diameter", but region "well" cannot have "cell_diameter".
  • When concatenating tables where some columns exist in only one of the source tables. Due to the single-table design, SpatialData needs to fill these columns, which it does with NaN.
    This can also be a serious problem for integer columns, which cannot hold NaN: pandas silently changes the dtype to float. Anyone reading values from the supposedly integer column must then convert them back to int as a precaution to avoid follow-up errors.
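
The dtype promotion described in the second bullet can be reproduced with pandas alone. The following is a minimal sketch; the frames and the column name "cell_count" are made up for illustration and are not part of the report. It also shows pandas' nullable "Int64" dtype as one possible way to keep integers alongside missing values; whether that dtype round-trips through SpatialData's writer is not tested here.

```python
import pandas as pd

# Two hypothetical per-region tables; only the "cells" table has "cell_count".
cells = pd.DataFrame({"region": ["cells", "cells"], "cell_count": [3, 5]})
wells = pd.DataFrame({"region": ["well"]})

merged = pd.concat([cells, wells], ignore_index=True)
# The missing entry becomes NaN, so the integer column is promoted to float64.
promoted_dtype = str(merged["cell_count"].dtype)
print(promoted_dtype)

# pandas' nullable integer dtype keeps integer values and marks the gap as <NA>.
nullable = merged["cell_count"].astype("Int64")
nullable_dtype = str(nullable.dtype)
print(nullable_dtype)
```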

Example

from anndata import AnnData
from spatialdata import SpatialData
from spatialdata.models import TableModel
import numpy as np
import pandas as pd
import pytest
import shutil


@pytest.fixture
def sdata_with_nan_in_obs():
    table = TableModel.parse(
        AnnData(
            obs=pd.DataFrame(
                {
                    "region": ["region1", "region2"],
                    "instance": [0, 0],
                    "column_only_region1": ["string", np.nan],
                },
                index=pd.RangeIndex(0, 2).astype(str)
            ).astype({"region": "category"}),
        ),
        region_key="region",
        instance_key="instance",
        region=["region1", "region2"],
    )
    return SpatialData(table=table)


def test_sdata_with_nan_in_obs(sdata_with_nan_in_obs, tmp_path):
    sdata = sdata_with_nan_in_obs
    # Remove any pre-existing files so that the test can be repeated
    shutil.rmtree(tmp_path / "sdata_with_nan_in_obs.zarr", ignore_errors=True)
    # Raises "TypeError: expected unicode string, found nan"
    sdata.write(tmp_path / "sdata_with_nan_in_obs.zarr")

Backtrace

../../../../../spatialdata/src/spatialdata/_core/spatialdata.py:1052: in write
    raise e
../../../../../spatialdata/src/spatialdata/_core/spatialdata.py:1048: in write
    write_table(table=self.table, group=elem_group, name="table")
../../../../../spatialdata/src/spatialdata/_io/io_table.py:20: in write_table
    write_adata(group, name, table)  # creates group[name]
 …/site-packages/anndata/_io/specs/registry.py:353: in write_elem
    Writer(_REGISTRY).write_elem(store, k, elem, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/utils.py:248: in func_wrapper
    re_raise_error(e, elem, key)
 …/site-packages/anndata/_io/utils.py:246: in func_wrapper
    return func(*args, **kwargs)
 …/site-packages/anndata/_io/specs/registry.py:311: in write_elem
    return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/specs/registry.py:52: in wrapper
    result = func(g, k, *args, **kwargs)
 …/site-packages/anndata/_io/specs/methods.py:220: in write_anndata
    _writer.write_elem(g, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/utils.py:248: in func_wrapper
    re_raise_error(e, elem, key)
 …/site-packages/anndata/_io/utils.py:246: in func_wrapper
    return func(*args, **kwargs)
 …/site-packages/anndata/_io/specs/registry.py:311: in write_elem
    return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/specs/registry.py:52: in wrapper
    result = func(g, k, *args, **kwargs)
 …/site-packages/anndata/_io/specs/methods.py:579: in write_dataframe
    _writer.write_elem(
 …/site-packages/anndata/_io/utils.py:248: in func_wrapper
    re_raise_error(e, elem, key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

e = TypeError('expected unicode string, found nan')
elem = <zarr.hierarchy.Group '/table/table/obs'>, key = 'column_only_region1'

    def re_raise_error(e, elem, key):
        if "Above error raised while writing key" in format(e):
            raise
        else:
            parent = _get_parent(elem)
>           raise type(e)(
                f"{e}\n\n"
                f"Above error raised while writing key {key!r} of {type(elem)} "
                f"to {parent}"
            ) from e
E           TypeError: expected unicode string, found nan
E           
E           Above error raised while writing key 'column_only_region1' of <class 'zarr.hierarchy.Group'> to <zarr.storage.FSStore object at 0x7fba5b08cd30>

This is not directly caused by anndata. Interestingly, writing the table with anndata modifies it in place so that the column "column_only_region1" gets dtype "category" instead of object, after which the error no longer occurs:

def test_adata_with_nan_in_obs(sdata_with_nan_in_obs, tmp_path):
    adata = sdata_with_nan_in_obs.table
    # AnnData does not cause an error
    adata.write_zarr(tmp_path / "adata_with_nan_in_obs.zarr")
    # and modifies the table so that SpatialData does not raise the error anymore:
    shutil.rmtree(tmp_path / "sdata_with_nan_in_obs.zarr", ignore_errors=True)
    sdata_with_nan_in_obs.write(tmp_path / "sdata_with_nan_in_obs.zarr")
    # No error raised

@LucaMarconato
Member

Hi! Thanks for reporting, I have just commented on #298, where you had also mentioned the typing problem in the presence of nan values.

As you also observed, the reported bug arises as a consequence of the single-table design. @melonora is going to work on multiple tables while @giovp and I will work on disk and in-memory representation.

Therefore, if this is viable for you, I would suggest a workaround such as using a placeholder value instead of NaN so that the dtype is not affected. The plan is to have the table implemented (or in a good state) by the end of the year. If this does not work for you, please let me know and I can try to find a better fix for this bug.
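
One way the placeholder idea could look, as a rough sketch using pandas only. The helper name `fill_string_placeholders` and the sentinel "not_applicable" are hypothetical, chosen for illustration; they are not part of spatialdata or anndata.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel; pick a value that cannot collide with real annotations.
PLACEHOLDER = "not_applicable"


def fill_string_placeholders(obs: pd.DataFrame, placeholder: str = PLACEHOLDER) -> pd.DataFrame:
    """Return a copy of obs with missing values in object columns replaced."""
    result = obs.copy()
    for column in result.select_dtypes(include=[object]).columns:
        result[column] = result[column].fillna(placeholder)
    return result


obs = pd.DataFrame(
    {
        "instance": [0, 0],
        # Explicit object dtype, matching the mixed string / NaN column in the report.
        "column_only_region1": pd.Series(["string", np.nan], dtype=object),
    }
)
filled = fill_string_placeholders(obs)
print(filled["column_only_region1"].tolist())
```

The trade-off is that downstream code has to know the sentinel and treat it as "missing", which NaN would have signalled on its own.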

@aeisenbarth
Contributor Author

Thanks! I was considering looking into creating a simple fix, but since the tables are undergoing major changes, it's probably better not to interfere with that. I have a workaround, so for me the issue is no longer blocking.

For reference to anyone with the same issue:

def workaround_spatialdata_nan_in_str_columns(obs: pd.DataFrame):
    # When writing a SpatialData table (AnnData) which contains np.nan for missing/unspecified
    # values in string columns, it raises "expected unicode string, found nan"
    # See https://github.com/scverse/spatialdata/issues/399
    for column in obs.select_dtypes(include=[object]).columns:
        obs[column] = obs[column].astype("category")
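
A quick check of what the workaround does to the example column from this issue, using pandas only. The function is repeated so that the snippet runs on its own; the actual write to Zarr is not exercised here.

```python
import numpy as np
import pandas as pd


def workaround_spatialdata_nan_in_str_columns(obs: pd.DataFrame) -> None:
    # Same logic as above: convert object columns to categorical, in place.
    for column in obs.select_dtypes(include=[object]).columns:
        obs[column] = obs[column].astype("category")


obs = pd.DataFrame(
    {
        "instance": [0, 0],
        "column_only_region1": pd.Series(["string", np.nan], dtype=object),
    }
)
workaround_spatialdata_nan_in_str_columns(obs)

column = obs["column_only_region1"]
print(column.dtype)                 # category
print(list(column.cat.categories))  # ['string'] (NaN is not a category)
print(column.isna().tolist())       # [False, True] (the gap is still missing)
```

Note that the function mutates the DataFrame it is given, so it would need to be applied to the table's obs before writing.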

@LucaMarconato
Member

Thanks for sharing the workaround!
