TypeError when writing table column with mixed string / NaN #399

aeisenbarth · 2023-11-09T14:32:47Z

When writing a SpatialData object whose table has a column containing both string values and empty values (NaN), a TypeError is raised.

Empty values are a very common use-case:

When two regions have fundamentally different annotations, there is no possible value to fill in for the other region. This is often represented as None or NaN. For example region "cells" could have column "cell_diameter" but region "well" cannot have "cell_diameter".
When concatenating tables where some columns exist in only one of the source tables. Due to the single-table design, SpatialData needs to fill up these columns, which it does with NaN.
This can also be a huge problem for integer columns, which have no NaN. Pandas silently changes the dtype to float. Then whoever wants to read values from the supposed integer column must convert them to int as a precaution to avoid follow-up errors.
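
The dtype change described above can be reproduced with pandas alone. The following is a minimal sketch, not taken from SpatialData; the tables and the column name "cell_count" are made up for illustration. It also shows pandas' nullable "Int64" dtype as one possible way to keep integer semantics, although whether the table writer supports that dtype is not verified here.

```python
import pandas as pd

# Two hypothetical per-region tables; only "cells" has the integer column.
cells = pd.DataFrame({"region": ["cells", "cells"], "cell_count": [3, 5]})
wells = pd.DataFrame({"region": ["well"]})

# Concatenation fills the missing entry with NaN, so pandas upcasts
# the column from int64 to float64 without any warning.
merged = pd.concat([cells, wells], ignore_index=True)
print(merged["cell_count"].dtype)

# Opting in to the nullable integer dtype keeps the values as integers
# and represents the missing entry as pd.NA.
nullable = merged["cell_count"].astype("Int64")
print(nullable.dtype)
```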

Example

from anndata import AnnData
from spatialdata import SpatialData
from spatialdata.models import TableModel
import numpy as np
import pandas as pd
import pytest
import shutil


@pytest.fixture
def sdata_with_nan_in_obs():
    table = TableModel.parse(
        AnnData(
            obs=pd.DataFrame(
                {
                    "region": ["region1", "region2"],
                    "instance": [0, 0],
                    "column_only_region1": ["string", np.nan],
                },
                index=pd.RangeIndex(0, 2).astype(str)
            ).astype({"region": "category"}),
        ),
        region_key="region",
        instance_key="instance",
        region=["region1", "region2"],
    )
    return SpatialData(table=table)


def test_sdata_with_nan_in_obs(sdata_with_nan_in_obs, tmp_path):
    sdata = sdata_with_nan_in_obs
    # Remove any pre-existing files so that the test can be repeated
    shutil.rmtree(tmp_path / "sdata_with_nan_in_obs.zarr", ignore_errors=True)
    # Raises "TypeError: expected unicode string, found nan"
    sdata.write(tmp_path / "sdata_with_nan_in_obs.zarr")

Backtrace

../../../../../spatialdata/src/spatialdata/_core/spatialdata.py:1052: in write
    raise e
../../../../../spatialdata/src/spatialdata/_core/spatialdata.py:1048: in write
    write_table(table=self.table, group=elem_group, name="table")
../../../../../spatialdata/src/spatialdata/_io/io_table.py:20: in write_table
    write_adata(group, name, table)  # creates group[name]
 …/site-packages/anndata/_io/specs/registry.py:353: in write_elem
    Writer(_REGISTRY).write_elem(store, k, elem, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/utils.py:248: in func_wrapper
    re_raise_error(e, elem, key)
 …/site-packages/anndata/_io/utils.py:246: in func_wrapper
    return func(*args, **kwargs)
 …/site-packages/anndata/_io/specs/registry.py:311: in write_elem
    return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/specs/registry.py:52: in wrapper
    result = func(g, k, *args, **kwargs)
 …/site-packages/anndata/_io/specs/methods.py:220: in write_anndata
    _writer.write_elem(g, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/utils.py:248: in func_wrapper
    re_raise_error(e, elem, key)
 …/site-packages/anndata/_io/utils.py:246: in func_wrapper
    return func(*args, **kwargs)
 …/site-packages/anndata/_io/specs/registry.py:311: in write_elem
    return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)
 …/site-packages/anndata/_io/specs/registry.py:52: in wrapper
    result = func(g, k, *args, **kwargs)
 …/site-packages/anndata/_io/specs/methods.py:579: in write_dataframe
    _writer.write_elem(
 …/site-packages/anndata/_io/utils.py:248: in func_wrapper
    re_raise_error(e, elem, key)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

e = TypeError('expected unicode string, found nan')
elem = <zarr.hierarchy.Group '/table/table/obs'>, key = 'column_only_region1'

    def re_raise_error(e, elem, key):
        if "Above error raised while writing key" in format(e):
            raise
        else:
            parent = _get_parent(elem)
>           raise type(e)(
                f"{e}\n\n"
                f"Above error raised while writing key {key!r} of {type(elem)} "
                f"to {parent}"
            ) from e
E           TypeError: expected unicode string, found nan
E           
E           Above error raised while writing key 'column_only_region1' of <class 'zarr.hierarchy.Group'> to <zarr.storage.FSStore object at 0x7fba5b08cd30>

This is not directly caused by anndata: writing the same table with anndata alone succeeds. Interestingly, anndata modifies the table while writing, so that the column "column_only_region1" gets dtype "category" instead of object, and the error no longer occurs in a subsequent SpatialData write:

def test_adata_with_nan_in_obs(sdata_with_nan_in_obs, tmp_path):
    adata = sdata_with_nan_in_obs.table
    # AnnData does not cause an error
    adata.write_zarr(tmp_path / "adata_with_nan_in_obs.zarr")
    # and modifies the table so that SpatialData does not raise the error anymore:
    shutil.rmtree(tmp_path / "sdata_with_nan_in_obs.zarr", ignore_errors=True)
    sdata_with_nan_in_obs.write(tmp_path / "sdata_with_nan_in_obs.zarr")
    # No error raised


LucaMarconato · 2023-11-09T17:18:29Z

Hi! Thanks for reporting, I have just commented on #298, where you had also mentioned the typing problem in the presence of nan values.

As you also observed, the reported bug arises as a consequence of the single-table design. @melonora is going to work on multiple tables while @giovp and I will work on disk and in-memory representation.

Therefore, if this is viable for you, I would maybe suggest a workaround like using a placeholder value for "NaN" values so that the type is not affected. The plan is to have the table implemented (or in a good state) by the end of the year. If this would not work for you, please let me know and I can try to find a better fix for this bug.
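
A minimal sketch of the placeholder idea, assuming a sentinel string is acceptable for the data at hand. The name PLACEHOLDER and its value are hypothetical and not a spatialdata convention.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel; pick a value that cannot occur as real data.
PLACEHOLDER = "not_applicable"

obs = pd.DataFrame(
    {
        "region": ["region1", "region2"],
        "column_only_region1": ["string", np.nan],
    }
)

# Before writing: replace missing entries so the column holds only strings.
obs["column_only_region1"] = obs["column_only_region1"].fillna(PLACEHOLDER)

# After reading: map the sentinel back to a missing value if needed.
restored = obs["column_only_region1"].mask(
    obs["column_only_region1"] == PLACEHOLDER
)
```

The drawback is that every consumer of the table has to know about the sentinel.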

aeisenbarth · 2023-11-09T18:07:14Z

Thanks! I was considering looking into creating a simple fix, but since the tables are undergoing major changes, it's probably better not to interfere with that. I have a workaround, so for me the issue is solved for now.

For reference to anyone with the same issue:

def workaround_spatialdata_nan_in_str_columns(obs: pd.DataFrame):
    # When writing a SpatialData table (AnnData) which contains np.nan for missing/unspecified
    # values in string columns, it raises "expected unicode string, found nan"
    # See https://github.com/scverse/spatialdata/issues/399
    for column in obs.select_dtypes(include=[object]).columns:
        obs[column] = obs[column].astype("category")
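
The function above modifies the DataFrame in place. If the original table should stay untouched, a non-mutating variant could look like the sketch below. The function name is made up for illustration, and the example only checks the dtype conversion in pandas, not the SpatialData write itself.

```python
import numpy as np
import pandas as pd


def object_columns_as_category(obs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `obs` with all object-dtype columns cast to "category"."""
    object_columns = obs.select_dtypes(include=[object]).columns
    # DataFrame.astype returns a new DataFrame; the input is left unchanged.
    return obs.astype({column: "category" for column in object_columns})


obs = pd.DataFrame(
    {
        "instance": [0, 0],
        # Explicit object dtype to mirror the mixed string / NaN column.
        "column_only_region1": pd.Series(["string", np.nan], dtype=object),
    }
)
converted = object_columns_as_category(obs)

# The missing value is preserved as NaN; it does not become a category.
print(converted["column_only_region1"].dtype)
print(converted["column_only_region1"].isna().tolist())
```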

LucaMarconato · 2023-11-09T18:26:08Z

Thanks for sharing the workaround!

LucaMarconato closed this as completed Nov 13, 2023

LucaMarconato reopened this Nov 13, 2023
