[feature request] Query conditions have unexpected behavior with enum attributes #1880

bkmartinjr · 2023-12-15T06:49:53Z

I am in the process of converting a bunch of string attributes to use the new enum capability. I assumed the filter/query condition equality (== and in [...]) behavior would match that of the original string columns, but there is a significant change in behavior.

When a query condition on a string attribute fails to match, it returns an empty result. When a query condition is applied to an enum column, it raises an exception if the filter tests for equality with a string not present in the enum value list.

This effectively forces query condition users to know ahead of time what enum values are present, even if the end result of the operation is the same (an empty result).
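Until the behavior changes, the caller-side guard this implies looks roughly like the sketch below. It is illustrative only: `query_enum_equals`, `run_query`, and `empty_result` are hypothetical names rather than TileDB-Py API, and how the enum values are read from the array is left to the caller.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def query_enum_equals(
    enum_values: Iterable[str],
    value: str,
    run_query: Callable[[str], T],
    empty_result: T,
) -> T:
    """Hypothetical guard: skip the query when `value` is not a known enum value."""
    if value not in set(enum_values):
        # The filter could never match, so return the empty result directly
        # instead of submitting a query that would raise.
        return empty_result
    return run_query(value)


# Stand-in for the real query call; it records which values were queried.
calls = []


def fake_run_query(value: str) -> list:
    calls.append(value)
    return [value]


known = ["10x 3' v2", "10x 3' v3", "Smart-seq2"]
missing = query_enum_equals(known, "foobar", fake_run_query, [])
present = query_enum_equals(known, "Smart-seq2", fake_run_query, [])
```

The guard works, but every caller has to carry it, which is the burden described above.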

At least in the Pandas ecosystem, equality operations on categoricals act as if they are operating on the value, e.g., an == operation is supported against a string, regardless of whether the dataframe column is a categorical-of-string or just a plain string. Ordered string comparison operations, such as <, are not supported in this manner for unordered categories, just equality and related ops (e.g., in).
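A small pandas sketch of the behavior described above. The column contents are made up for illustration; only standard pandas calls are used.

```python
import pandas as pd

assays = ["10x 3' v2", "10x 3' v3", "Smart-seq2"]
plain = pd.Series(assays, dtype="object")
categorical = pd.Series(assays, dtype="category")  # unordered by default

# Equality against a string that is not a category yields an all-False mask,
# exactly as it does for the plain string column.
plain_mask = plain == "foobar"
cat_mask = categorical == "foobar"

# Membership tests behave the same way.
isin_mask = categorical.isin(["foobar", "Smart-seq2"])

# Ordered comparisons are rejected for unordered categoricals.
try:
    categorical < "Smart-seq2"
    ordered_comparison_allowed = True
except TypeError:
    ordered_comparison_allowed = False
```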

What I'd like to advocate for:

for unordered enums: support equality-like ops vs any string (e.g., "enum_col == 'foobar'"), returning an empty result if the string value is not in the defined enum values.
for ordered enums: also support lt, gt, etc. in the filters
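The requested semantics can be sketched with a toy model in plain Python. This is not how TileDB implements enumerations; `EnumColumn` and its methods are hypothetical, and raising on an ordered comparison against an unknown value is an assumption, since the request above does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class EnumColumn:
    """Toy stand-in for an enum attribute: integer codes indexing into `values`."""

    values: list[str]
    codes: list[int]
    ordered: bool = False

    def eq(self, value: str) -> list[int]:
        # Requested behavior: an unknown value matches no rows instead of raising.
        if value not in self.values:
            return []
        code = self.values.index(value)
        return [row for row, c in enumerate(self.codes) if c == code]

    def lt(self, value: str) -> list[int]:
        # Ordered comparisons only make sense when the enum declares an order.
        if not self.ordered:
            raise TypeError("ordered comparison requires an ordered enum")
        if value not in self.values:
            # Assumption: the issue does not say what should happen here.
            raise ValueError(f"{value!r} has no position in the enum order")
        code = self.values.index(value)
        return [row for row, c in enumerate(self.codes) if c < code]


assay = EnumColumn(values=["10x 3' v2", "10x 3' v3", "Smart-seq2"], codes=[0, 2, 1, 2])
no_rows = assay.eq("foobar")
smart_seq_rows = assay.eq("Smart-seq2")

level = EnumColumn(values=["low", "mid", "high"], codes=[2, 0, 1, 0], ordered=True)
below_high = level.lt("high")
```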

Example of the unordered case:

In [16]: obs.schema
Out[16]: 
ArraySchema(
  domain=Domain(*[
    Dim(name='soma_joinid', domain=(0, 2147483646), tile=2048, dtype='int64', filters=FilterList([DoubleDeltaFilter(reinterp_dtype=None), ZstdFilter(level=19), ])),
  ]),
  attrs=[
    Attr(name='assay', dtype='int8', var=False, nullable=False, enum_label='assay', filters=FilterList([ZstdFilter(level=19), ])),
    Attr(name='observation_joinid', dtype='<U0', var=True, nullable=False, enum_label=None, filters=FilterList([ZstdFilter(level=19), ])),
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=65536,
  sparse=True,
  allows_duplicates=True,
)

In [17]: obs.enum('assay')
Out[17]: Enumeration(name='assay', cell_val_num=4294967295, ordered=False, values=["10x 3' v2", "10x 3' v3", "10x 5' v1", 'BD Rhapsody Targeted mRNA', 'Smart-seq2'])

In [18]: obs.query(cond="assay == 'foobar'").df[:]
---------------------------------------------------------------------------
TileDBError                               Traceback (most recent call last)
Cell In[18], line 1
----> 1 obs.query(cond="assay == 'foobar'").df[:]

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledb/multirange_indexing.py:256, in _BaseIndexer.__getitem__(self, idx)
    254     self.subarray = Subarray(self.array)
    255     self._set_ranges(idx)
--> 256 return self if self.return_incomplete else self._run_query()

File ~/cellxgene-census/venv-builder/lib/python3.10/site-packages/tiledb/multirange_indexing.py:399, in DataFrameIndexer._run_query(self)
    396 import pyarrow
    398 if self.pyquery is not None:
--> 399     self.pyquery.submit()
    401 if self.pyquery is None:
    402     df = pandas.DataFrame(self._empty_results)

TileDBError: TileDB internal: Enumeration value not found for field 'assay'

In [19]: obs.query(cond="observation_joinid == 'foobar'").df[:]
Out[19]: 
Empty DataFrame
Columns: [soma_joinid, dataset_id, assay, assay_ontology_term_id, cell_type, cell_type_ontology_term_id, development_stage, development_stage_ontology_term_id, disease, disease_ontology_term_id, donor_id, is_primary_data, observation_joinid, self_reported_ethnicity, self_reported_ethnicity_ontology_term_id, sex, sex_ontology_term_id, suspension_type, tissue, tissue_ontology_term_id, tissue_type, tissue_general, tissue_general_ontology_term_id, raw_sum, nnz, raw_mean_nnz, raw_variance_nnz, n_measured_vars]
Index: []

This is on TileDB-Py version 0.24.0, Linux.

bkmartinjr mentioned this issue Dec 15, 2023

[python/r] DataFrame: value filter on enum/dict column generates internal error when sought value not in enumeration single-cell-data/TileDB-SOMA#1988

Closed

johnkerl changed the title ~~[feature request] query conditions have unexpected behavior with enum attributes~~ [feature request] Query conditions have unexpected behavior with enum attributes Dec 15, 2023

johnkerl assigned nguyenv and johnkerl Dec 15, 2023

johnkerl added the bug label Dec 15, 2023

nguyenv mentioned this issue Dec 15, 2023

[2.21.0] Add tests to ensure empty result on query condition for invalid enum #1882

Merged

nguyenv linked a pull request Dec 18, 2023 that will close this issue

[2.21.0] Add tests to ensure empty result on query condition for invalid enum #1882

Merged

This was referenced Dec 18, 2023

[builder] port to use enums in schema chanzuckerberg/cellxgene-census#896

Merged

add enumerated/categorical support chanzuckerberg/cellxgene-census#604

Closed

nguyenv closed this as completed in #1882 Mar 21, 2024
