Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(datatypes): manually cast the type of pos to int16 for table.info() #9139

Merged
merged 9 commits into from
Jun 5, 2024

Conversation

jitingxu1
Copy link
Contributor

Description of changes

when I was working on a data science project, it has more than 128 columns in the table, when I run table.info() it throws the error

RelationError                             Traceback (most recent call last)
Cell In[95], line 1
----> 1 static[cols].info()

File /opt/conda/lib/python3.10/site-packages/ibis/expr/types/relations.py:2829, in Table.info(self)
   2817     agg = self.select(
   2818         isna=ibis.case().when(col.isnull(), 1).else_(0).end()
   2819     ).agg(
   (...)
   2826         pos=lit(pos),
   2827     )
   2828     aggs.append(agg)
-> 2829 return ibis.union(*aggs).order_by(ibis.asc("pos"))

File /opt/conda/lib/python3.10/site-packages/ibis/expr/api.py:1948, in union(table, distinct, *rest)
   1882 def union(table: ir.Table, *rest: ir.Table, distinct: bool = False) -> ir.Table:
   1883     """Compute the set union of multiple table expressions.
   1884 
   1885     The input tables must have identical schemas.
   (...)
   1946 
   1947     """
-> 1948     return table.union(*rest, distinct=distinct) if rest else table

File /opt/conda/lib/python3.10/site-packages/ibis/expr/types/relations.py:1751, in Table.union(self, table, distinct, *rest)
   1749 node = ops.Union(self, table, distinct=distinct)
   1750 for table in rest:
-> 1751     node = ops.Union(node, table, distinct=distinct)
   1752 return node.to_expr().select(self.columns)

File /opt/conda/lib/python3.10/site-packages/ibis/common/bases.py:72, in AbstractMeta.__call__(cls, *args, **kwargs)
     52 def __call__(cls, *args, **kwargs):
     53     """Create a new instance of the class.
     54 
     55     The subclass may override the `__create__` classmethod to change the
   (...)
     70 
     71     """
---> 72     return cls.__create__(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/ibis/common/grounds.py:120, in Annotable.__create__(cls, *args, **kwargs)
    116 @classmethod
    117 def __create__(cls, *args: Any, **kwargs: Any) -> Self:
    118     # construct the instance by passing only validated keyword arguments
    119     kwargs = cls.__signature__.validate(cls, args, kwargs)
--> 120     return super().__create__(**kwargs)

File /opt/conda/lib/python3.10/site-packages/ibis/expr/operations/relations.py:293, in Set.__init__(self, left, right, **kwargs)
    290 def __init__(self, left, right, **kwargs):
    291     # convert to dictionary first, to get key-unordered comparison semantics
    292     if dict(left.schema) != dict(right.schema):
--> 293         raise RelationError("Table schemas must be equal for set operations")
    294     elif left.schema.names != right.schema.names:
    295         # rewrite so that both sides have the columns in the same order making it
    296         # easier for the backends to implement set operations
    297         cols = {name: Field(right, name) for name in left.schema.names}

RelationError: Table schemas must be equal for set operations

Here is the code could be used to reproduce the error,

 import ibis
   ...: ibis.options.interactive = True
   ...: df = {
   ...:     f"col_{i}": [1, 0] for i in range(129)
   ...: }
   ...: print(len(df.keys()))
   ...:  d = ibis.memtable(df)
   ...: d.info()

The schemas when union are different, The type of pos is int8 for the first 128 columns, it will be int16 after that. I cast all into int16 which is large enough for lots of columns.

dict(left.schema) = {
'name': String(nullable=True), 
'type': String(nullable=True), 
'nullable': Boolean(nullable=True),
 'nulls': Int64(nullable=True), 
'non_nulls': Int64(nullable=True), 
'null_frac': Float64(nullable=True), 
'pos': Int8(nullable=True)
}

dict(right.schema) = {
'name': String(nullable=True), 
'type': String(nullable=True), 
'nullable': Boolean(nullable=True), 
'nulls': Int64(nullable=True), 
'non_nulls': Int64(nullable=True), 
'null_frac': Float64(nullable=True), 
###### int16
'pos': Int16(nullable=True)
}

Issues closed

Copy link
Contributor

github-actions bot commented May 7, 2024

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

@jitingxu1 jitingxu1 requested a review from cpcloud May 7, 2024 01:41
@jitingxu1 jitingxu1 changed the title fix: Manually casting the pos column to Int16 fix: manually casting the type of pos to Int16 for table.info() May 7, 2024
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR needs a test. Something like:

  1. Create a table with > 127 columns
  2. Call the info() method
  3. Make some kind of assertion

@jitingxu1
Copy link
Contributor Author

jitingxu1 commented May 9, 2024

This PR needs a test. Something like:

  1. Create a table with > 127 columns
  2. Call the info() method
  3. Make some kind of assertion

tried to add test like this, the code could work when num_cols is small.

@pytest.mark.notyet(
    ["druid"],
    raises=PyDruidProgrammingError,
    reason="Druid only supports trivial unions",
)
def test_table_info_wider(con, backend):
    num_cols = 129
    col_names = [f"col_{i}"for i in range(num_cols)]
    t = ibis.memtable(
        {col: [0, 1] for col in col_names}
    )
    expr = t.info()
    result = expr.execute()
    assert list(result.name) == col_names

but got the RecursionError:when I have 129 columns, not sure if I create the table correctly, I did tried con.create_table() and ibis.memtable()
Do you have an idea on this?

Here is the log

        expr = t.info()
>       result = expr.execute()

ibis/backends/tests/test_generic.py:633: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ibis/expr/types/core.py:393: in execute
    return self._find_backend(use_default=True).execute(
ibis/backends/duckdb/__init__.py:1355: in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
ibis/backends/duckdb/__init__.py:1288: in _to_duckdb_relation
    sql = self.compile(table_expr, limit=limit, params=params)
ibis/backends/sql/__init__.py:178: in compile
    sql = query.sql(dialect=self.dialect, pretty=pretty, copy=False)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/expressions.py:561: in sql
    return Dialect.get_or_raise(dialect).generate(self, **opts)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/dialects/dialect.py:498: in generate
    return self.generator(**opts).generate(expression, copy=copy)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:583: in generate
    sql = self.sql(expression).strip()
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2230: in select_sql
    self.sql(expression, "from", comment=False),
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2316: in union_sql
    return self.set_operations(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2309: in set_operations
    sqls.append(self.sql(node))
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:2230: in select_sql
    self.sql(expression, "from", comment=False),
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:728: in sql
    return self.sql(value)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:739: in sql
    sql = getattr(self, exp_handler_name)(expression)
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:1800: in from_sql
    return f"{self.seg('FROM')} {self.sql(expression, 'this')}"
/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:728: in sql
    return self.sql(value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <sqlglot.dialects.duckdb.DuckDB.Generator object at 0x154de9910>
expression = Subquery(
  this=Union(
    this=Select(
      expressions=[
        Star()],
      from=From(
        this=Subquery(
...          alias=Identifier(this=t203, quoted=True)))),
    distinct=False),
  alias=Identifier(this=t313, quoted=True))
key = None, comment = True

    def sql(
        self,
        expression: t.Optional[str | exp.Expression],
        key: t.Optional[str] = None,
        comment: bool = True,
    ) -> str:
        if not expression:
            return ""
    
        if isinstance(expression, str):
            return expression
    
        if key:
            value = expression.args.get(key)
            if value:
                return self.sql(value)
            return ""
    
        transform = self.TRANSFORMS.get(expression.__class__)
    
>       if callable(transform):
E       RecursionError: maximum recursion depth exceeded while calling a Python object

/Users/claypot/miniconda3/envs/ibis-dev-arm64/lib/python3.11/site-packages/sqlglot/generator.py:733: RecursionError
!!! Recursion error detected, but an error occurred locating the origin of recursion.
  The following exception happened when comparing locals in the stack frame:
    RecursionError: maximum recursion depth exceeded
  Displaying first and last 10 stack frames out of 964.
========================================================================================================== short test summary info ===========================================================================================================
FAILED ibis/backends/tests/test_generic.py::test_table_info_wider[duckdb] - RecursionError: maximum recursion depth exceeded while calling a Python object

@jitingxu1 jitingxu1 requested review from gforsyth and cpcloud May 31, 2024 21:39
@jitingxu1
Copy link
Contributor Author

I was waiting for the #9194

Now added the unit test, ping @cpcloud for review.

@cpcloud
Copy link
Member

cpcloud commented Jun 1, 2024

Seems like flink can't handle the new test. Can you mark it as notyet or broken?

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are still failing for Flink.

@jitingxu1
Copy link
Contributor Author

Tests are still failing for Flink.

Could we re run the test? The logs:

Caused by: java.io.IOException: Insufficient number of network buffers: required 512, but only 348 available. The total number of network buffers is currently set to 20316 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max'.

@cpcloud
Copy link
Member

cpcloud commented Jun 3, 2024

Not really, because it's a repeatable failure.

It's almost certainly related to the "large" number of columns (not actually that large).

Can you instead xfail the test for now?

@jitingxu1 jitingxu1 requested a review from cpcloud June 3, 2024 23:55
@cpcloud cpcloud added this to the 9.1 milestone Jun 4, 2024
@cpcloud cpcloud added the bug Incorrect behavior inside of ibis label Jun 4, 2024
@cpcloud cpcloud changed the title fix: manually casting the type of pos to Int16 for table.info() fix(datatypes): manually cast the type of pos to int16 for table.info() Jun 4, 2024
@cpcloud cpcloud changed the title fix(datatypes): manually cast the type of pos to int16 for table.info() fix(datatypes): manually cast the type of pos to int16 for table.info() Jun 4, 2024
@cpcloud cpcloud added the datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`) label Jun 4, 2024
@cpcloud cpcloud merged commit 9eb1ed1 into ibis-project:main Jun 5, 2024
81 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants