-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: execution of struct literals in the pandas and dask backends returns the wrong value #8687
Comments
hmmm, actually maybe this is just a repr thing?? EDIT: while the repr is confusing, the actual result from .execute() is also wrong. |
fixes ibis-project#8289 This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in some cases. Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup. You can test this locally with eg `pytest -m duckdb -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py` Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns Also, fix executing NULL arrays on pandas. Also, fix casting structs on pandas. See ibis-project#8687 Also, support passing in None. Also, error when the value type can't be inferred from empty python literals (eg what is the value type for the elements of []?) Also, make the type argument for struct() always have an effect, not just when passing in python literals. So basically it can act like a cast. Also, make these constructors idempotent.
I added a workaround at https://github.com/ibis-project/ibis/pull/8666/files#diff-7760dac05d92d4bd73e3b729a90b7002cc3920f03abd1dac5fc7ec18154f9fb2R87-R93. Would love someone's help with debugging and then fixing this at the root of the problem, it's too deep in ibis for me. |
Yeah, this is just a repr thing - if you make a custom |
Ah, not specific enough, in my first examples its just a repr thing. But in the 2nd example above with the .execute(), it is a real problem that needs to get fixed in ibis. |
Indeed, and PRs welcome. The pandas and dask backends are very low priority right now. Are you using them for something in particular? |
fixes ibis-project#8289 This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in some cases. Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup. You can test this locally with eg `pytest -m duckdb -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py` Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns Also, fix executing NULL arrays on pandas. Also, fix casting structs on pandas. See ibis-project#8687 Also, support passing in None. Also, error when the value type can't be inferred from empty python literals (eg what is the value type for the elements of []?) Also, make the type argument for struct() always have an effect, not just when passing in python literals. So basically it can act like a cast. Also, make these constructors idempotent.
@cpcloud I see you changed the issue title to "subset of the struct". But just to make sure you see the difference, the current output is |
No not at all. I just thought it would be better to get the the tests passing instead of marking them as xfail. If we want to punt on this, then I would think it would be better to just error on this cast, instead of what we do now which is silently return the wrong result. |
This was motivated to work around pandas not repr'ing `frozendict` elements properly (seen in #8687), but while poking at that I found: - The existing code was unnecessarily nested (the actual dict was boxed in a `MappingProxyType` which was boxed in a `FrozenDict` - we can do better by just using storing the data in the `FrozenDict` itself). - There was a bug in the hash implementation where the order mattered, meaning that `hash(frozendict(a=1, b=2)) != hash(frozendict(b=2, a=1))`. This has since been fixed. Fixes #8687. --------- Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
fixes ibis-project#8289 This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in some cases. The big structural change is that now the core Operations for Array and Structs have a different internal representation, so they can distringuish between - the entire value is NULL - the contained values are NULL Before, ops.Array held onto a `VarTuple[Value]`. So the contained Values could be NULL, but there was no way to say the entire thing was null. Now, ops.Array stores a `None | VarTuple[Value]`. The same thing for ops.StructValue. ops.Map didn't suffer from this, because it stores a `ops.Array`s internally, so since `ops.Array` can distinguish between entirely-NULL and contains-NULL, so can ops.Map A fallout of this is that ops.Array needs a way to explicitly store its dtype. Before, it derived its dtype based on the dtype of its args. But now that `None` is a valid value, it is now possible for there to be no values to inspect! So the Op actually stores its dtype explicitly. If you pass in values, then supplying the dtype on construction is optional, we go back to the old behavior of deriving it from the inputs. This requires the backend compilers to now deal with that case. Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup. You can test this locally with eg `pytest -m <backend> -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py` Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns Also, fix executing NULL arrays on pandas. Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes. Also, fix casting structs on pandas. See ibis-project#8687 Also, support passing in None to all these constructors. Also, error when the value type can't be inferred from empty python literals (eg what is the value type for the elements of []?) Also, make the type argument for struct() always have an effect, not just when passing in python literals. So basically it can act like a cast. Also, make these constructors idempotent.
fixes ibis-project#8289 This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in some cases. The big structural change is that now the core Operations for Array and Structs have a different internal representation, so they can distringuish between - the entire value is NULL - the contained values are NULL Before, ops.Array held onto a `VarTuple[Value]`. So the contained Values could be NULL, but there was no way to say the entire thing was null. Now, ops.Array stores a `None | VarTuple[Value]`. The same thing for ops.StructValue. ops.Map didn't suffer from this, because it stores a `ops.Array`s internally, so since `ops.Array` can distinguish between entirely-NULL and contains-NULL, so can ops.Map A fallout of this is that ops.Array needs a way to explicitly store its dtype. Before, it derived its dtype based on the dtype of its args. But now that `None` is a valid value, it is now possible for there to be no values to inspect! So the Op actually stores its dtype explicitly. If you pass in values, then supplying the dtype on construction is optional, we go back to the old behavior of deriving it from the inputs. This requires the backend compilers to now deal with that case. Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup. You can test this locally with eg `pytest -m <backend> -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py` Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns Also, fix executing NULL arrays on pandas. Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes. Also, fix casting structs on pandas. See ibis-project#8687 Also, support passing in None to all these constructors. Also, error when the value type can't be inferred from empty python literals (eg what is the value type for the elements of []?) Also, make the type argument for struct() always have an effect, not just when passing in python literals. So basically it can act like a cast. Also, make these constructors idempotent.
fixes ibis-project#8289 This does a lot of changes. It was hard for me to separate them out as I implemented them. But now that it's all hashed out, I can try to split this up into separate commits if you want. But that might be sorta hard in some cases. The big structural change is that now the core Operations for Array and Structs have a different internal representation, so they can distringuish between - the entire value is NULL - the contained values are NULL Before, ops.Array held onto a `VarTuple[Value]`. So the contained Values could be NULL, but there was no way to say the entire thing was null. Now, ops.Array stores a `None | VarTuple[Value]`. The same thing for ops.StructValue. ops.Map didn't suffer from this, because it stores a `ops.Array`s internally, so since `ops.Array` can distinguish between entirely-NULL and contains-NULL, so can ops.Map A fallout of this is that ops.Array needs a way to explicitly store its dtype. Before, it derived its dtype based on the dtype of its args. But now that `None` is a valid value, it is now possible for there to be no values to inspect! So the Op actually stores its dtype explicitly. If you pass in values, then supplying the dtype on construction is optional, we go back to the old behavior of deriving it from the inputs. This requires the backend compilers to now deal with that case. Several of the backends were always broken here, they just weren't getting caught. I marked them as broken, we can fix them in a followup. You can test this locally with eg `pytest -m <backend> -k factory ibis/backends/tests/test_array.py ibis/backends/tests/test_map.py ibis/backends/tests/test_struct.py` Also, fix a typing bug: map() can accept ArrayValues, not just ArrayColumns Also, fix executing NULL arrays on pandas. Also, fixing converting dtypes on clickhouse, Structs should be converted to nonnullable dtypes. Also, fix casting structs on pandas. See ibis-project#8687 Also, support passing in None to all these constructors. Also, error when the value type can't be inferred from empty python literals (eg what is the value type for the elements of []?) Also, make the type argument for struct() always have an effect, not just when passing in python literals. So basically it can act like a cast. Also, make these constructors idempotent.
What happened?
This is causing casting bugs in the pandas and dask backends:
found this in #8666
I think we want to make it so our builtin frozendict looks exactly like a dict to pandas. Then we get the right behavior here
What version of ibis are you using?
main
What backend(s) are you using, if any?
No response
Relevant log output
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: