feat: add support for HashBytes operation on multiple backends #8082
Comments
it will not be as fast as a native implementation (or a
[ins] In [1]: import ibis
[ins] In [2]: ibis.options.interactive = True
[ins] In [3]: t = ibis.examples.penguins.fetch()
[ins] In [4]: t.limit(3)
Out[4]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
[ins] In [5]: from hashlib import md5
[ins] In [6]: @ibis.udf.scalar.python
         ...: def md5_hash(s: str) -> str:
         ...:     return md5(s.encode()).hexdigest()
         ...:
[ins] In [7]: t = t.mutate(md5_hash=md5_hash(ibis._.species)).relocate("md5_hash")
[ins] In [8]: t.limit(3)
Out[8]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ md5_hash                         ┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string                           │ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├──────────────────────────────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
└──────────────────────────────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
Hey @pdgarden -- hashing functions are a bit of a mess depending on the backend in use. Are you looking for a function that returns a string (like a
Hi, thank you very much for your quick reply. I am looking for a function that would ideally return a string like
## Description of changes

This adds support for `ops.HashBytes` to `mssql` and also adds a test for that functionality so it's easier to port when we merge in the epic split branch.

I've also added a new op, `HashHexDigest`, which returns the hexdigest of various cryptographic hashing functions since I imagine this is what many users are _probably_ after. This newer op (and corresponding `hexdigest` method) can also support many more backends, as most of them default to returning the string hex digest and not the raw binary output.

I tried to be very accurate in the `notimpl` and `notyet` portions of both tests and I think I've done that.

For now, only exposing DuckDB, Pyspark, and MSSQL so we don't add a huge extra burden for the epic split but also address the user request in #8082.

And I guess now we can commence debate over the method name? 🐎

## Issues closed

Resolves #8082

---------

Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>
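To illustrate the distinction the description draws between raw binary output and a string hex digest, here is a minimal sketch that uses only Python's standard `hashlib`. It is not the ibis API or any backend's implementation; the helper names `md5_bytes` and `md5_hexdigest` are hypothetical and exist only for this example.

```python
from hashlib import md5


def md5_bytes(s: str) -> bytes:
    """Return the raw 16-byte MD5 digest (the binary form of the hash)."""
    return md5(s.encode()).digest()


def md5_hexdigest(s: str) -> str:
    """Return the same digest as a 32-character lowercase hex string."""
    return md5(s.encode()).hexdigest()


raw = md5_bytes("Adelie")
hexed = md5_hexdigest("Adelie")

# Both values encode the same 128-bit hash; only the representation differs.
print(len(raw), len(hexed))
print(raw.hex() == hexed)
```

The hex string is usually the more convenient form to compare or store as text, which is presumably why a string-returning operation was added alongside the binary one.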
Hey @pdgarden -- there's a new
That's great, thank you for your support and reactivity.
Is your feature request related to a problem?
I need to use an md5 function with several backends that I use in different environments (DuckDB, Pyspark, MSSQL), with Pyspark being the most essential one.
Describe the solution you'd like
Add support for the operation `ibis.expr.operations.generic.HashBytes` on the following backends: DuckDB, Pyspark, MSSQL
What version of ibis are you running?
7.2.0
What backend(s) are you using, if any?
DuckDB, Pyspark, MSSQL