feat: add support for HashBytes operation on multiple backends #8082

pdgarden · 2024-01-23T23:02:25Z

Is your feature request related to a problem?

I need an MD5 hashing function that works across the backends I use in different environments (DuckDB, Pyspark, MSSQL), with Pyspark being the most important one.

Describe the solution you'd like

Add support for the ibis.expr.operations.generic.HashBytes operation on the following backends: DuckDB, Pyspark, MSSQL

What version of ibis are you running?

7.2.0

What backend(s) are you using, if any?

DuckDB, Pyspark, MSSQL

Code of Conduct

I agree to follow this project's Code of Conduct

lostmygithubaccount · 2024-01-25T00:31:09Z

It will not be as fast as a native implementation (or as a pyarrow UDF rather than a Python one), but you can accomplish this now with a UDF:

[ins] In [1]: import ibis

[ins] In [2]: ibis.options.interactive = True

[ins] In [3]: t = ibis.examples.penguins.fetch()

[ins] In [4]: t.limit(3)
Out[4]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

[ins] In [5]: from hashlib import md5

[ins] In [6]: @ibis.udf.scalar.python
         ...: def md5_hash(s: str) -> str:
         ...:     return md5(s.encode()).hexdigest()
         ...:

[ins] In [7]: t = t.mutate(md5_hash=md5_hash(ibis._.species)).relocate("md5_hash")

[ins] In [8]: t.limit(3)
Out[8]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ md5_hash                         ┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string                           │ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├──────────────────────────────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
└──────────────────────────────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

gforsyth · 2024-01-25T21:11:06Z

Hey @pdgarden -- hashing functions are a bit of a mess depending on the backend in use.

Are you looking for a function that returns a string (like a hexdigest) or bytes (like a digest)?
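To make the distinction concrete, here is a minimal sketch using only Python's standard `hashlib`; it is not Ibis code, and the helper name `md5_digests` is made up for illustration. A digest is the raw 16 bytes of the MD5 hash, while a hexdigest is the same value rendered as a 32-character hexadecimal string.

```python
from hashlib import md5


def md5_digests(value: str) -> tuple[bytes, str]:
    """Return the raw MD5 digest and its hex-string form for a UTF-8 string.

    Hypothetical helper for illustration only; not part of Ibis.
    """
    hasher = md5(value.encode("utf-8"))
    return hasher.digest(), hasher.hexdigest()


raw, hexed = md5_digests("Adelie")

print(type(raw).__name__, len(raw))      # bytes 16
print(type(hexed).__name__, len(hexed))  # str 32
print(raw.hex() == hexed)                # True: same hash, two representations
```

Backends differ in which of these two forms their native hash functions return, which is why the answer matters for the API.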

pdgarden · 2024-01-29T04:30:13Z

Hi,

Thank you very much for your quick reply. Ideally, I am looking for a function that returns a string, like a hexdigest.

## Description of changes

This adds support for `ops.HashBytes` to `mssql` and also adds a test for that functionality so it's easier to port when we merge in the epic split branch.

I've also added a new op, `HashHexDigest` which returns the hexdigest of various cryptographic hashing functions since I imagine this is what many users are _probably_ after. This newer op (and corresponding `hexdigest` method) can also support many more backends, as most of them default to returning the string hex digest and not the raw binary output.

I tried to be very accurate in the `notimpl` and `notyet` portions of both tests and I think I've done that.

For now, only exposing DuckDB, Pyspark, and MSSQL so we don't add a huge extra burden for the epic split but also address the user request in #8082

And I guess now we can commence debate over the method name? 🐎

## Issues closed

Resolves #8082

---------

Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>

gforsyth · 2024-01-29T18:13:33Z

Hey @pdgarden -- there's a new hexdigest method that was just merged in. It will be available in the next release (sometime next week) and you can use it to get the hexdigest of the input columns. It is currently supported on DuckDB, mssql, and pyspark, with support for more backends coming soon.

pdgarden · 2024-01-30T05:41:55Z

That's great, thank you for your support and responsiveness.

pdgarden added the feature Features or general enhancements label Jan 23, 2024

github-project-automation bot added this to Ibis planning and roadmap Jan 23, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Jan 23, 2024

gforsyth mentioned this issue Jan 26, 2024

feat(mssql): add hashbytes and test for binary output hash fns #8107

Merged

cpcloud closed this as completed in #8107 Jan 29, 2024

github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jan 29, 2024
