feat: add support for HashBytes operation on multiple backends #8082

Closed
1 task done
pdgarden opened this issue Jan 23, 2024 · 5 comments · Fixed by #8107
Labels
feature Features or general enhancements

Comments

@pdgarden

Is your feature request related to a problem?

I need an MD5 function that works across the several backends I use in different environments (DuckDB, PySpark, MSSQL), with PySpark being the most essential one.

Describe the solution you'd like

Add support for the operation `ibis.expr.operations.generic.HashBytes` on the following backends: DuckDB, PySpark, MSSQL

What version of ibis are you running?

7.2.0

What backend(s) are you using, if any?

DuckDB, Pyspark, MSSQL

Code of Conduct

  • I agree to follow this project's Code of Conduct

@pdgarden added the feature (Features or general enhancements) label Jan 23, 2024

@lostmygithubaccount
Member

It will not be as fast as a native implementation (or as a PyArrow UDF in place of a Python one), but you can accomplish this now with a UDF:

[ins] In [1]: import ibis

[ins] In [2]: ibis.options.interactive = True

[ins] In [3]: t = ibis.examples.penguins.fetch()

[ins] In [4]: t.limit(3)
Out[4]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

[ins] In [5]: from hashlib import md5

[ins] In [6]: @ibis.udf.scalar.python
         ...: def md5_hash(s: str) -> str:
         ...:     return md5(s.encode()).hexdigest()
         ...:

[ins] In [7]: t = t.mutate(md5_hash=md5_hash(ibis._.species)).relocate("md5_hash")

[ins] In [8]: t.limit(3)
Out[8]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ md5_hash                         ┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string                           │ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├──────────────────────────────────┼─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ aee095557d7c4c311ffb9718b791ad18 │ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
└──────────────────────────────────┴─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

@gforsyth
Member

Hey @pdgarden -- hashing functions are a bit of a mess depending on the backend in use.

Are you looking for a function that returns a string (like a hexdigest) or bytes (like a digest)?
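
For readers unsure of the distinction, here is a minimal sketch using only Python's standard-library `hashlib`, not ibis itself. The input value is an illustrative assumption borrowed from the `species` column in the example above; the variable names are arbitrary.

```python
from hashlib import md5

# Illustrative input (assumption): one value from the penguins `species` column.
value = "Adelie"

# digest() returns the raw hash as a bytes object (16 bytes for MD5).
raw = md5(value.encode()).digest()

# hexdigest() returns the same hash as a string of hexadecimal characters
# (32 characters for MD5, two per byte).
text = md5(value.encode()).hexdigest()

# The hex digest is simply the raw digest rendered as hexadecimal text.
assert raw.hex() == text

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The string form is usually easier to store, compare, and display, while the bytes form is half the size; which one a given backend returns natively varies, which is the mess referred to above.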

@pdgarden
Author

Hi,

Thank you very much for your quick reply. I am looking for a function which would ideally return a string like hexdigest.

cpcloud pushed a commit that referenced this issue Jan 29, 2024
## Description of changes

This adds support for `ops.HashBytes` to `mssql` and also adds a test
for that functionality so it's easier to port when we merge in the epic
split branch.

I've also added a new op, `HashHexDigest` which returns the hexdigest of
various cryptographic hashing functions since I imagine this is what
many users are _probably_ after. This newer op (and corresponding
`hexdigest` method) can also support many more backends, as most of them
default to returning the string hex digest and not the raw binary
output.

I tried to be very accurate in the `notimpl` and `notyet` portions of
both tests and I think I've done that.

For now, only exposing DuckDB, Pyspark, and MSSQL so we don't add a huge
extra burden for the epic split but also address the user request in
#8082

And I guess now we can commence debate over the method name? 🐎

## Issues closed

Resolves #8082

---------

Co-authored-by: Jim Crist-Harif <jcristharif@gmail.com>
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jan 29, 2024
@gforsyth
Member

Hey @pdgarden -- there's a new hexdigest method that was just merged in. It will be available in the next release (sometime next week) and you can use that to get the hexdigest of the input columns. Currently it is supported with DuckDB, mssql, and pyspark. More support coming soon.

@pdgarden
Author

That's great, thank you for your support and reactivity.

gforsyth added a commit to gforsyth/ibis that referenced this issue Feb 1, 2024
…project#8107)