Add support for writing deletion vectors in Delta Lake 2.4 #8674
994ee73 to 1221a0e
Signed-off-by: Andy Grove <andygrove@nvidia.com>
1221a0e to 1221464
Signed-off-by: Raza Jafri <raza.jafri@gmail.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
test
build
Databricks build failing due to #8726
build
build failed with seemingly unrelated issue:
build
It doesn't feel like we're properly supporting deletion vectors if we're sending the bulk of the data through a CPU UDF. That appears to be the case after glancing at what DeleteWithDeletionVectorsHelper.findTouchedFiles does.
Also, what about spark332db, which also supports deletion vectors?
Note that we've built GPU versions of Delta Lake CPU UDFs before, and we should be able to do something similar here if necessary, although I haven't scoped the effort. If we've done benchmarking showing that partially supporting deletion vectors (i.e., falling back to the CPU to compute the vector values before writing via the GPU) significantly outperforms falling back to the CPU for the entire delete operation, then this would be worth committing.
I'm working on that as a separate PR since it is much more involved. The tracking issue is #8654.
At first glance, it does not look trivial to implement on the GPU, since it looks like we would need to support roaring bitmap format vectors.
We have not benchmarked this. I will move this PR to draft for now, but perhaps it should just be closed until we decide to fully GPU-accelerate the deletion vector writes.
Closes #8554
Changes in this PR:
Note that GPU-accelerating the metadata queries involved will not be trivial due to row-based UDFs, custom data types, and roaring bitmap aggregation operators, which we do not support on GPU.
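For readers unfamiliar with the aggregation being discussed, here is a rough conceptual sketch (in Python, not the plugin's Scala) of what that step amounts to: matched row positions are grouped by file into a per-file deletion vector, and a reader then skips those positions. A plain `set` stands in for a roaring bitmap, and every function and file name below is hypothetical; this is not how Delta Lake or the RAPIDS plugin implements it.

```python
# Conceptual sketch only: a plain Python set stands in for the roaring
# bitmap, and all names here are hypothetical.
from collections import defaultdict


def build_deletion_vectors(touched_rows):
    """Aggregate (file_path, row_index) pairs into one row-index set per file."""
    vectors = defaultdict(set)
    for file_path, row_index in touched_rows:
        vectors[file_path].add(row_index)
    return dict(vectors)


def read_with_deletion_vector(rows, deleted):
    """Return the rows of one file, skipping positions marked as deleted."""
    return [row for i, row in enumerate(rows) if i not in deleted]


# Rows matched by a hypothetical DELETE condition, as (file, row index) pairs.
touched = [("part-0.parquet", 1), ("part-0.parquet", 3), ("part-1.parquet", 0)]
vectors = build_deletion_vectors(touched)

# Reading part-0 with its deletion vector applied leaves rows 0 and 2.
remaining = read_with_deletion_vector(["a", "b", "c", "d"], vectors["part-0.parquet"])
```

The data files themselves are not rewritten in this model; only the per-file set of deleted positions is produced, which is why the per-row grouping step dominates the work.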