Add key type and duplicates verification with hashing #2245
@lhoestq The tests for key type and duplicate keys have been added and verified successfully.
In the case of duplicate keys, it now gives:
Please let me know if this is what we wanted to implement. Thanks!
This looks pretty cool ! Do you think we could make the ArrowWriter not look for duplicates by default ?
Thank you @lhoestq
We can definitely do that by including a […] However, since only […] Nonetheless, doing this would require just some small changes. Please let me know your thoughts on this. Thanks!
I like the idea of having the duplicate detection optional for other uses of the ArrowWriter. An alternative would be to subclass the writer to include duplicates detection in another class. Both options are fine for me, let me know what you think !
Well, that makes sense as the writer can indeed be used for other purposes as well.
I think that this would be the simplest and most efficient option for achieving this, as subclassing the writer only for this would lead to unnecessary complexity and code duplication (in case of […]). I will be adding the changes soon. Thanks for the feedback @lhoestq!
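As a rough illustration of the optional duplicate detection discussed above, here is a minimal sketch. `SketchWriter` is a hypothetical stand-in, not the real `ArrowWriter`, and the flag name `check_duplicates` is an assumption made for this example only.

```python
class SketchWriter:
    """Hypothetical stand-in for a writer; NOT the real ArrowWriter API."""

    def __init__(self, check_duplicates: bool = False):
        # Assumed flag name: detection is off unless the caller opts in.
        self.check_duplicates = check_duplicates
        self.rows = []
        self._seen_keys = set()

    def write(self, example, key=None):
        if self.check_duplicates:
            if key is None:
                raise ValueError("a key is required when check_duplicates=True")
            if key in self._seen_keys:
                raise ValueError(f"duplicate key: {key!r}")
            self._seen_keys.add(key)
        self.rows.append(example)


# Default behaviour: repeated keys are accepted silently.
plain = SketchWriter()
plain.write({"text": "a"}, key=0)
plain.write({"text": "b"}, key=0)

# Opt-in behaviour: the second write with the same key is rejected.
strict = SketchWriter(check_duplicates=True)
strict.write({"text": "a"}, key=0)
```

Keeping the check behind a flag avoids a second class and leaves other users of the writer unaffected, which is the trade-off weighed in the comments above.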
@lhoestq I have pushed the final changes just now. Let me know if this is what was required. Thanks!
That's really cool thanks !
Also pinging @albertvillanova for opinions
Let us know if you need help regarding the tests or other CI failures.
Also feel free to add a test in test_arrow_writer.py to make sure it works as expected :)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@lhoestq Thanks for the feedback! I will be adding the tests for the same very soon. However, I'm not sure as to what exactly is causing the […]
You can merge master into your branch to fix this issue.
…sets-1 into hash_key_verification
…_key_verification
@lhoestq Thanks for the help with the CI failures. Apologies for the multiple merge commits. My local repo got messy while merging which led to this.
Hey @lhoestq, I've just added the required tests for checking key duplicates and invalid key data types. I'd like to make changes to the faulty datasets' scripts. However, I was wondering if I should do that in this PR itself or open a new PR as this might get messy in the same PR. Let me know your thoughts on this. Thanks!
Hi ! Once #2333 is merged, feel free to merge master into your branch to fix the CI :)
Thanks a lot for the help @lhoestq. Besides merging the new changes, I guess this PR is completed for now :)
I just merged the PR, feel free to merge […]
@lhoestq Looks like the PR is completed now. Thanks for helping me out so much in this :)
Just did another review. It's pretty much ready to merge
I just had other comments, sorry:
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Hey @lhoestq, I've added the test and corrected the CI errors as well. Do let me know if this requires any change. Thanks!
LGTM thank you ! :)
Merging. I'll update the comment on the master branch (for some reason I can't edit files on this branch)
@lhoestq Thank you for the help and feedback. Feels great to contribute!
Closes #2230
There is currently no verification for the data type and the uniqueness of the keys yielded by the `dataset_builder`.
This PR is currently a work in progress with the following goals:
- Add `hash_salt` to `ArrowWriter` so that the keys belonging to different splits have different hashes
- Add a `key` attribute to `ArrowWriter.write()` for hashing
- Hash each key (`str`/`int`/anything convertible to string) into a 128-bit hash using `hashlib.md5` [This will take care of type-checking for keys]
- Check for duplicate keys in `writer.write()` for each batch

[NOTE: This PR is currently concerned with `GeneratorBasedBuilder` only, for simplification. A subsequent PR will be made in the future for `ArrowBasedBuilder`.]

@lhoestq Thank you for the feedback. It would be great to have your guidance on this!
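The goals in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the names `hash_key`, `find_duplicate`, `InvalidKeyError` and `DuplicateKeyError` are hypothetical, and the accepted key types are narrowed to `str`, `int` and `bytes` for simplicity.

```python
import hashlib


class InvalidKeyError(TypeError):
    """Hypothetical error for keys of an unsupported type."""


class DuplicateKeyError(ValueError):
    """Hypothetical error for a key seen twice within one split."""


def hash_key(key, hash_salt: str) -> int:
    """Return a 128-bit integer hash of `key`, salted per split."""
    # Assumption: only str, int and bytes keys are accepted in this sketch.
    if not isinstance(key, (str, int, bytes)):
        raise InvalidKeyError(f"unsupported key type: {type(key).__name__}")
    md5 = hashlib.md5()
    md5.update(hash_salt.encode("utf-8"))
    md5.update(key if isinstance(key, bytes) else str(key).encode("utf-8"))
    return int(md5.hexdigest(), 16)


def find_duplicate(keys, hash_salt: str) -> None:
    """Raise DuplicateKeyError if two keys in the batch share a hash."""
    seen = set()
    for key in keys:
        hashed = hash_key(key, hash_salt)
        if hashed in seen:
            raise DuplicateKeyError(f"duplicate key: {key!r}")
        seen.add(hashed)


# The same key hashes differently under different split salts.
train_hash = hash_key("id-0", "train")
test_hash = hash_key("id-0", "test")
```

Because keys are converted to strings before hashing, `1` and `"1"` collide in this sketch and would be reported as duplicates; whether that is desirable depends on the builder.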