
Add key type and duplicates verification with hashing #2245

Merged: 28 commits into huggingface:master on May 10, 2021

Conversation

@NikhilBartwal (Contributor) commented Apr 20, 2021

Closes #2230 ("Keys yielded while generating dataset are not being checked")
There is currently no verification of the type or uniqueness of the keys yielded by the dataset_builder.
This PR is currently a work in progress, with the following goals:

  • Add hash_salt to ArrowWriter so that keys belonging to different splits have different hashes
  • Add a key argument to ArrowWriter.write() for hashing
  • Add a hashing class which takes an input key of a certain type (str/int/anything convertible to string) and produces a 128-bit hash using hashlib.md5 (a rough sketch follows below the note)
  • Create a function that gives a custom error message when non-unique keys are found
    [This will also take care of type-checking for keys]
  • Check for duplicate keys in writer.write() for each batch

[NOTE: This PR is currently concerned with GeneratorBasedBuilder only, for simplification. A subsequent PR will be made in the future for ArrowBasedBuilder.]
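To make the plan concrete, here is a minimal sketch of what the hashing helper could look like. The keyhash module name and DuplicatedKeysError appear in the tracebacks later in this thread; the KeyHasher and InvalidKeyError names and the exact signatures here are illustrative assumptions, not the final implementation:

```python
import hashlib


class InvalidKeyError(Exception):
    """Raised when a key is not an accepted type (name assumed for this sketch)."""


class DuplicatedKeysError(Exception):
    """Raised when two examples in the same split yield the same key."""


class KeyHasher:
    """Hashes example keys to 128-bit integers, salted per split."""

    def __init__(self, hash_salt: str):
        # Absorb the split-specific salt once; a copy of this digest is reused
        # for every key, so the same key hashes differently across splits.
        self._split_md5 = hashlib.md5(hash_salt.encode("utf-8"))

    def hash(self, key) -> int:
        # Normalize accepted key types to bytes before hashing.
        if isinstance(key, str):
            key = key.encode("utf-8")
        elif isinstance(key, int):
            key = str(key).encode("utf-8")
        elif not isinstance(key, bytes):
            raise InvalidKeyError(f"Found key {key} of type {type(key)}; keys should be str, int or bytes.")
        md5 = self._split_md5.copy()
        md5.update(key)
        # md5 produces a 128-bit digest; as an int it allows cheap set membership checks
        return int(md5.hexdigest(), 16)
```

A writer can then keep a set of hashed keys per batch and raise DuplicatedKeysError when a hash repeats.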

@lhoestq Thank you for the feedback. It would be great to have your guidance on this!

@NikhilBartwal marked this pull request as ready for review April 25, 2021 21:19
@NikhilBartwal (Contributor, Author)

@lhoestq The tests for key type and duplicate keys have been added and verified successfully.
After generating with an intentionally faulty mnist script, when there is an incompatible key type, it shows:

Downloading and preparing dataset mnist/mnist (download: 11.06 MiB, generated: 60.62 MiB, post-processed: Unknown size, total: 71.67 MiB) to C:\Users\nikhil\.cache\huggingface\datasets\mnist\mnist\1.0.0\5064c25e57a1678f700d2dc798ffe8a6d519405cca7d33670fffda477857a994...
0 examples [00:00, ? examples/s]2021-04-26 02:50:03.703836: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll

FAILURE TO GENERATE DATASET: Invalid key type detected
Found Key [0, 0] of type <class 'list'>
Keys should be either str, int or bytes type

In the case of duplicate keys, it now gives:

Downloading and preparing dataset mnist/mnist (download: 11.06 MiB, generated: 60.62 MiB, post-processed: Unknown size, total: 71.67 MiB) to C:\Users\nikhil\.cache\huggingface\datasets\mnist\mnist\1.0.0\5064c25e57a1678f700d2dc798ffe8a6d519405cca7d33670fffda477857a994...
0 examples [00:00, ? examples/s]2021-04-26 02:53:13.498579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "f:\datasets\datasets-1\src\datasets\load.py", line 746, in load_dataset
    builder_instance.download_and_prepare(
  File "f:\datasets\datasets-1\src\datasets\builder.py", line 587, in download_and_prepare
    self._download_and_prepare(
  File "f:\datasets\datasets-1\src\datasets\builder.py", line 665, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "f:\datasets\datasets-1\src\datasets\builder.py", line 1002, in _prepare_split
    writer.write(example, key)
  File "f:\datasets\datasets-1\src\datasets\arrow_writer.py", line 321, in write
    self.check_duplicates()
  File "f:\datasets\datasets-1\src\datasets\arrow_writer.py", line 331, in check_duplicates
    raise DuplicatedKeysError(key)
datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 234467
Keys should be unique and deterministic in nature

Please let me know if this is what we wanted to implement. Thanks!

@lhoestq (Member) commented Apr 26, 2021

This looks pretty cool!
Yes, we can focus on the GeneratorBasedBuilder for now.

Do you think we could make the ArrowWriter not look for duplicates by default?
This way we can just enable duplicate detection when instantiating the writer in the GeneratorBasedBuilder for now.

@NikhilBartwal (Contributor, Author)

Thank you @lhoestq

> Do you think we could make the ArrowWriter not look for duplicates by default?

We can definitely do that by including a check_duplicates argument while instantiating ArrowWriter().

However, since only GeneratorBasedBuilder uses the write() function (which includes the detection code), while builders like ArrowBasedBuilder use write_table() (which stays as it was, without duplicate detection), I don't think it would be strictly necessary.

Nonetheless, doing this would require only some small changes. Please let me know your thoughts on this. Thanks!

@lhoestq (Member) commented Apr 27, 2021

I like the idea of having the duplicate detection optional for other uses of the ArrowWriter.
This class is the main tool to write Python data in Arrow format, so I'd expect it to be flexible.
That's why I think it shouldn't require users to provide keys or do any duplicate detection by default.

An alternative would be to subclass the writer and include the duplicate detection in that subclass.

Both options are fine with me; let me know what you think!

@NikhilBartwal (Contributor, Author)

> This class is the main tool to write Python data in Arrow format, so I'd expect it to be flexible.
> That's why I think it shouldn't require users to provide keys or do any duplicate detection by default.

Well, that makes sense, as the writer can indeed be used for other purposes as well.

> We can definitely do that by including a check_duplicates argument while instantiating ArrowWriter().

I think this would be the simplest and most efficient option, as subclassing the writer only for this would lead to unnecessary complexity and code duplication.

I will be adding the changes soon. Thanks for the feedback @lhoestq!

@NikhilBartwal (Contributor, Author) commented Apr 28, 2021

@lhoestq I have pushed the final changes just now.
Now, keys and duplicate checking are required only when the ArrowWriter is explicitly initialized with check_duplicates=True (in this case, for GeneratorBasedBuilder); a usage sketch follows below.

Let me know if this is what was required. Thanks!
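For reference, usage would look roughly like this. This is a sketch assuming ArrowWriter accepts a path keyword and that write() takes an optional key (consistent with the writer.write(example, key) call in the traceback above); exact signatures may differ:

```python
from datasets.arrow_writer import ArrowWriter

# Default behavior: no keys required, no duplicate detection (as before).
writer = ArrowWriter(path="train.arrow")
writer.write({"text": "hello"})

# Opt-in duplicate detection, as used by GeneratorBasedBuilder:
writer = ArrowWriter(path="train.arrow", check_duplicates=True)
writer.write({"text": "hello"}, key=1)
writer.write({"text": "world"}, key=1)  # flagged as a duplicate once the batch is checked
```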

@lhoestq (Member) left a comment

That's really cool, thanks!
Also pinging @albertvillanova for opinions.

Let us know if you need help regarding the tests or other CI failures.
Also feel free to add a test in test_arrow_writer.py to make sure it works as expected :)

[Review comment on src/datasets/arrow_writer.py (resolved)]
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@NikhilBartwal (Contributor, Author)

@lhoestq Thanks for the feedback! I will add the tests for this very soon.

However, I'm not sure what exactly is causing the segmentation fault in the failing CI tests. It seems to come from test_concatenation_table_cast in test_table.py, but I'm not sure what exactly. It would be great if you could help. Thanks!

@lhoestq (Member) commented May 3, 2021

You can merge master into your branch to fix this issue.
Basically, pyarrow 4.0.0 has a segfault issue (which has now been resolved on the master branch of pyarrow).
So until 4.0.1 comes out, we recently switched to using pyarrow<4.0.0.

@NikhilBartwal (Contributor, Author)

@lhoestq Thanks for the help with the CI failures, and apologies for the multiple merge commits; my local repo got messy while merging, which led to this.
Will be pushing the commit for the tests soon!

@NikhilBartwal (Contributor, Author) commented May 5, 2021

Hey @lhoestq, I've just added the required tests for checking key duplicates and invalid key data types.
I think we have caught a nice little issue: 27 datasets are currently using non-unique keys (hence the failing tests: all of these datasets raise DuplicatedKeysError during testing).
These datasets were not detected earlier because there was no key checking when num_examples < writer_batch_size, which let them pass the dummy data generation test. This was fixed by adding the check to the writer.finalize() method as well, so that any leftover examples from partial batches are also verified (see the sketch below).
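Sketched, the idea of the fix looks like this; the check_duplicate_keys helper name is illustrative, while DuplicatedKeysError comes from datasets.keyhash as shown in the traceback above:

```python
from collections import Counter

from datasets.keyhash import DuplicatedKeysError


def check_duplicate_keys(hashed_keys):
    """Raise if any hashed key occurs more than once in the given batch.

    Run on every full batch in write() and, crucially, also on the leftover
    batch (num_examples < writer_batch_size) in finalize(), so that small
    datasets get checked too.
    """
    for key, count in Counter(hashed_keys).items():
        if count > 1:
            raise DuplicatedKeysError(key)
```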

I'd like to fix the faulty datasets' scripts. However, I was wondering if I should do that in this PR itself or open a new PR, as this might get messy in the same PR. Let me know your thoughts on this. Thanks!

@lhoestq mentioned this pull request May 7, 2021
@lhoestq (Member) commented May 7, 2021

Hi! Once #2333 is merged, feel free to merge master into your branch to fix the CI :)

@NikhilBartwal (Contributor, Author)

Thanks a lot for the help @lhoestq. Besides merging the new changes, I guess this PR is complete for now :)

@lhoestq (Member) commented May 7, 2021

I just merged the PR; feel free to merge master into your branch. It should fix most of the CI issues. If there are some left, we can fix them in this PR :)

@NikhilBartwal (Contributor, Author)

@lhoestq Looks like the PR is complete now. Thanks for helping me out so much with this :)

@lhoestq (Member) left a comment

Just did another review; it's pretty much ready to merge.

I just had a few more comments, sorry:

[Review comments on tests/test_arrow_writer.py and src/datasets/arrow_writer.py (resolved)]
NikhilBartwal and others added 3 commits May 7, 2021 22:55
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@NikhilBartwal (Contributor, Author)

Hey @lhoestq, I've added the test and corrected the CI errors as well. Do let me know if this requires any change. Thanks!

@lhoestq (Member) left a comment

LGTM, thank you! :)

[Review comment on src/datasets/arrow_writer.py (resolved)]
@lhoestq (Member) commented May 10, 2021

Merging. I'll update the comment on the master branch (for some reason I can't edit files on this branch).

@lhoestq merged commit beca084 into huggingface:master May 10, 2021
@NikhilBartwal (Contributor, Author)

@lhoestq Thank you for the help and feedback. Feels great to contribute!
