
Add key type and duplicates verification with hashing #2245

Merged: 28 commits into huggingface:master on May 10, 2021

Conversation

@NikhilBartwal (Contributor) commented Apr 20, 2021

Closes #2230 ("Keys yielded while generating dataset are not being checked")
There is currently no verification of the type or uniqueness of the keys yielded by the dataset_builder.
This PR is currently a work in progress, with the following goals:

  • Add hash_salt to ArrowWriter so that keys belonging to different splits have different hashes
  • Add a key argument to ArrowWriter.write() for hashing
  • Add a hashing class which takes an input key of a certain type (str/int/anything convertible to string) and produces a 128-bit hash using hashlib.md5 (a rough sketch follows below the note)
  • Create a function that gives a custom error message when non-unique keys are found
    [This will also take care of type-checking for keys]
  • Check for duplicate keys in writer.write() for each batch

[NOTE: This PR is currently concerned with GeneratorBasedBuilder only, for simplification. A subsequent PR will be made in the future for ArrowBasedBuilder.]
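To make the plan concrete, here is a minimal sketch of what the hashing helper could look like. The keyhash module name and DuplicatedKeysError appear in the tracebacks later in this thread; the KeyHasher and InvalidKeyError names and the exact signatures here are illustrative assumptions, not the final implementation:

```python
import hashlib


class InvalidKeyError(Exception):
    """Raised when a key is not an accepted type (name assumed for this sketch)."""


class DuplicatedKeysError(Exception):
    """Raised when two examples in the same split yield the same key."""


class KeyHasher:
    """Hashes example keys to 128-bit integers, salted per split."""

    def __init__(self, hash_salt: str):
        # Absorb the split-specific salt once; a copy of this digest is reused
        # for every key, so the same key hashes differently across splits.
        self._split_md5 = hashlib.md5(hash_salt.encode("utf-8"))

    def hash(self, key) -> int:
        # Normalize accepted key types to bytes before hashing.
        if isinstance(key, str):
            key = key.encode("utf-8")
        elif isinstance(key, int):
            key = str(key).encode("utf-8")
        elif not isinstance(key, bytes):
            raise InvalidKeyError(f"Found key {key} of type {type(key)}; keys should be str, int or bytes.")
        md5 = self._split_md5.copy()
        md5.update(key)
        # md5 produces a 128-bit digest; as an int it allows cheap set membership checks
        return int(md5.hexdigest(), 16)
```

A writer can then keep a set of hashed keys per batch and raise DuplicatedKeysError when a hash repeats.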

@lhoestq Thank you for the feedback. It would be great to have your guidance on this!

@NikhilBartwal marked this pull request as ready for review April 25, 2021 21:19
@NikhilBartwal (Contributor, Author)

@lhoestq The tests for key type and duplicate keys have been added and verified successfully.
After generating with an intentionally faulty mnist script, when there is an incompatible key type, it shows:

Downloading and preparing dataset mnist/mnist (download: 11.06 MiB, generated: 60.62 MiB, post-processed: Unknown size, total: 71.67 MiB) to C:\Users\nikhil\.cache\huggingface\datasets\mnist\mnist\1.0.0\5064c25e57a1678f700d2dc798ffe8a6d519405cca7d33670fffda477857a994...
0 examples [00:00, ? examples/s]2021-04-26 02:50:03.703836: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll

FAILURE TO GENERATE DATASET: Invalid key type detected
Found Key [0, 0] of type <class 'list'>
Keys should be either str, int or bytes type

In the case of duplicate keys, it now gives:

Downloading and preparing dataset mnist/mnist (download: 11.06 MiB, generated: 60.62 MiB, post-processed: Unknown size, total: 71.67 MiB) to C:\Users\nikhil\.cache\huggingface\datasets\mnist\mnist\1.0.0\5064c25e57a1678f700d2dc798ffe8a6d519405cca7d33670fffda477857a994...
0 examples [00:00, ? examples/s]2021-04-26 02:53:13.498579: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "f:\datasets\datasets-1\src\datasets\load.py", line 746, in load_dataset
    builder_instance.download_and_prepare(
  File "f:\datasets\datasets-1\src\datasets\builder.py", line 587, in download_and_prepare
    self._download_and_prepare(
  File "f:\datasets\datasets-1\src\datasets\builder.py", line 665, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "f:\datasets\datasets-1\src\datasets\builder.py", line 1002, in _prepare_split
    writer.write(example, key)
  File "f:\datasets\datasets-1\src\datasets\arrow_writer.py", line 321, in write
    self.check_duplicates()
  File "f:\datasets\datasets-1\src\datasets\arrow_writer.py", line 331, in check_duplicates
    raise DuplicatedKeysError(key)
datasets.keyhash.DuplicatedKeysError: FAILURE TO GENERATE DATASET !
Found duplicate Key: 234467
Keys should be unique and deterministic in nature

Please let me know if this is what we wanted to implement. Thanks!

@lhoestq (Member) commented Apr 26, 2021

This looks pretty cool!
Yes, we can focus on the GeneratorBasedBuilder for now.

Do you think we could make the ArrowWriter not look for duplicates by default?
This way we can just enable duplicate detection when instantiating the writer in the GeneratorBasedBuilder for now.

@NikhilBartwal (Contributor, Author)

Thank you @lhoestq

> Do you think we could make the ArrowWriter not look for duplicates by default?

We can definitely do that by including a check_duplicates argument while instantiating ArrowWriter().

However, since only GeneratorBasedBuilder uses the write() function (which includes the detection code), while builders like ArrowBasedBuilder use write_table() (which stays as it was, without duplicate detection), I don't think it would be strictly necessary.

Nonetheless, doing this would require only some small changes. Please let me know your thoughts on this. Thanks!

@lhoestq (Member) commented Apr 27, 2021

I like the idea of having the duplicate detection optional for other uses of the ArrowWriter.
This class is the main tool to write Python data in Arrow format, so I'd expect it to be flexible.
That's why I think it shouldn't require users to provide keys or do any duplicate detection by default.

An alternative would be to subclass the writer and include the duplicate detection in that subclass.

Both options are fine with me; let me know what you think!

@NikhilBartwal (Contributor, Author)

> This class is the main tool to write Python data in Arrow format, so I'd expect it to be flexible.
> That's why I think it shouldn't require users to provide keys or do any duplicate detection by default.

Well, that makes sense, as the writer can indeed be used for other purposes as well.

> We can definitely do that by including a check_duplicates argument while instantiating ArrowWriter().

I think this would be the simplest and most efficient option, as subclassing the writer only for this would lead to unnecessary complexity and code duplication.

I will be adding the changes soon. Thanks for the feedback @lhoestq!

@NikhilBartwal (Contributor, Author) commented Apr 28, 2021

@lhoestq I have pushed the final changes just now.
Now, keys and duplicate checking are required only when the ArrowWriter is explicitly initialized with check_duplicates=True (in this case, for GeneratorBasedBuilder); a usage sketch follows below.

Let me know if this is what was required. Thanks!
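For reference, usage would look roughly like this. This is a sketch assuming ArrowWriter accepts a path keyword and that write() takes an optional key (consistent with the writer.write(example, key) call in the traceback above); exact signatures may differ:

```python
from datasets.arrow_writer import ArrowWriter

# Default behavior: no keys required, no duplicate detection (as before).
writer = ArrowWriter(path="train.arrow")
writer.write({"text": "hello"})

# Opt-in duplicate detection, as used by GeneratorBasedBuilder:
writer = ArrowWriter(path="train.arrow", check_duplicates=True)
writer.write({"text": "hello"}, key=1)
writer.write({"text": "world"}, key=1)  # flagged as a duplicate once the batch is checked
```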

@lhoestq (Member) left a comment

That's really cool, thanks!
Also pinging @albertvillanova for opinions.

Let us know if you need help regarding the tests or other CI failures.
Also feel free to add a test in test_arrow_writer.py to make sure it works as expected :)

[Review comment on src/datasets/arrow_writer.py (resolved)]
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@NikhilBartwal (Contributor, Author)

@lhoestq Thanks for the feedback! I will add the tests for this very soon.

However, I'm not sure what exactly is causing the segmentation fault in the failing CI tests. It seems to come from test_concatenation_table_cast in test_table.py, but I'm not sure what exactly. It would be great if you could help. Thanks!

@lhoestq (Member) commented May 3, 2021

You can merge master into your branch to fix this issue.
Basically, pyarrow 4.0.0 has a segfault issue (which has now been resolved on the master branch of pyarrow).
So until 4.0.1 comes out, we recently switched to using pyarrow<4.0.0.

@NikhilBartwal (Contributor, Author)

@lhoestq Thanks for the help with the CI failures, and apologies for the multiple merge commits; my local repo got messy while merging, which led to this.
Will be pushing the commit for the tests soon!

@NikhilBartwal (Contributor, Author) commented May 5, 2021

Hey @lhoestq, I've just added the required tests for checking key duplicates and invalid key data types.
I think we have caught a nice little issue: 27 datasets are currently using non-unique keys (hence the failing tests: all of these datasets raise DuplicatedKeysError during testing).
These datasets were not detected earlier because there was no key checking when num_examples < writer_batch_size, which let them pass the dummy data generation test. This was fixed by adding the check to the writer.finalize() method as well, so that any leftover examples from partial batches are also verified (see the sketch below).
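Sketched, the idea of the fix looks like this; the check_duplicate_keys helper name is illustrative, while DuplicatedKeysError comes from datasets.keyhash as shown in the traceback above:

```python
from collections import Counter

from datasets.keyhash import DuplicatedKeysError


def check_duplicate_keys(hashed_keys):
    """Raise if any hashed key occurs more than once in the given batch.

    Run on every full batch in write() and, crucially, also on the leftover
    batch (num_examples < writer_batch_size) in finalize(), so that small
    datasets get checked too.
    """
    for key, count in Counter(hashed_keys).items():
        if count > 1:
            raise DuplicatedKeysError(key)
```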

I'd like to fix the faulty datasets' scripts. However, I was wondering if I should do that in this PR itself or open a new PR, as this might get messy in the same PR. Let me know your thoughts on this. Thanks!

@lhoestq mentioned this pull request May 7, 2021
@lhoestq (Member) commented May 7, 2021

Hi! Once #2333 is merged, feel free to merge master into your branch to fix the CI :)

@NikhilBartwal (Contributor, Author)

Thanks a lot for the help @lhoestq. Besides merging the new changes, I guess this PR is complete for now :)

@lhoestq (Member) commented May 7, 2021

I just merged the PR; feel free to merge master into your branch. It should fix most of the CI issues. If there are some left, we can fix them in this PR :)

@NikhilBartwal (Contributor, Author)

@lhoestq Looks like the PR is complete now. Thanks for helping me out so much with this :)

@lhoestq (Member) left a comment

Just did another review; it's pretty much ready to merge.

I just had a few more comments, sorry:

[Review comments on tests/test_arrow_writer.py and src/datasets/arrow_writer.py (resolved)]
NikhilBartwal and others added 3 commits May 7, 2021 22:55
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@NikhilBartwal (Contributor, Author)

Hey @lhoestq, I've added the test and corrected the CI errors as well. Do let me know if this requires any change. Thanks!

@lhoestq (Member) left a comment

LGTM, thank you! :)

[Review comment on src/datasets/arrow_writer.py (resolved)]
@lhoestq (Member) commented May 10, 2021

Merging. I'll update the comment on the master branch (for some reason I can't edit files on this branch).

@lhoestq merged commit beca084 into huggingface:master May 10, 2021
@NikhilBartwal (Contributor, Author)

@lhoestq Thank you for the help and feedback. Feels great to contribute!
