[LLM pipeline] MinHash generation for deduplication #295
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
Thanks @mrchtr :)
Looks good overall and thanks for including tests. We should do this for the other components.
Left a few minor comments.
# Split text into words
words = text.split()

# Generate shingles of size 3 using nltk's ngrams function
Any reason for having a fixed size of 3? Should this be a component parameter?
A size of 3 is common, but I agree that it is indeed better to make it configurable.
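To make the suggestion concrete, here is a minimal sketch of what a configurable shingle size could look like. The function name `create_shingles` and the parameter name `shingle_size` are hypothetical and not taken from the PR, and the sketch uses a plain sliding window instead of nltk's ngrams function so that it runs without extra dependencies.

```python
from typing import List


def create_shingles(text: str, shingle_size: int = 3) -> List[str]:
    """Return word-level shingles of `shingle_size` words each.

    Hypothetical helper: the name and the parameter are assumptions,
    not the component's actual interface.
    """
    if shingle_size < 1:
        raise ValueError("shingle_size must be at least 1")

    # Split text into words
    words = text.split()

    # Slide a window of `shingle_size` words over the text. Texts with
    # fewer words than the window size yield no shingles.
    return [
        " ".join(words[i : i + shingle_size])
        for i in range(len(words) - shingle_size + 1)
    ]
```

Keeping 3 as the default would preserve the current behaviour while letting a pipeline override the value through a component argument.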
Thanks @mrchtr!
Thanks @mrchtr!
Can you just update the Dockerfile to match the latest version?
This component generates MinHashes of text.
The MinHash similarity will be used to determine duplicated text passages.
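For readers unfamiliar with the technique, the following is a rough, standard-library-only sketch of how MinHash signatures can approximate the Jaccard similarity between two sets of shingles. It is not the component's implementation: the conversation does not show which MinHash library the PR uses, and the names `minhash_signature`, `estimate_similarity`, and `num_perm` are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List


def minhash_signature(shingles: Iterable[str], num_perm: int = 64) -> List[int]:
    """Compute a MinHash signature: one minimum hash value per seeded hash function."""
    items = set(shingles)
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")

    signature = []
    for seed in range(num_perm):
        # Each seed acts as a different hash function by salting BLAKE2b.
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        item.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimate_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two passages whose estimated similarity exceeds a chosen threshold would then be treated as duplicate candidates. The threshold and the number of hash functions trade accuracy against cost and would need tuning for a real dataset.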