
text-dedup #96

Open · HungHoangDinh opened this issue Aug 12, 2024 · 5 comments

@HungHoangDinh
I have deduplicated my dataset. However, I cannot tell which records were discarded or which records they overlapped with. How can I check?
Hope you can help me!

@ChenghaoMou (Owner)

You can do so by modifying the filtering code (e.g. in minhash.py).

Duplicates are clustered, and by default only the record whose index == cluster id is kept. A union-find object is used to look up the cluster id for each index. You can save the uf object and the dataset partway through so that you can inspect them.
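A minimal sketch of that inspection step, assuming `uf` exposes a `find(idx)` method as described above (the `UnionFind` class here is a stand-in for illustration, not the project's implementation):

```python
import pickle
from collections import defaultdict

class UnionFind:
    """Minimal stand-in union-find with path compression (illustrative only)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        px, py = self.find(x), self.find(y)
        self.parent[px] = self.parent[py] = min(px, py)

def dump_clusters(uf, num_records, path="clusters.pkl"):
    """Group record indices by cluster id and save the clusters that
    actually contain duplicates, so discarded records can be traced back
    to the record they matched."""
    clusters = defaultdict(list)
    for idx in range(num_records):
        clusters[uf.find(idx)].append(idx)
    # Singleton clusters are unique records; only larger ones are duplicates.
    duplicates = {cid: idxs for cid, idxs in clusters.items() if len(idxs) > 1}
    with open(path, "wb") as f:
        pickle.dump(duplicates, f)
    return duplicates
```

In each saved cluster, the member whose index equals the cluster id is the record that is kept; the remaining indices are the ones the filter would discard.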

@HungHoangDinh (Author) commented Aug 12, 2024

I want to check which sentences in the dataset the eliminated sentences matched. Can you help me do this?

@ChenghaoMou (Owner)

You just need to save the dataset after this line, assuming you are familiar with Huggingface's datasets library.

The saved dataset will contain the cluster id. You can iterate over each cluster to see its duplicates.
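A sketch of that iteration, using plain dicts to stand in for rows of the saved dataset (the `__cluster__` column name is an assumption; check minhash.py for the name actually used):

```python
from collections import defaultdict

# Rows as they might come from the saved Huggingface dataset after the
# cluster-id column has been added. Both the texts and the "__cluster__"
# column name are illustrative assumptions.
rows = [
    {"text": "a cat", "__cluster__": 0},
    {"text": "a cat!", "__cluster__": 0},
    {"text": "a dog", "__cluster__": 2},
]

def duplicate_groups(rows):
    """Group texts by cluster id; clusters with >1 member are duplicate sets."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["__cluster__"]].append(row["text"])
    return {cid: texts for cid, texts in groups.items() if len(texts) > 1}
```

With a real saved dataset you would iterate the `Dataset` object the same way, or use its `filter`/`to_pandas` helpers to pull out one cluster at a time.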

@HungHoangDinh (Author)

My text set has 700 thousand sentences, but after running text-dedup only 600 sentences remain. I want to increase the number of sentences retained. With minhash I increased the threshold and num_perm, but the results did not change. Can you help me solve this problem?

@ChenghaoMou (Owner)

Here is some information you could provide to facilitate this conversation:

  1. Have you tried the suggestions I provided above? If so, what issues did you see in the results?
  2. Have you read the code to make sure it suits your dataset?
  3. Could you provide more information about your dataset (such as example records, language, and typical length) and the command you used, so I can reproduce the issue?
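One relevant detail for the threshold question above: in MinHash deduplication, the threshold approximates the estimated Jaccard similarity above which two records are grouped as duplicates, so raising it should flag fewer pairs and retain more records. A stdlib-only sketch of how that estimate behaves (the shingling and hashing here are illustrative, not text-dedup's exact implementation):

```python
import hashlib

def shingles(text, k=3):
    """Character k-gram shingles of a string."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def _h(seed, shingle):
    """Deterministic per-seed hash of a shingle (stands in for a permutation)."""
    return int(hashlib.md5(f"{seed}:{shingle}".encode()).hexdigest(), 16)

def minhash_sig(shingle_set, num_perm=128):
    """MinHash signature: the minimum hash value under each of num_perm seeds."""
    return [min(_h(seed, s) for s in shingle_set) for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sim = est_jaccard(
    minhash_sig(shingles("the quick brown fox")),
    minhash_sig(shingles("the quick brown fox!")),
)
# Near-identical strings score close to 1.0: a dedup threshold below `sim`
# would group them as duplicates, while a threshold above it keeps both.
```

If raising the threshold changes nothing, it may not be reaching the script at all (e.g. a mistyped flag), which is why the command used is needed to diagnose this.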
