
text-dedup #96

Open · HungHoangDinh opened this issue Aug 12, 2024 · 5 comments

@HungHoangDinh
I have deduplicated my dataset. However, I cannot tell which records were discarded or which records they overlapped with. How can I check?
Hope you can help me!

@ChenghaoMou (Owner)

You can do so by modifying the filtering code (e.g. in minhash.py).

Duplicates are clustered, and by default only the record whose index == cluster id is kept. A union-find object is used to look up the cluster id for each index. You can save the uf object and the dataset partway through so that you can inspect them.
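A minimal sketch of that inspection step, assuming `uf` exposes a `find(idx)` method as described above (the `UnionFind` class here is a stand-in for illustration, not the project's implementation):

```python
import pickle
from collections import defaultdict

class UnionFind:
    """Minimal stand-in union-find with path compression (illustrative only)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        px, py = self.find(x), self.find(y)
        self.parent[px] = self.parent[py] = min(px, py)

def dump_clusters(uf, num_records, path="clusters.pkl"):
    """Group record indices by cluster id and save the clusters that
    actually contain duplicates, so discarded records can be traced back
    to the record they matched."""
    clusters = defaultdict(list)
    for idx in range(num_records):
        clusters[uf.find(idx)].append(idx)
    # Singleton clusters are unique records; only larger ones are duplicates.
    duplicates = {cid: idxs for cid, idxs in clusters.items() if len(idxs) > 1}
    with open(path, "wb") as f:
        pickle.dump(duplicates, f)
    return duplicates
```

In each saved cluster, the member whose index equals the cluster id is the record that is kept; the remaining indices are the ones the filter would discard.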

@HungHoangDinh (Author) commented Aug 12, 2024

I want to check which sentences in the dataset the eliminated sentences matched. Can you help me do this?

@ChenghaoMou (Owner)

You just need to save the dataset after this line, assuming you are familiar with Huggingface's datasets library.

The saved dataset will contain the cluster id. You can iterate over each cluster to see its duplicates.
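A sketch of that iteration, using plain dicts to stand in for rows of the saved dataset (the `__cluster__` column name is an assumption; check minhash.py for the name actually used):

```python
from collections import defaultdict

# Rows as they might come from the saved Huggingface dataset after the
# cluster-id column has been added. Both the texts and the "__cluster__"
# column name are illustrative assumptions.
rows = [
    {"text": "a cat", "__cluster__": 0},
    {"text": "a cat!", "__cluster__": 0},
    {"text": "a dog", "__cluster__": 2},
]

def duplicate_groups(rows):
    """Group texts by cluster id; clusters with >1 member are duplicate sets."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["__cluster__"]].append(row["text"])
    return {cid: texts for cid, texts in groups.items() if len(texts) > 1}
```

With a real saved dataset you would iterate the `Dataset` object the same way, or use its `filter`/`to_pandas` helpers to pull out one cluster at a time.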

@HungHoangDinh (Author)

My text set has 700 thousand sentences, but after running text-dedup only 600 sentences remain. I want to increase the number of sentences retained. With minhash I increased the threshold and num_perm, but the results did not change. Can you help me solve this problem?

@ChenghaoMou (Owner)

Here is some information you could provide to facilitate this conversation:

  1. Have you tried the suggestions I provided above? If so, what issues did you see in the results?
  2. Have you read the code to make sure it suits your dataset?
  3. Could you provide more information about your dataset (such as example records, language, and typical length) and the command you used, so I can reproduce the issue?
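One relevant detail for the threshold question above: in MinHash deduplication, the threshold approximates the estimated Jaccard similarity above which two records are grouped as duplicates, so raising it should flag fewer pairs and retain more records. A stdlib-only sketch of how that estimate behaves (the shingling and hashing here are illustrative, not text-dedup's exact implementation):

```python
import hashlib

def shingles(text, k=3):
    """Character k-gram shingles of a string."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def _h(seed, shingle):
    """Deterministic per-seed hash of a shingle (stands in for a permutation)."""
    return int(hashlib.md5(f"{seed}:{shingle}".encode()).hexdigest(), 16)

def minhash_sig(shingle_set, num_perm=128):
    """MinHash signature: the minimum hash value under each of num_perm seeds."""
    return [min(_h(seed, s) for s in shingle_set) for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sim = est_jaccard(
    minhash_sig(shingles("the quick brown fox")),
    minhash_sig(shingles("the quick brown fox!")),
)
# Near-identical strings score close to 1.0: a dedup threshold below `sim`
# would group them as duplicates, while a threshold above it keeps both.
```

If raising the threshold changes nothing, it may not be reaching the script at all (e.g. a mistyped flag), which is why the command used is needed to diagnose this.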
