How to do sentence_dedup #556

Open
1 of 2 tasks
ftgreat opened this issue Jan 20, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

ftgreat commented Jan 20, 2025

Search before continuing

  • I have searched the Data-Juicer issues and found no similar feature requests.

Description

How can I do sentence-level deduplication? Is there a recommended open-source implementation? Thanks.

Use case

No response

Additional

No response

Are you willing to submit a PR for this feature?

  • Yes I'd like to help by submitting a PR!
ftgreat added the enhancement label Jan 20, 2025

yxdyc (Collaborator) commented Jan 21, 2025

Thank you for your question. To my knowledge, there is currently no universally accepted practice for deduplicating foundation-model data, particularly at the sentence level and in post-tuning scenarios. For example:

  • in SemDedup, CLIP and OPT are used to embed samples, and semantic deduplication is performed with k-means clustering;
  • in the Qwen technical report, Qwen-2.5-Math, DeepSeek-LLM, and the LLaMA technical report, MinHash+LSH deduplication is employed;
  • in the LLaMA technical report, the n-gram coverage ratio is used to identify and remove lines with repeated content, and a RoBERTa model is used for semantic deduplication, with clustering and with scoring based on quality and difficulty.
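
To make the MinHash+LSH idea above concrete at the sentence level, here is a minimal standard-library sketch. It is not Data-Juicer's implementation or the pipeline of any of the reports listed; the function names, the word 3-gram shingles, the BLAKE2b-based hash family, and the `num_perm=64` / `bands=16` settings are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

MAX_HASH = 2**64  # larger than any 8-byte hash; only used for empty inputs


def shingles(text, n=3):
    """Return the set of lowercased word n-grams of a sentence."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(token, seed):
    """One member of a seeded hash family (assumed choice: BLAKE2b, 8 bytes)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=64):
    """Keep the minimum hash per seed; equal slots approximate Jaccard overlap."""
    return [min((_hash(t, seed) for t in tokens), default=MAX_HASH)
            for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Bucket signatures band by band; items sharing any bucket are candidates."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


if __name__ == "__main__":
    sentences = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy cat.",
        "Sentence-level deduplication is still an open problem.",
    ]
    sigs = [minhash_signature(shingles(s)) for s in sentences]
    for i, j in sorted(lsh_candidate_pairs(sigs)):
        print(i, j, round(estimated_jaccard(sigs[i], sigs[j]), 2))
```

LSH only proposes candidate pairs; a real pipeline would still verify each candidate against a similarity threshold before dropping one of the two sentences. Short sentences yield very few shingles, so the estimate is noisy for them.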

Data-Juicer currently provides some commonly used OPs for document-level and string-level deduplication, such as exact match and MinHash+LSH. For sentence-level deduplication, Data-Juicer offers reference implementations based on embeddings, on n-gram coverage (one may need to call text_chunk_mapper first), and on LLMs (one may need to feed in multiple sentences and adjust the prompts).
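
As a rough illustration of the "chunk first, then filter by n-gram coverage" route, the following standalone sketch splits a document into sentences and drops any sentence whose n-grams are mostly covered by sentences already kept. It does not call Data-Juicer: the regex splitter is only a stand-in for a chunking step such as text_chunk_mapper, and the function names, the tokenizer (single CJK characters plus lowercase alphanumeric words), `n=3`, and `threshold=0.8` are assumptions made for this example.

```python
import re


def split_sentences(text):
    """Naive splitter: a run of text plus its trailing terminators (Latin or CJK)."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]*", text)
    return [p.strip() for p in parts if p.strip()]


def ngrams(sentence, n=3):
    """Token n-grams; each CJK character counts as one token."""
    tokens = re.findall(r"[\u4e00-\u9fff]|[a-z0-9]+", sentence.lower())
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dedup_sentences(text, n=3, threshold=0.8):
    """Keep a sentence unless >= threshold of its n-grams were already seen."""
    seen = set()
    kept = []
    for sentence in split_sentences(text):
        grams = ngrams(sentence, n)
        if not grams:
            kept.append(sentence)  # nothing to compare (e.g. only symbols)
            continue
        coverage = len(grams & seen) / len(grams)
        if coverage >= threshold:
            continue  # mostly repeated content: drop it
        kept.append(sentence)
        seen |= grams
    return kept


if __name__ == "__main__":
    doc = ("Data quality matters a lot. Data quality matters a lot! "
           "Deduplication removes repeated text.")
    print(dedup_sentences(doc))
```

This catches exact and near-verbatim repeats only. Paraphrases with little lexical overlap would need the embedding-based or LLM-based routes mentioned above.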

We are exploring more advanced deduplication techniques, including fast embeddings and model rewarding. We also welcome further community discussion and contributions!
