How to do sentence_dedup #556

ftgreat · 2025-01-20T01:29:04Z

Search before continuing

I have searched the Data-Juicer issues and found no similar feature requests.

Description

How can I do sentence-level deduplication? Is there a recommended open-source implementation? Thanks.

Use case

No response

Additional

No response

Are you willing to submit a PR for this feature?

Yes, I'd like to help by submitting a PR!

yxdyc · 2025-01-21T08:15:44Z

Thank you for your question. To my knowledge, there is currently no universally accepted practice for deduplicating foundation-model data, particularly at the sentence level and in post-tuning scenarios. For example,

in SemDeDup, CLIP and OPT are used to embed samples, and semantic deduplication is performed with k-means clustering;
in the Qwen technical report, Qwen-2.5-Math, DeepSeek-LLM, and the LLaMA technical report, MinHash+LSH deduplication is employed;
in the LLaMA technical report, the n-gram coverage ratio is also used to identify and remove lines with repeated content, and a RoBERTa model is used for semantic deduplication, with clustering and with scoring based on quality and difficulty.
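To make the MinHash idea above concrete, here is a minimal pure-Python sketch that estimates the Jaccard similarity of two texts from their MinHash signatures. It is only an illustration, not the code used in any of the reports mentioned or in Data-Juicer: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the character 5-gram shingling, and the 64 hash functions are all assumptions, and the LSH banding step that makes candidate lookup fast at scale is left out.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_perm=64, k=5):
    """Build a MinHash signature: for each seed, keep the smallest shingle hash."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for g in grams
        )
        signature.append(smallest)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature("the quick brown fox jumps over the lazy dog")
    b = minhash_signature("the quick brown fox jumps over the lazy dog!")
    print(estimated_jaccard(a, b))  # close to 1 for near-duplicates
```

In practice a pair would be treated as duplicates when the estimate exceeds a chosen threshold; the right threshold depends on the data and is not something this sketch settles.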

Data-Juicer currently provides some commonly used OPs for document-level and string-level deduplication, such as exact-match and MinHash+LSH. For sentence-level deduplication, Data-Juicer offers reference implementations based on embeddings, on n-gram coverage (one may need to call text_chunk_mapper first), and on LLMs (one may need to feed in multiple sentences and adjust the prompts).
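As a rough illustration of the n-gram coverage approach at the sentence level, the sketch below splits each document into sentences and drops a sentence when most of its word n-grams have already been seen in earlier kept sentences. This is not Data-Juicer's implementation and does not use its OPs: the helper names, the regex-based sentence splitter, the word 3-grams, and the 0.8 threshold are all assumptions chosen for the example. Text without spaces between words (for example Chinese) would need a proper tokenizer in place of the `\w+` split.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., !, ? followed by spaces, or after CJK stops."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]


def word_ngrams(sentence, n=3):
    """Return the set of lowercase word n-grams (the whole sentence if shorter than n)."""
    tokens = re.findall(r"\w+", sentence.lower())
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def dedup_sentences(docs, n=3, threshold=0.8):
    """Drop sentences whose n-gram coverage by earlier kept sentences >= threshold."""
    seen = set()
    result = []
    for doc in docs:
        kept = []
        for sent in split_sentences(doc):
            grams = word_ngrams(sent, n)
            if grams:
                coverage = len(grams & seen) / len(grams)
                if coverage >= threshold:
                    continue  # mostly repeated content, skip this sentence
                seen |= grams
            kept.append(sent)
        result.append(" ".join(kept))
    return result


if __name__ == "__main__":
    docs = [
        "Data quality matters a lot. Deduplication removes repeated text.",
        "Deduplication removes repeated text. Sentence order is preserved here.",
    ]
    for cleaned in dedup_sentences(docs):
        print(cleaned)
```

The first occurrence of a sentence is kept and later repeats are removed, so the result depends on document order. The `seen` set also grows with the corpus, so a large dataset would need hashing or a probabilistic structure instead of storing raw n-grams.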

We are exploring more advanced deduplication techniques, including fast embeddings and model rewarding. Also, we welcome more community discussions and contributions!

ftgreat added the enhancement New feature or request label Jan 20, 2025

github-project-automation bot moved this to Todo in data-juicer Jan 20, 2025

github-project-automation bot added this to data-juicer Jan 20, 2025
