Thank you for your question. Currently, to my knowledge, there is no universally accepted practice for foundation model deduplication, particularly for sentence-level usage and post-tuning scenarios. For example,
in SemDedup, CLIP and OPT embeddings are used to represent samples, and semantic deduplication is performed via k-means clustering;
in the LLaMA technical report, the n-gram coverage ratio is used to identify and remove lines with repeated content, and a RoBERTa model is used for semantic deduplication with clustering and with scoring based on quality and difficulty.
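The SemDedup-style cluster-then-compare idea can be sketched in pure Python. This is a toy stand-in, not SemDedup's actual code: it assumes embeddings are precomputed (in practice they would come from a model such as CLIP or OPT), uses a tiny k-means, and the 0.95 similarity threshold is illustrative.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=10, seed=0):
    """Tiny k-means using cosine similarity; returns a cluster id per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[max(range(k), key=lambda i: cosine(v, centroids[i]))].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return [max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors]

def semantic_dedup(sentences, embeddings, k=2, threshold=0.95):
    """Within each cluster, drop a sentence too similar to one already kept."""
    assignments = kmeans(embeddings, k)
    kept = []
    for cid in range(k):
        kept_in_cluster = []
        for i in (i for i, a in enumerate(assignments) if a == cid):
            if all(cosine(embeddings[i], embeddings[j]) < threshold
                   for j in kept_in_cluster):
                kept_in_cluster.append(i)
        kept.extend(kept_in_cluster)
    return [sentences[i] for i in sorted(kept)]
```

Clustering first keeps the pairwise similarity comparisons local to each cluster, which is what makes this approach tractable at scale compared with comparing all pairs.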
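The n-gram coverage heuristic can be sketched as a streaming line filter (my own minimal version; the n=3 and 50% coverage settings are arbitrary, not values from the LLaMA report):

```python
def ngrams(tokens, n=3):
    """All contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def filter_repeated_lines(lines, n=3, max_coverage=0.5):
    """Keep a line only if less than `max_coverage` of its word n-grams
    have already appeared in previously kept lines."""
    seen = set()
    kept = []
    for line in lines:
        grams = ngrams(line.split(), n)
        if grams:
            covered = sum(1 for g in grams if g in seen)
            if covered / len(grams) >= max_coverage:
                continue  # mostly repeated content: drop the line
        seen.update(grams)
        kept.append(line)
    return kept
```

A line that largely rehashes n-grams from earlier lines is treated as repeated content and dropped, while lines with mostly fresh n-grams pass through.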
Data-Juicer currently provides several commonly used OPs for document-level and string-level deduplication, such as exact match and MinHash + LSH. For sentence-level deduplication, Data-Juicer offers reference implementations based on embeddings, on n-gram coverage (you may need to call text_chunk_mapper first), and on LLMs (you may need to feed multiple sentences at a time and adjust the prompts).
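For reference, here is a minimal MinHash + LSH sketch in pure Python. This is not Data-Juicer's implementation; the shingle size, signature length, and band count are illustrative choices.

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles of a document (assumes len(text) >= k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_perm=64):
    """One MinHash value per seeded hash function."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def lsh_buckets(signatures, bands=16):
    """Group documents whose signatures collide in at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

With 64 hashes split into 16 bands of 4 rows, two documents collide in some band with probability about 1 - (1 - s^4)^16, where s is their Jaccard similarity, so near-duplicates are caught with high probability while dissimilar pairs rarely collide.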
We are exploring more advanced deduplication techniques, including fast embeddings and model rewarding. We also welcome further community discussion and contributions!
Search before continuing
Description
How can I do sentence-level deduplication? Are there any recommended open-source implementations? Thanks!
Use case
No response
Additional
No response
Are you willing to submit a PR for this feature?