Can filtering keep both Chinese and English at the same time? #125

Closed
3 tasks done
pkugyf opened this issue Dec 7, 2023 · 2 comments · Fixed by #151
Assignees
Labels
enhancement New feature or request question Further information is requested

Comments

pkugyf commented Dec 7, 2023

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch to run again and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

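The language_id_score_filter operator can filter a dataset down to Chinese or to English, but what should I do if I want to keep both Chinese and English samples?
Duplicating the language_id_score_filter operator, with the first step keeping English and the second keeping Chinese, clearly does not work, because the Chinese samples are already removed in the first step.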
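
Chaining fails because consecutive filters compose as a logical AND: a sample must pass every stage, and no sample is both English and Chinese. A minimal Python sketch on toy data illustrates this; the `keep_lang` helper and the pre-labelled sample dicts are hypothetical and are not Data-Juicer's API.

```python
# Toy samples with a precomputed language label (hypothetical data).
samples = [
    {"text": "hello world", "lang": "en"},
    {"text": "你好,世界", "lang": "zh"},
    {"text": "bonjour", "lang": "fr"},
]


def keep_lang(data, lang):
    """Keep only samples whose label equals `lang` (illustrative helper)."""
    return [s for s in data if s["lang"] == lang]


# Chaining two single-language filters: stage 2 only sees stage 1's output,
# so the Chinese sample is already gone before the "zh" filter runs.
after_en = keep_lang(samples, "en")
chained = keep_lang(after_en, "zh")

# What is actually wanted: one pass that keeps either language (logical OR).
wanted = [s for s in samples if s["lang"] in {"en", "zh"}]

print(len(chained), len(wanted))
```

The chained result is empty, while the single pass with a set of allowed languages keeps the English and Chinese samples and drops the French one.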

Additional

No response

@pkugyf pkugyf added the question Further information is requested label Dec 7, 2023
Collaborator

HYLcool commented Dec 7, 2023

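Hi, thanks for the suggestion!

Currently language_id_score_filter can indeed only keep samples of a single language, but we think your suggestion is a very good one. We will consider making this operator support keeping samples in multiple languages in the future, though that may take some development time.

Please stay tuned!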

@HYLcool HYLcool added the enhancement New feature or request label Dec 7, 2023
@HYLcool HYLcool self-assigned this Dec 7, 2023
@HYLcool HYLcool moved this from Todo to In Progress in data-juicer Dec 21, 2023
@HYLcool HYLcool linked a pull request Dec 22, 2023 that will close this issue
@github-project-automation github-project-automation bot moved this from In Progress to Done in data-juicer Dec 26, 2023
Collaborator

HYLcool commented Dec 26, 2023

Hi, in the latest code on the main branch, the language_id_score_filter operator now supports keeping multiple languages at the same time. An example:

process:
  - language_id_score_filter:
      lang: [en, zh]  # the parameter is a list of the languages to keep
      min_score: 0.9
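
For readers curious how a `lang` parameter that accepts either a single language or a list might behave, here is a rough Python sketch. It is not the operator's actual implementation: `normalize_langs` and `keep_sample` are hypothetical names, the predicted language and score are passed in directly instead of coming from a language-identification model, and treating an empty `lang` as "no language restriction" is an assumption.

```python
from typing import Iterable, List, Optional, Union

LangSpec = Optional[Union[str, Iterable[str]]]


def normalize_langs(lang: LangSpec) -> List[str]:
    """Turn a single code, a list of codes, or nothing into a list."""
    if not lang:
        return []          # assumption: empty means no language restriction
    if isinstance(lang, str):
        return [lang]      # backward-compatible single-language form
    return list(lang)


def keep_sample(pred_lang: str, pred_score: float,
                lang: LangSpec, min_score: float = 0.9) -> bool:
    """Keep a sample if its predicted language is allowed and confident."""
    allowed = normalize_langs(lang)
    if allowed and pred_lang not in allowed:
        return False
    return pred_score >= min_score


# Usage mirroring the config above: lang=[en, zh], min_score=0.9.
print(keep_sample("zh", 0.95, ["en", "zh"]))
print(keep_sample("fr", 0.99, ["en", "zh"]))
```

Normalizing the parameter up front keeps old single-language configs working while letting new configs pass a list.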

Thanks again for your suggestion!

2 participants