[Tokenizer] Add Chat template #8226
Conversation
Thanks for your contribution!
Remove.
Codecov Report

Attention: Patch coverage is

```
@@            Coverage Diff             @@
##           develop    #8226      +/-   ##
===========================================
- Coverage    55.37%   55.36%   -0.01%
===========================================
  Files          613      614       +1
  Lines        95870    95412     -458
===========================================
- Hits         53084    52824     -260
+ Misses       42786    42588     -198
```

View full report in Codecov by Sentry.
You may also need to do the following:
- Update every tokenizer_config.json that supports chat-template.
- Test the existing multi-turn dialogue training and inference flows to make sure the whole pipeline still works correctly.
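For context, the `chat_template` entry in tokenizer_config.json is conventionally a Jinja template string rendered over a list of role/content messages. The template string and special-token markers below are illustrative only, not the actual PaddleNLP defaults; this is a minimal sketch of how such an entry renders:

```python
# Sketch: rendering a Jinja-style chat_template of the kind stored in
# tokenizer_config.json. The template string and <|role|> markers here
# are hypothetical examples, not the project's real defaults.
from jinja2 import Template

chat_template = (
    "{% for message in messages %}"
    "<|{{ message['role'] }}|>{{ message['content'] }}\n"
    "{% endfor %}"
    "<|assistant|>"  # generation prompt: cue the model to answer next
)

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi, how can I help?"},
    {"role": "user", "content": "Tell me a joke."},
]

rendered = Template(chat_template).render(messages=messages)
print(rendered)
```

Because the template lives in the config file, switching a model's conversation format becomes a data change rather than a code change, which is the motivation for updating every tokenizer_config.json together.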
```python
) -> str | dict[str, numpy.ndarray | paddle.Tensor]:
    if isinstance(conversation, str):
        conversations = [{"role": "user", "content": conversation}]
    elif isinstance(conversation, list):
```
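The branching above normalizes the accepted input types: a bare string is wrapped into a single-turn conversation, while a list of message dicts passes through. A self-contained sketch of that normalization (the function name `normalize_conversation` is hypothetical):

```python
# Sketch of the input normalization shown in the snippet above:
# a bare string becomes a one-turn user message; a list of
# role/content dicts is used as-is.
from typing import Dict, List, Union

def normalize_conversation(
    conversation: Union[str, List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    if isinstance(conversation, str):
        return [{"role": "user", "content": conversation}]
    elif isinstance(conversation, list):
        return list(conversation)
    raise TypeError(f"Unsupported conversation type: {type(conversation)}")

print(normalize_conversation("Hello"))
# [{'role': 'user', 'content': 'Hello'}]
```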
In addition, test whether the old and new chat_template behave as expected in Predictor, and also verify that gradio_ui can use both the old and the new chat_template.
```
@@ -692,6 +753,70 @@ def encode_chat_inputs(self, conversations: List[List[str, str]], context_data:
        result["conversations"] = conversation_ids
        return result

    def _encode_chat_inputs(
```
Detached from the previously designed training/inference-unified ChatTemplate, this function has very limited applicability; it basically cannot be used on its own.
So I would advise against putting the encode_chat_inputs logic into the tokenizer; try to move it into the preprocessing instead.
That means the adjustment here could be fairly large.
Considering that encode_chat_inputs is currently widely used, removing it could affect a fairly large surface. Could we consider the following strategy instead:
- Default tgt/src split: src does not contain the bot start token, i.e., tgt contains the complete user turn plus the bot start token.
- If an override is needed, define it separately in the tokenizer class, e.g., for qwen.
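One way to read the proposed default split is that the sequence is cut right before the bot start token, so src excludes it and tgt begins with it. A hypothetical sketch of that rule; the `<bot>`/`<user>` marker strings and the function name are illustrative only, not the project's actual special tokens:

```python
# Hypothetical sketch of the proposed default src/tgt split:
# src stops right before the bot start token, so tgt begins with it.
# The "<bot>"/"<user>" markers below are illustrative, not real tokens.
from typing import List, Tuple

BOT_START = "<bot>"

def split_src_tgt(tokens: List[str]) -> Tuple[List[str], List[str]]:
    if BOT_START not in tokens:
        return tokens, []  # no bot turn yet: everything is context
    idx = tokens.index(BOT_START)
    return tokens[:idx], tokens[idx:]

turn = ["<user>", "Hello", BOT_START, "Hi", "there"]
src, tgt = split_src_tgt(turn)
print(src)  # ['<user>', 'Hello']
print(tgt)  # ['<bot>', 'Hi', 'there']
```

Tokenizers whose formats do not fit this default (e.g., qwen) would then override just the split rule in their own class, leaving the shared encode_chat_inputs entry point intact.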
PR types
Function optimization
PR changes
APIs
Description
Add `chat_template` in the config file to load the template.