Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New schema: Add chat schema #679

Merged
merged 4 commits into from
May 31, 2024

Conversation

patrickamadeus
Copy link
Collaborator

Adding new chat schema to support #635

{
"id": datasets.Value("string"),
"input": datasets.Sequence({
"role": datasets.ClassLabel(names=["system", "user", "assistant"]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the role should be implemented as string instead of ClassLabel due to 2 reasons:

  1. There might be some cases where the role can be multiple for a single-sequence dialogues
  2. The actual data (when the examples are generated) will result in an integer instead of string

# 2. defining meta as dict of key with intended colname meta and its val with dataset.Features class
# in `_info` Dataloader method then populate it with the values in `_general_examples` Dataloader method
}
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add whitespace at EoF

@@ -42,4 +44,4 @@
"text2text_features",
"video_features",
"tod_features",
]
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add whitespace at EoF

@sabilmakbar
Copy link
Collaborator

note to Holy & Sam:

btw we might want to revisit the ToD (Task-Oriented Dialogue) & DS (Dialogue System) tasks later on whether we should use this schema too (since HF tokenizers already support the format of chat -- specifically on the input data, which has the same schema as HF Chat).

@holylovenia
Copy link
Contributor

note to Holy & Sam:

btw we might want to revisit the ToD (Task-Oriented Dialogue) & DS (Dialogue System) tasks later on whether we should use this schema too (since HF tokenizers already support the format of chat -- specifically on the input data, which has the same schema as HF Chat).

Hi @sabilmakbar, sorry for the late response. Replied on #635.

@holylovenia
Copy link
Contributor

Hi @patrickamadeus and @sabilmakbar, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

add whitespace at EoF
change the `role` field schema from ClassLabel to string
@sabilmakbar
Copy link
Collaborator

did the changes bcs the initial assignee had no response

@sabilmakbar sabilmakbar merged commit 0665774 into SEACrowd:master May 31, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants