Equivalent operation to perform unbatch for huggingface datasets #2767
Comments
Hi @lhoestq
Hi,
import collections

def unbatch(batch):
    new_batch = collections.defaultdict(list)
    keys = batch.keys()
    # Iterate over the rows of the batch (one tuple of column values per example)
    for values in zip(*batch.values()):
        ex = {k: v for k, v in zip(keys, values)}
        inputs = f"record query: {ex['query']} entities: {', '.join(ex['entities'])} passage: {ex['passage']}"
        # Replicate the input once per answer, so each answer becomes its own example
        new_batch["inputs"].extend([inputs] * len(ex["answers"]))
        new_batch["targets"].extend(ex["answers"])
    return new_batch

dset = dset.map(unbatch, batched=True, remove_columns=dset.column_names)
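For context, here is a minimal end-to-end sketch of how the snippet above could be applied; it assumes the SuperGLUE "record" config can be loaded with load_dataset and exposes the passage/query/entities/answers columns used above:

from datasets import load_dataset

# Assumption: the ReCoRD config provides "passage", "query", "entities" and "answers" columns
dset = load_dataset("super_glue", "record", split="train")
dset = dset.map(unbatch, batched=True, remove_columns=dset.column_names)
print(dset[0])  # e.g. {'inputs': 'record query: ... passage: ...', 'targets': '...'}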
Dear @mariosasko, thank you very much for your help and time.
Hi @mariosasko
Thanks a lot again, this was a super great way to do it.
Hi,
I need the equivalent of an "unbatch" operation (as in TensorFlow) for a huggingface dataset. I could not find this operation, so could you kindly point me to how I can do it? Here is the problem I am trying to solve:
I am working with the "record" dataset in SuperGLUE, and I need to replicate each entry of the dataset for each answer, to make it similar to what T5 originally did:
https://github.com/google-research/text-to-text-transfer-transformer/blob/3c58859b8fe72c2dbca6a43bc775aa510ba7e706/t5/data/preprocessors.py#L925
For example, a typical example from ReCoRD might look like:
{
    'passage': 'This is the passage.',
    'query': 'A @Placeholder is a bird.',
    'entities': ['penguin', 'potato', 'pigeon'],
    'answers': ['penguin', 'pigeon'],
}
and I need a processor that would turn this example into the following two examples:
{
    'inputs': 'record query: A @Placeholder is a bird. entities: penguin, '
              'potato, pigeon passage: This is the passage.',
    'targets': 'penguin',
}
and
{
    'inputs': 'record query: A @Placeholder is a bird. entities: penguin, '
              'potato, pigeon passage: This is the passage.',
    'targets': 'pigeon',
}
To do this, one needs an unbatch operation, since each entry can map to multiple samples depending on the number of answers. I am not sure how to perform this operation with the huggingface datasets library and would greatly appreciate your help.
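Conceptually, what I am after is something like this plain-Python sketch (the function name is only illustrative):

def expand_record_example(ex):
    # Build the shared input string once per original ReCoRD example
    inputs = (
        f"record query: {ex['query']} "
        f"entities: {', '.join(ex['entities'])} "
        f"passage: {ex['passage']}"
    )
    # Emit one new example per answer
    return [{"inputs": inputs, "targets": answer} for answer in ex["answers"]]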
@lhoestq
Thank you very much.