
Equivalent operation to perform unbatch for huggingface datasets #2767

Closed
dorooddorood606 opened this issue Aug 6, 2021 · 5 comments

dorooddorood606 commented Aug 6, 2021

Hi,
I need to use the "unbatch" operation from tensorflow on a huggingface dataset, but I could not find this operation. Could you kindly direct me on how I can do it? Here is the problem I am trying to solve:

I am considering the "record" dataset in SuperGlue, and I need to replicate each entry of the dataset for each answer, to make it similar to what T5 originally did:

https://github.com/google-research/text-to-text-transfer-transformer/blob/3c58859b8fe72c2dbca6a43bc775aa510ba7e706/t5/data/preprocessors.py#L925

For example, a typical example from ReCoRD might look like:
{
'passage': 'This is the passage.',
'query': 'A @Placeholder is a bird.',
'entities': ['penguin', 'potato', 'pigeon'],
'answers': ['penguin', 'pigeon'],
}
and I need a processor which would turn this example into the following two examples:
{
'inputs': 'record query: A @Placeholder is a bird. entities: penguin, '
'potato, pigeon passage: This is the passage.',
'targets': 'penguin',
}
and
{
'inputs': 'record query: A @Placeholder is a bird. entities: penguin, '
'potato, pigeon passage: This is the passage.',
'targets': 'pigeon',
}

To do this, one needs unbatch, as each entry can map to multiple samples depending on the number of answers. I am not sure how to perform this operation with the huggingface datasets library and would greatly appreciate your help.
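
For reference, here is a minimal sketch of what I mean by tensorflow's unbatch, on toy values (just an illustration, not ReCoRD data):

import tensorflow as tf

# A batched dataset: each element is a vector holding two samples.
batched = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])
# unbatch() splits each element along its first axis into separate examples.
flat = batched.unbatch()
print(list(flat.as_numpy_iterator()))  # [1, 2, 3, 4]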

@lhoestq

Thank you very much.

dorooddorood606 added the bug label on Aug 6, 2021

dorooddorood606 commented Aug 6, 2021

Hi @lhoestq,
Maybe it is clearer to explain it like this: currently the map function maps one example to "one" modified example. Let's assume we want to map one example to "multiple" examples, where we do not know in advance how many examples there will be per entry. I would greatly appreciate it if you could tell me how I can handle this operation. Thanks a lot.

@jackfeinmann5

Hi,
this is also my question: how can one perform an operation similar to tensorflow's "unbatch" in the great huggingface datasets library?
Thanks.

@mariosasko
Collaborator

Hi,

Dataset.map in batched mode allows you to map a single row to multiple rows. So to perform "unbatch", you can do the following:

import collections

def unbatch(batch):
    # `batch` maps each column name to a list of values for that column;
    # build the new, longer batch in the same format.
    new_batch = collections.defaultdict(list)
    keys = batch.keys()
    # Iterate over the batch row by row.
    for values in zip(*batch.values()):
        ex = {k: v for k, v in zip(keys, values)}
        inputs = f"record query: {ex['query']} entities: {', '.join(ex['entities'])} passage: {ex['passage']}"
        # Emit one copy of the inputs per answer, paired with that answer.
        new_batch["inputs"].extend([inputs] * len(ex["answers"]))
        new_batch["targets"].extend(ex["answers"])
    return new_batch

dset = dset.map(unbatch, batched=True, remove_columns=dset.column_names)
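
To make this concrete, here is a toy input in the same schema (hypothetical values, not taken from the actual dataset) and the result of mapping unbatch over it:

from datasets import Dataset

# Hypothetical one-row dataset following the ReCoRD-like schema used above.
dset = Dataset.from_dict({
    "passage": ["This is the passage."],
    "query": ["A @Placeholder is a bird."],
    "entities": [["penguin", "potato", "pigeon"]],
    "answers": [["penguin", "pigeon"]],
})
dset = dset.map(unbatch, batched=True, remove_columns=dset.column_names)
print(dset["targets"])  # ['penguin', 'pigeon'] -- one output row per answer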


dorooddorood606 commented Aug 7, 2021

Dear @mariosasko,
First, thank you very much for coming back to me on this, I appreciate it a lot. I tried this solution, but I am getting errors. Would you mind giving me one test example so I can run your code and better understand the format of the inputs to your function?
In this function https://github.com/google-research/text-to-text-transfer-transformer/blob/3c58859b8fe72c2dbca6a43bc775aa510ba7e706/t5/data/preprocessors.py#L952 they copy each example according to the number of "answers". Do you mean one should not do the copying part and should use your function directly?

Thank you very much for your help and time.

@dorooddorood606
Author

Hi @mariosasko,
I think I finally got it. I think you mean to do things in one step. Here is the full example for completeness:

import collections
import re

import numpy as np
import datasets

def unbatch(batch):
    new_batch = collections.defaultdict(list)
    keys = batch.keys()
    for values in zip(*batch.values()):
        ex = {k: v for k, v in zip(keys, values)}
        # Updates the passage: turn the "@highlight" markers into sentence breaks.
        passage = ex['passage']
        passage = re.sub(r'(\.|\?|\!|\"|\')\n@highlight\n', r'\1 ', passage)
        passage = re.sub(r'\n@highlight\n', '. ', passage)
        inputs = f"record query: {ex['query']} entities: {', '.join(ex['entities'])} passage: {passage}"
        # Duplicates the sample based on the number of answers,
        # keeping entries without answers exactly once.
        num_answers = len(ex["answers"])
        num_duplicates = np.maximum(1, num_answers)
        new_batch["inputs"].extend([inputs] * num_duplicates)
        new_batch["targets"].extend(ex["answers"] if num_answers > 0 else ["<unk>"])
    return new_batch

data = datasets.load_dataset('super_glue', 'record', split="train", script_version="master")
data = data.map(unbatch, batched=True, remove_columns=data.column_names)
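
For completeness, the zero-answer fallback can be checked on a hypothetical one-row batch (toy values, not real ReCoRD data):

# Hypothetical entry with "@highlight" markers and no answers.
toy = datasets.Dataset.from_dict({
    "passage": ["Nothing to find here.\n@highlight\nStill nothing."],
    "query": ["A @Placeholder is missing."],
    "entities": [["foo", "bar"]],
    "answers": [[]],
})
toy = toy.map(unbatch, batched=True, remove_columns=toy.column_names)
print(toy["targets"])  # ['<unk>'] -- the unanswered entry is kept once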

Thanks a lot again, this was a super great way to do it.
