
Equivalent operation to perform unbatch for huggingface datasets #2767

Closed
dorooddorood606 opened this issue Aug 6, 2021 · 5 comments

dorooddorood606 commented Aug 6, 2021

Hi,
I need to use the "unbatch" operation from tensorflow on a huggingface dataset, but I could not find this operation. Could you kindly direct me on how I can do it? Here is the problem I am trying to solve:

I am considering the "record" dataset in SuperGlue, and I need to replicate each entry of the dataset for each answer, to make it similar to what T5 originally did:

https://github.com/google-research/text-to-text-transfer-transformer/blob/3c58859b8fe72c2dbca6a43bc775aa510ba7e706/t5/data/preprocessors.py#L925

For example, a typical example from ReCoRD might look like:
{
'passage': 'This is the passage.',
'query': 'A @Placeholder is a bird.',
'entities': ['penguin', 'potato', 'pigeon'],
'answers': ['penguin', 'pigeon'],
}
and I need a processor which would turn this example into the following two examples:
{
'inputs': 'record query: A @Placeholder is a bird. entities: penguin, '
'potato, pigeon passage: This is the passage.',
'targets': 'penguin',
}
and
{
'inputs': 'record query: A @Placeholder is a bird. entities: penguin, '
'potato, pigeon passage: This is the passage.',
'targets': 'pigeon',
}

To do this, one needs unbatch, as each entry can map to multiple samples depending on the number of answers. I am not sure how to perform this operation with the huggingface datasets library and would greatly appreciate your help.
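
For reference, here is a minimal sketch of what I mean by tensorflow's unbatch, on toy values (just an illustration, not ReCoRD data):

import tensorflow as tf

# A batched dataset: each element is a vector holding two samples.
batched = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])
# unbatch() splits each element along its first axis into separate examples.
flat = batched.unbatch()
print(list(flat.as_numpy_iterator()))  # [1, 2, 3, 4]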

@lhoestq

Thank you very much.

dorooddorood606 added the bug label on Aug 6, 2021

dorooddorood606 commented Aug 6, 2021

Hi @lhoestq,
Maybe it is clearer to explain it like this: currently the map function maps one example to "one" modified example. Let's assume we want to map one example to "multiple" examples, where we do not know in advance how many examples there will be per entry. I would greatly appreciate it if you could tell me how I can handle this operation. Thanks a lot.

@jackfeinmann5

Hi,
this is also my question: how can one perform an operation similar to tensorflow's "unbatch" in the great huggingface datasets library?
Thanks.

@mariosasko
Collaborator

Hi,

Dataset.map in batched mode allows you to map a single row to multiple rows. So to perform "unbatch", you can do the following:

import collections

def unbatch(batch):
    # `batch` maps each column name to a list of values for that column;
    # build the new, longer batch in the same format.
    new_batch = collections.defaultdict(list)
    keys = batch.keys()
    # Iterate over the batch row by row.
    for values in zip(*batch.values()):
        ex = {k: v for k, v in zip(keys, values)}
        inputs = f"record query: {ex['query']} entities: {', '.join(ex['entities'])} passage: {ex['passage']}"
        # Emit one copy of the inputs per answer, paired with that answer.
        new_batch["inputs"].extend([inputs] * len(ex["answers"]))
        new_batch["targets"].extend(ex["answers"])
    return new_batch

dset = dset.map(unbatch, batched=True, remove_columns=dset.column_names)
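
To make this concrete, here is a toy input in the same schema (hypothetical values, not taken from the actual dataset) and the result of mapping unbatch over it:

from datasets import Dataset

# Hypothetical one-row dataset following the ReCoRD-like schema used above.
dset = Dataset.from_dict({
    "passage": ["This is the passage."],
    "query": ["A @Placeholder is a bird."],
    "entities": [["penguin", "potato", "pigeon"]],
    "answers": [["penguin", "pigeon"]],
})
dset = dset.map(unbatch, batched=True, remove_columns=dset.column_names)
print(dset["targets"])  # ['penguin', 'pigeon'] -- one output row per answer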


dorooddorood606 commented Aug 7, 2021

Dear @mariosasko,
First, thank you very much for coming back to me on this, I appreciate it a lot. I tried this solution, but I am getting errors. Would you mind giving me one test example so I can run your code and better understand the format of the inputs to your function?
In this function https://github.com/google-research/text-to-text-transfer-transformer/blob/3c58859b8fe72c2dbca6a43bc775aa510ba7e706/t5/data/preprocessors.py#L952 they copy each example according to the number of "answers". Do you mean one should not do the copying part and should use your function directly?

Thank you very much for your help and time.

@dorooddorood606
Author

Hi @mariosasko,
I think I finally got it. I think you mean to do things in one step. Here is the full example for completeness:

import collections
import re

import numpy as np
import datasets

def unbatch(batch):
    new_batch = collections.defaultdict(list)
    keys = batch.keys()
    for values in zip(*batch.values()):
        ex = {k: v for k, v in zip(keys, values)}
        # Updates the passage: turn the "@highlight" markers into sentence breaks.
        passage = ex['passage']
        passage = re.sub(r'(\.|\?|\!|\"|\')\n@highlight\n', r'\1 ', passage)
        passage = re.sub(r'\n@highlight\n', '. ', passage)
        inputs = f"record query: {ex['query']} entities: {', '.join(ex['entities'])} passage: {passage}"
        # Duplicates the sample based on the number of answers,
        # keeping entries without answers exactly once.
        num_answers = len(ex["answers"])
        num_duplicates = np.maximum(1, num_answers)
        new_batch["inputs"].extend([inputs] * num_duplicates)
        new_batch["targets"].extend(ex["answers"] if num_answers > 0 else ["<unk>"])
    return new_batch

data = datasets.load_dataset('super_glue', 'record', split="train", script_version="master")
data = data.map(unbatch, batched=True, remove_columns=data.column_names)
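
For completeness, the zero-answer fallback can be checked on a hypothetical one-row batch (toy values, not real ReCoRD data):

# Hypothetical entry with "@highlight" markers and no answers.
toy = datasets.Dataset.from_dict({
    "passage": ["Nothing to find here.\n@highlight\nStill nothing."],
    "query": ["A @Placeholder is missing."],
    "entities": [["foo", "bar"]],
    "answers": [[]],
})
toy = toy.map(unbatch, batched=True, remove_columns=toy.column_names)
print(toy["targets"])  # ['<unk>'] -- the unanswered entry is kept once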

Thanks a lot again, this was a super great way to do it.
