
Fix UnboundLocalError if preprocessing returns an empty list #6346

Merged
merged 6 commits on Oct 25, 2023

Conversation

@cwallenwein (Contributor) commented Oct 24, 2023

If this tokenization function is used with an IterableDataset and no sample is at least as long as the context length, input_batch will be an empty list.

def tokenize(batch, tokenizer, context_length):
    outputs = tokenizer(
        batch["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # keep only the chunks that exactly fill the context window
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

dataset.map(
    tokenize,
    batched=True,
    batch_size=batch_size,
    fn_kwargs={"context_length": context_length, "tokenizer": tokenizer},
    remove_columns=dataset.column_names,
)
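For a self-contained illustration that needs neither transformers nor datasets installed, a stub "tokenizer" (a hypothetical stand-in, not the real library API) whose chunks never reach context_length makes the preprocessing return an empty input_batch, which is exactly the batch that trips the bug downstream:

```python
# Stand-in sketch: every "tokenized" chunk is shorter than context_length,
# so the length == context_length filter keeps nothing.
def stub_tokenizer(texts, context_length):
    # pretend each text tokenizes to one chunk of len(text) token ids,
    # truncated to context_length
    ids = [list(range(min(len(t), context_length))) for t in texts]
    return {"input_ids": ids, "length": [len(x) for x in ids]}

def tokenize(batch, context_length):
    outputs = stub_tokenizer(batch["text"], context_length)
    input_batch = [
        ids
        for length, ids in zip(outputs["length"], outputs["input_ids"])
        if length == context_length
    ]
    return {"input_ids": input_batch}

print(tokenize({"text": ["hi", "hey"]}, context_length=128))
# -> {'input_ids': []}
```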

This will throw UnboundLocalError: local variable 'batch_idx' referenced before assignment, because the for loop never executes, so batch_idx is never assigned:

for batch_idx, example in enumerate(_batch_to_examples(transformed_batch)):
    yield new_key, example
current_idx += batch_idx + 1
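The failure mode can be reproduced in isolation with a hypothetical consume helper standing in for the library loop: when the iterable is empty, the loop body never binds batch_idx, and the statement after the loop raises.

```python
def consume(examples):
    # mirrors the library loop: batch_idx only exists if the loop ran
    for batch_idx, example in enumerate(examples):
        pass  # the real code yields (new_key, example) here
    return batch_idx + 1  # raises UnboundLocalError for empty input

print(consume(["a", "b"]))  # -> 2
try:
    consume([])
except UnboundLocalError as err:
    print("raised:", err)
```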

Some possible solutions:

for batch_idx, example in enumerate(_batch_to_examples(transformed_batch)):
    yield new_key, example
try:
    current_idx += batch_idx + 1
except UnboundLocalError:
    current_idx += 1

or

batch_idx = 0
for batch_idx, example in enumerate(_batch_to_examples(transformed_batch)):
    yield new_key, example
current_idx += batch_idx + 1
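A quick check of the pre-initialization variant, using the same hypothetical consume-style stand-in for the library loop: binding batch_idx = 0 before the loop means an empty batch advances the index by 1 instead of raising, which matches the fallback branch of the try/except variant.

```python
def consume_fixed(examples):
    batch_idx = 0  # pre-bound, so the post-loop increment is always safe
    for batch_idx, example in enumerate(examples):
        pass  # the real code yields (new_key, example) here
    return batch_idx + 1

print(consume_fixed([]))          # -> 1
print(consume_fixed(["a", "b"]))  # -> 2
```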

For causal language modeling, if you're using the preprocessing described in this HF YouTube video:
https://youtu.be/ma1TrR7gE7I?si=T1PdOEvcQwDJuGtt&t=117

it throws the following error when no sample is at least as long as the context length:
UnboundLocalError: local variable 'batch_idx' referenced before assignment
@cwallenwein changed the title from "Fix UnboundLocalError if preprocessing doesn't return a batch" to "Fix UnboundLocalError if preprocessing doesn't returns an empty list" on Oct 24, 2023
@mariosasko (Collaborator) left a comment
Thanks!

The batched .filter also has this issue, so feel free to fix it too as part of this PR.

(Two resolved review threads on src/datasets/iterable_dataset.py, now outdated)
@HuggingFaceDocBuilderDev commented Oct 24, 2023

The documentation is not available anymore as the PR was closed or merged.

cwallenwein and others added 2 commits October 25, 2023 10:12
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
@mariosasko mariosasko merged commit 3ab9de6 into huggingface:main Oct 25, 2023
11 of 12 checks passed
@github-actions (bot) posted benchmark results for PyArrow==8.0.0 and PyArrow==latest (benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter).

@cwallenwein changed the title from "Fix UnboundLocalError if preprocessing doesn't returns an empty list" to "Fix UnboundLocalError if preprocessing returns an empty list" on Oct 25, 2023
albertvillanova pushed a commit that referenced this pull request Nov 15, 2023
…6346)

* Fix 

For causal language modeling, if you're using the preprocessing as described in this HF YouTube video:
https://youtu.be/ma1TrR7gE7I?si=T1PdOEvcQwDJuGtt&t=117

Throws the following error if no sample is longer than the context length:
UnboundLocalError: local variable 'batch_idx' referenced before assignment

* Apply suggestions from code review

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Fix indent + improve readability

* Fixes

* Style

* Style again

---------

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>