
Batch map raises TypeError: '>=' not supported between instances of 'NoneType' and 'int' #6022

Closed
codingl2k1 opened this issue Jul 12, 2023 · 1 comment · Fixed by #6023

@codingl2k1

Describe the bug

When mapping some datasets with `batched=True`, `datasets` may raise an exception:

Traceback (most recent call last):
  File "/Users/codingl2k1/Work/datasets/venv/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1328, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 3483, in _map_single
    writer.write_batch(batch)
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_writer.py", line 549, in write_batch
    array = cast_array_to_feature(col_values, col_type) if col_type is not None else col_values
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 2063, in cast_array_to_feature
    return feature.cast_storage(array)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/features/features.py", line 1098, in cast_storage
    if min_max["max"] >= self.num_classes:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/codingl2k1/Work/datasets/t1.py", line 33, in <module>
    ds = ds.map(transforms, num_proc=14, batched=True, batch_size=5)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/dataset_dict.py", line 850, in map
    {
  File "/Users/codingl2k1/Work/datasets/src/datasets/dataset_dict.py", line 851, in <dictcomp>
    k: dataset.map(
       ^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 577, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 542, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 3179, in map
    for rank, done, content in iflatmap_unordered(
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1368, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1368, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/venv/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: '>=' not supported between instances of 'NoneType' and 'int'

Steps to reproduce the bug

  1. Check out the latest `main` branch of `datasets`.
  2. Run the code:
from datasets import load_dataset

def transforms(examples):
    # examples["pixel_values"] = [image.convert("RGB").resize((100, 100)) for image in examples["image"]]
    return examples

ds = load_dataset("scene_parse_150")
ds = ds.map(transforms, num_proc=14, batched=True, batch_size=5)
print(ds)

Expected behavior

`map` should complete without raising an exception.

Environment info

Datasets: b8067c0
Python: 3.11.4
System: macOS

@mariosasko
Collaborator

Thanks for reporting! I've opened a PR with a fix.
