
Align the Dataset and IterableDataset processing API #3444

Open
lhoestq opened this issue Dec 16, 2021 · 10 comments
Labels: enhancement (New feature or request), generic discussion (Generic discussion on the library)

Comments

lhoestq (Member) commented Dec 16, 2021

Intro

items marked DONE ✅ below are done already :)

Currently the two classes have two distinct APIs for processing:

The .map() method

Both have those parameters in common: function, batched, batch_size

  • IterableDataset is missing those parameters:
    with_indices, with_rank, input_columns, drop_last_batch, remove_columns, features, disable_nullable, fn_kwargs, num_proc

  • Dataset also has additional parameters that are exclusive, due to caching:
    keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, suffix_template, new_fingerprint

  • There is also an important difference in terms of behavior:
    Dataset.map adds new columns (with dict.update)
    BUT
    IterableDataset.map discards the previous columns (it overwrites the dict)
    IMO the two methods should have the same behavior (see the sketch after this list). This would be an important breaking change though.

  • Dataset.map is eager while IterableDataset.map is lazy
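For reference, a minimal sketch of the Dataset.map behavior (the column names are made up for illustration):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "barbaz"]})
ds = ds.map(lambda example: {"length": len(example["text"])})
print(ds.column_names)  # ['text', 'length'] -- the new column is merged into the existing ones

# At the time of writing, IterableDataset.map would instead yield examples
# containing only the "length" key, dropping "text".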

The .shuffle() method

  • Both have an optional seed parameter, but IterableDataset additionally requires a buffer_size parameter to control the size of the local buffer used for approximate shuffling (see the sketch after this list).

  • IterableDataset is missing the parameter generator

  • Also Dataset has exclusive parameters due to caching: keep_in_memory, load_from_cache_file, indices_cache_file_name, writer_batch_size, new_fingerprint
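A minimal sketch of the buffered shuffle on the streaming side (the dataset name is just an example, any streamable dataset works):

from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)  # example dataset
ids = ids.shuffle(seed=42, buffer_size=10_000)  # fills a 10k-example buffer and samples from it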

The .with_format() method

  • IterableDataset only supports "torch" (it lacks tf, jax, pandas, arrow) and is missing the parameters columns, output_all_columns and format_kwargs (see the sketch after this list)
  • other methods like set_format, reset_format or formatted_as are also missing
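For comparison, a minimal sketch of what these parameters do on the Dataset side (column names made up for illustration):

from datasets import Dataset

ds = Dataset.from_dict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
ds = ds.with_format("torch", columns=["a"], output_all_columns=False)
print(ds[0])  # {'a': tensor(1)} -- only the selected column, returned as a torch tensor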

Other methods

  • Both have the same remove_columns method
  • IterableDataset is missing: cast, cast_column, filter, rename_column, rename_columns, class_encode_column, flatten, prepare_for_task, train_test_split, shard
  • Some other methods are missing but we can discuss them: set_transform, formatted_as, with_transform
  • And others don't really make sense for an iterable dataset: select, sort, add_column, add_item
  • Dataset is missing skip and take, which IterableDataset implements (see the sketch after this list).
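For reference, a minimal sketch of skip and take on the streaming side (dataset name made up for illustration):

from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)  # example dataset
head = ids.take(1000)  # lazily yields only the first 1000 examples
tail = ids.skip(1000)  # lazily skips the first 1000 examples and yields the rest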

Questions

I think it would be nice to be able to switch between streaming and regular dataset easily, without changing the processing code significantly.

  1. What should and shouldn't be aligned between these two APIs?

IMO the minimum is to align the main processing methods.

It would mean breaking the current IterableDataset.map to give it the same behavior as Dataset.map (adding columns with dict.update), and adding multiprocessing as well as the missing parameters. DONE ✅

It would also mean implementing the missing methods: cast, cast_column, filter, rename_column, rename_columns, class_encode_column, flatten, prepare_for_task, train_test_split, shard. WIP 🟠

  2. What are the breaking changes for IterableDataset?

The main breaking change would be the change of behavior of IterableDataset.map, because currently it discards all the previous columns instead of keeping them. DONE ✅

  3. Shall we also make some changes to regular datasets?

I agree the simplest would be to have the exact same methods for both Dataset and IterableDataset. However, this is probably not a good idea because it would prevent users from getting the most out of each. That's why we can keep some aspects of regular datasets as they are:

  • keep the eager Dataset.map with caching
  • keep the with_transform method for lazy processing (see the sketch after this list)
  • keep Dataset.select (it could also be added to IterableDataset even though it's not recommended)
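As a reminder of what the lazy path looks like on a regular Dataset today, a minimal sketch (column names made up for illustration):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["foo", "barbaz"]})
# the transform receives a batch (a dict of lists) and runs on the fly at access time; nothing is cached
ds = ds.with_transform(lambda batch: {"length": [len(t) for t in batch["text"]]})
print(ds[:2])  # {'length': [3, 6]}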

We could have a completely aligned map method if both methods were lazy by default, but this is a very big breaking change so I'm not sure we can consider doing that.

For information, TFDS does lazy map by default, and has an additional .cache() method.

Opinions?

I'd love to gather some opinions about this here. If the two APIs are more aligned it would be awesome for the examples in transformers, and it would create a satisfactory experience for users that want to switch from one mode to the other.

cc @mariosasko @albertvillanova @thomwolf @patrickvonplaten @sgugger

thomwolf (Member) commented:

Yes I agree, these should be as aligned as possible. Maybe we can also check the feedback in the survey at http://hf.co/oss-survey and see if people mentioned related things about the API (in particular, if we go the breaking-change route, it would be good to be sure we are taking the right direction for the community).

mariosasko (Collaborator) commented Dec 16, 2021

I like this proposal.

> There is also an important difference in terms of behavior:
> Dataset.map adds new columns (with dict.update)
> BUT
> IterableDataset discards previous columns (it overwrites the dict)
> IMO the two methods should have the same behavior. This would be an important breaking change though.

> The main breaking change would be the change of behavior of IterableDataset.map, because currently it discards all the previous columns instead of keeping them.

Yes, this behavior of IterableDataset.map was surprising to me the first time I used it because I was expecting the same behavior as Dataset.map, so I'm OK with the breaking change here.

> IterableDataset only supports "torch" (it misses tf, jax, pandas, arrow) and is missing the parameters: columns, output_all_columns and format_kwargs

+ it's also missing the actual formatting code (we return unformatted tensors)

> We could have a completely aligned map method if both methods were lazy by default, but this is a very big breaking change so I'm not sure we can consider doing that.
>
> For information, TFDS does lazy map by default, and has an additional .cache() method.

If I understand this part correctly, the idea would be for Dataset.map to behave similarly to Dataset.with_transform (lazy processing) and to have an option to cache processed data (with .cache()). This idea is really nice because it can also be applied to IterableDataset to fix #3142 (again we get the aligned APIs). However, this change would break a lot of things, so I'm still not sure if this is a step in the right direction (maybe it's OK for Datasets 2.0?)

> If the two APIs are more aligned it would be awesome for the examples in transformers, and it would create a satisfactory experience for users that want to switch from one mode to the other.

Yes, it would be amazing to have an option to easily switch between these two modes.

I agree with the rest.

lhoestq (Member, Author) commented Dec 16, 2021

> If I understand this part correctly, the idea would be for Dataset.map to behave similarly to Dataset.with_transform (lazy processing) and to have an option to cache processed data (with .cache()). This idea is really nice because it can also be applied to IterableDataset to fix #3142 (again we get the aligned APIs). However, this change would break a lot of things, so I'm still not sure if this is a step in the right direction (maybe it's OK for Datasets 2.0?)

Yeah, this is too big of a change in my opinion. Anyway, it's fine as it is right now with streaming=lazy and regular=eager.

tshu-w commented Mar 20, 2022

Hi, IterableDataset is also missing set_format.

lhoestq (Member, Author) commented Mar 21, 2022

Yes indeed, thanks. I added it to the list of methods to align in the first post.

gcaillaut commented:

I just ran into the problem of the missing fn_kwargs parameter in the map method. I am commenting to give a workaround in case someone has the same problem and does not find a solution.
You can wrap your function in a class that holds the extra parameters it needs, and pass instances of that class to map, like this:

def my_func(x, y, z):
    # Do things with the example x and the extra arguments y and z
    ...

class MyFuncWrapper:
    # Holds the extra arguments so that map only has to pass the example
    def __init__(self, y, z):
        self.y = y
        self.z = z

    def __call__(self, x):
        return my_func(x, self.y, self.z)

Then, pass an instance of MyFuncWrapper to the map function.
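For example, a minimal sketch (the data file and the extra argument values below are made up, just to show the call):

from datasets import load_dataset

ids = load_dataset("json", data_files="data.jsonl", split="train", streaming=True)  # hypothetical data file
ids = ids.map(MyFuncWrapper(y=1, z=2))  # each example is passed to MyFuncWrapper.__call__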

npuichigo (Contributor) commented Aug 12, 2023

Any update on this? It's almost 2024😂 @lhoestq

lhoestq (Member, Author) commented Aug 16, 2023

The main differences have been addressed (map, formatting) but there are still a few things to implement like Dataset.take, Dataset.skip, IterableDataset.set_format, IterableDataset.formatted_as, IterableDataset.reset_format.

The rest cannot be implemented in the general case. E.g. train_test_split and select can only work on an iterable dataset if the underlying dataset format allows it (we need to know the number of rows and have some sort of random access).

amin-nejad commented:

It appears IterableDataset now supports all the formats apart from pandas, but the documentation doesn't mention it yet. The docstring of with_format seems even older, incorrectly saying it only supports arrow. Are there any plans to update the documentation and add some guides on best practices?

lhoestq (Member, Author) commented Oct 8, 2024

Thanks, I updated the docstrings. Would be cool to have more examples in the docs though, if this is something you'd like to contribute ;)
