# Align the Dataset and IterableDataset processing API #3444
## Comments
Yes, I agree, these should be as aligned as possible. Maybe we can also check the feedback in the survey at http://hf.co/oss-survey and see if people mentioned related things about the API (in particular if we go the breaking-change way, it would be good to be sure we are taking the right direction for the community).
I like this proposal.
Yes, this behavior of
+ it's also missing the actual formatting code (we return unformatted tensors)
If I understand this part correctly, the idea would be for
Yes, it would be amazing to have an option to easily switch between these two modes. I agree with the rest.
Yeah, this is too big of a change in my opinion. Anyway, it's fine as it is right now with streaming=lazy and regular=eager.
Hi, `IterableDataset` is also missing `set_format`.
Yes indeed, thanks. I added it to the list of methods to align in the first post.
I just encountered the problem of the missing parameter as well. As a workaround, you can wrap the function in a class:

```python
def my_func(x, y, z):
    # Do things
    ...

class MyFuncWrapper:
    def __init__(self, y, z):
        self.y = y
        self.z = z

    def __call__(self, x):
        return my_func(x, self.y, self.z)
```

Then, give an instance of the `MyFuncWrapper` to `map`.
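As a side note, a similar effect can be had with `functools.partial`, assuming `map` calls the function with the example as its only positional argument. A small sketch (the function body and argument values here are hypothetical, just to make it runnable):

```python
from functools import partial

def my_func(x, y, z):
    # Hypothetical processing function: combines the example x
    # with two extra arguments y and z.
    return {"value": x["value"] + y + z}

# Bind the extra arguments up front; the result is a one-argument
# callable, just like a MyFuncWrapper instance.
wrapped = partial(my_func, y=10, z=100)

print(wrapped({"value": 1}))  # {'value': 111}
```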
Any update on this? It's almost 2024😂 @lhoestq |
The main differences have been addressed (map, formatting), but there are still a few things to implement, like `Dataset.take`, `Dataset.skip`, `IterableDataset.set_format`, `IterableDataset.formatted_as`, `IterableDataset.reset_format`. The rest cannot be implemented for the general case. E.g. `train_test_split` and `select` can only work on an iterable dataset if the underlying dataset format allows it (we need to know the number of rows and have some sort of random access).
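For intuition, the lazy semantics that `take` and `skip` have on an iterable dataset can be sketched with `itertools.islice` (a toy illustration, not the library implementation — on a regular `Dataset`, `select(range(n))` plays the same role because it has random access):

```python
from itertools import islice

def take(iterable, n):
    # IterableDataset-style take: consume lazily, no random access needed.
    return islice(iterable, n)

def skip(iterable, n):
    # IterableDataset-style skip: drop the first n examples lazily.
    return islice(iterable, n, None)

rows = ({"id": i} for i in range(10))
print(list(take(rows, 3)))  # [{'id': 0}, {'id': 1}, {'id': 2}]
```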
It appears |
Thanks, I updated the docstrings. It would be cool to have more examples in the docs though, if this is something you'd like to contribute ;)
## Intro

Items marked like ~~this~~ are done already :)

Currently the two classes have two distinct APIs for processing:

## The `.map()` method

Both have these parameters in common: `function`, `batched`, `batch_size`.

`IterableDataset` is missing these parameters: `with_indices`, `with_rank`, `input_columns`, `drop_last_batch`, `remove_columns`, `features`, `disable_nullable`, `fn_kwargs`, `num_proc`.

`Dataset` also has additional parameters that are exclusive, due to caching: `keep_in_memory`, `load_from_cache_file`, `cache_file_name`, `writer_batch_size`, `suffix_template`, `new_fingerprint`.

There are also important differences in behavior:

- `Dataset.map` adds new columns (with `dict.update`), BUT `IterableDataset.map` discards previous columns (it overwrites the dict). IMO the two methods should have the same behavior. This would be an important breaking change though.
- `Dataset.map` is eager while `IterableDataset.map` is lazy.
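To make the column-handling difference concrete, here is a plain-Python sketch of the two merge strategies (a toy illustration, not the library code):

```python
def map_update(example, fn):
    # Dataset.map behavior: the function's output is merged into the
    # example with dict.update, so previous columns are kept.
    out = dict(example)
    out.update(fn(example))
    return out

def map_overwrite(example, fn):
    # Old IterableDataset.map behavior: the function's output replaces
    # the example entirely, discarding previous columns.
    return fn(example)

fn = lambda ex: {"text_len": len(ex["text"])}
example = {"text": "hello", "label": 1}

print(map_update(example, fn))     # keeps 'text' and 'label', adds 'text_len'
print(map_overwrite(example, fn))  # only {'text_len': 5}
```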
## The `.shuffle()` method

Both have an optional `seed` parameter, but `IterableDataset` requires a mandatory `buffer_size` parameter to control the size of the local buffer used for approximate shuffling.

`IterableDataset` is missing the `generator` parameter.

Also, `Dataset` has exclusive parameters due to caching: `keep_in_memory`, `load_from_cache_file`, `indices_cache_file_name`, `writer_batch_size`, `new_fingerprint`.
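For intuition, the approximate shuffling that `buffer_size` controls can be sketched like this: keep a fixed-size buffer and, once it is full, swap each incoming example with a randomly chosen buffered one. This is a toy illustration of the general technique, not the library implementation:

```python
import random

def buffer_shuffle(iterable, buffer_size, seed=None):
    rng = random.Random(seed)
    buffer = []
    for example in iterable:
        if len(buffer) < buffer_size:
            buffer.append(example)
        else:
            # Yield a random buffered example and keep the incoming one.
            i = rng.randrange(buffer_size)
            buffer[i], example = example, buffer[i]
            yield example
    # Flush the remaining buffer in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffer_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)  # a permutation of 0..9, only approximately shuffled
```

A larger `buffer_size` gives a better approximation of a full shuffle at the cost of memory; `buffer_size` equal to the dataset size is an exact shuffle.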
## The `.with_format()` method

`set_format`, `reset_format` and `formatted_as` are also missing.
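For context, `formatted_as` applies a format temporarily and restores the previous one on exit. The pattern can be sketched in plain Python (`SimpleDataset` is a hypothetical stand-in, not the library class):

```python
from contextlib import contextmanager

class SimpleDataset:
    def __init__(self):
        self.format = None  # e.g. None, "numpy", "torch"

    def set_format(self, fmt):
        self.format = fmt

    def reset_format(self):
        self.format = None

    @contextmanager
    def formatted_as(self, fmt):
        # Temporarily switch format, then restore the previous one.
        previous = self.format
        self.set_format(fmt)
        try:
            yield self
        finally:
            self.format = previous

ds = SimpleDataset()
ds.set_format("numpy")
with ds.formatted_as("torch"):
    assert ds.format == "torch"
assert ds.format == "numpy"  # restored on exit
```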
## Other methods

`IterableDataset` is missing these methods: `remove_columns`, `cast`, `cast_column`, `filter`, `rename_column`, `rename_columns`, `class_encode_column`, `flatten`, `prepare_for_task`, `train_test_split`, `shard`.

## Questions
I think it would be nice to be able to switch between streaming and regular datasets easily, without changing the processing code significantly.
IMO the minimum is to align the main processing methods.
It would mean breaking the current `IterableDataset.map` to have the same behavior as `Dataset.map` (add columns with `dict.update`), and adding multiprocessing as well as the missing parameters. DONE ✅

It would also mean implementing the missing methods: `cast`, `cast_column`, `filter`, `rename_column`, `rename_columns`, `class_encode_column`, `flatten`, `prepare_for_task`, `train_test_split`, `shard`. WIP 🟠
The main breaking change would be the change of behavior of `IterableDataset.map`, because currently it discards all the previous columns instead of keeping them. DONE ✅

I agree the simplest would be to have the exact same methods for both `Dataset` and `IterableDataset`. However, this is probably not a good idea, because it would prevent users from getting the best benefits of each. That's why we can keep some aspects of regular datasets as they are:
We could have a completely aligned `map` method if both methods were lazy by default, but this is a very big breaking change, so I'm not sure we can consider doing that. For information, TFDS does lazy `map` by default and has an additional `.cache()` method.
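To illustrate the eager/lazy distinction discussed above, a toy sketch (not the library code): an eager `map` runs the function and materializes results immediately, while a lazy `map` only records the function and applies it at iteration time:

```python
class EagerList:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Eager: fn runs now and results are materialized (like Dataset.map).
        return EagerList(fn(x) for x in self.data)

class LazyIterable:
    def __init__(self, data, fns=()):
        self.data = data
        self.fns = list(fns)

    def map(self, fn):
        # Lazy: just remember fn for later (like IterableDataset.map).
        return LazyIterable(self.data, self.fns + [fn])

    def __iter__(self):
        for x in self.data:
            for fn in self.fns:
                x = fn(x)
            yield x

calls = []
def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2

lazy = LazyIterable([1, 2, 3]).map(double)
assert calls == []             # nothing computed yet
assert list(lazy) == [2, 4, 6]  # computed only on iteration
```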
## Opinions?

I'd love to gather some opinions about this here. If the two APIs were more aligned, it would be awesome for the examples in `transformers`, and it would create a satisfactory experience for users who want to switch from one mode to the other.

cc @mariosasko @albertvillanova @thomwolf @patrickvonplaten @sgugger