Following #4184, a dataset saved using save_to_disk currently doesn't contain the bytes of the image or audio files. Instead it only stores the paths to your local files.
Adding an embed_external_files parameter to save_to_disk and setting it to True by default would be kind of a breaking change, since some users will get bigger Arrow files after updating the lib, but the advantages are nice:
- the resulting dataset is self-contained, in case you want to delete your cache, for example, or share it with someone else
- users also upload these Arrow files to cloud storage via the fs parameter, and in this case they would expect to upload a self-contained dataset
- consistency with push_to_hub
This can be implemented at the same time as sharding for save_to_disk, for efficiency, and can reuse the helpers from push_to_hub to embed the external files.
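To make the idea concrete, here is a rough, library-free sketch of what embedding per shard could look like. It assumes each image or audio cell is a record with "bytes" and "path" fields; embed_external_file, shard and embed_in_shards are hypothetical names for illustration only, not the actual push_to_hub helpers, and keeping only the file name after embedding is an assumption too.

```python
import tempfile
from pathlib import Path


def embed_external_file(record: dict) -> dict:
    """Return a copy of a {"bytes", "path"} record with the file content inlined.

    Hypothetical helper: it illustrates the idea, not the library's code.
    """
    if record["bytes"] is not None or record["path"] is None:
        return dict(record)  # already self-contained, or nothing to read
    path = Path(record["path"])
    # Assumption: keep only the file name, so nothing points at a local directory.
    return {"bytes": path.read_bytes(), "path": path.name}


def shard(records: list, num_shards: int) -> list:
    """Split records into num_shards contiguous chunks of near-equal size."""
    size, extra = divmod(len(records), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + size + (1 if i < extra else 0)
        shards.append(records[start:end])
        start = end
    return shards


def embed_in_shards(records: list, num_shards: int):
    """Yield one embedded shard at a time, so a caller could write each one out in turn."""
    for chunk in shard(records, num_shards):
        yield [embed_external_file(r) for r in chunk]


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        p = Path(tmp) / f"image_{i}.bin"
        p.write_bytes(bytes([i]) * 4)  # stand-in for image or audio content
        paths.append(p)
    # This is what a path-only dataset effectively holds today.
    path_only = [{"bytes": None, "path": str(p)} for p in paths]
    embedded_shards = list(embed_in_shards(path_only, num_shards=2))

# The temporary files are deleted at this point; the embedded records still carry the data.
embedded = [r for chunk in embedded_shards for r in chunk]
```

After the temporary directory is removed, the path-only records point at files that no longer exist, while the embedded ones still hold their content. That is the self-contained property described above.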
cc @mariosasko