Describe the bug
Our current setup for adding external files includes a lot of S3 links, and we use GCS for our dataset storage. If we add more than ~40 files as a list to add_external_files, it eventually fails due to too many requests to update state.json in GCS. As with the implementation for adding a folder, we should only have to update state.json once.
To reproduce
dataset = Dataset.get(dataset_id="<dataset_id>")
dataset_local_path = "/bulk_images/"
urls = read_urls_from_csv()  # A list of over 10k S3 URLs.
dataset.add_external_files(source_url=urls, dataset_path=dataset_local_path)
Expected behaviour
The whole list of URLs should have been added to the dataset, not just the first ~40.
Environment
Related Discussion
Slack thread: https://clearml.slack.com/archives/CTK20V944/p1666879476962449

Hey @freddessert! v1.8.0 is now out, supporting limiting the number of _serialize requests when adding a list of links with the add_external_files() method.
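For anyone landing here, a minimal sketch of what the bulk flow could look like on clearml >= 1.8.0. Note that read_urls_from_csv() is the reporter's own helper, and the upload()/finalize() calls at the end are assumptions about the surrounding workflow rather than part of the original report:

# Minimal sketch, assuming clearml >= 1.8.0 and a dataset that is still writable.
from clearml import Dataset

dataset = Dataset.get(dataset_id="<dataset_id>")
urls = read_urls_from_csv()  # the reporter's own helper; returns a list of S3 URLs

# Passing the whole list in a single call lets the SDK limit the number of
# _serialize requests, instead of re-writing state.json for every link.
dataset.add_external_files(source_url=urls, dataset_path="/bulk_images/")

# Assumed follow-up steps to round out the flow; may not match the original setup.
dataset.upload()
dataset.finalize()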