Calling Dataset.add_external_files(source_url) with a list of files will eventually fail due to too many requests to external storage. #813

freddessert · 2022-11-01T19:27:59Z

Describe the bug

Our current setup for adding external files includes many S3 links, and we use GCS for our dataset storage. If we pass more than ~40 files as a list to add_external_files, the call eventually fails because it makes too many requests to update state.json in GCS. As with the implementation that takes a folder, adding a list should only need to update state.json once.
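To make the difference concrete, here is a toy model of the two write patterns described above. It does not use ClearML or any storage client: the class `ToyDatasetState`, its methods, and the example bucket name are all hypothetical, and the counter only simulates uploads of a state file. It is a sketch of the idea, not of how the SDK is implemented.

```python
from typing import Iterable, List


class ToyDatasetState:
    """Hypothetical stand-in for a dataset whose state lives in one remote file."""

    def __init__(self) -> None:
        self.links: List[str] = []
        self.state_writes = 0  # counts simulated uploads of the state file

    def _write_state(self) -> None:
        # In a real setup this would be a request to remote storage.
        self.state_writes += 1

    def add_links_one_write_each(self, urls: Iterable[str]) -> None:
        # Pattern described in the report: one state update per link.
        for url in urls:
            self.links.append(url)
            self._write_state()

    def add_links_single_write(self, urls: Iterable[str]) -> None:
        # Requested behaviour: register every link, then update state once.
        before = len(self.links)
        self.links.extend(urls)
        if len(self.links) > before:
            self._write_state()


# Made-up URLs, for illustration only.
urls = [f"s3://example-bucket/img_{i}.png" for i in range(100)]

per_link = ToyDatasetState()
per_link.add_links_one_write_each(urls)

batched = ToyDatasetState()
batched.add_links_single_write(urls)

print(per_link.state_writes, batched.state_writes)  # prints "100 1"
```

With one write per link, the request count grows with the length of the list, which is how a large list can run into storage-side request limits on a single object. With a single write it stays constant.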

To reproduce

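from clearml import Dataset

dataset = Dataset.get(dataset_id="<dataset_id>")
dataset_local_path = "/bulk_images/"
urls = read_urls_from_csv()  # User-defined helper; returns a list of over 10k S3 URLs.

# Passing the whole list in a single call triggers the failure.
dataset.add_external_files(source_url=urls, dataset_path=dataset_local_path)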

Expected behaviour

The whole list of urls should've been added to the dataset, not just the first ~40.

Environment

Server type: self hosted
ClearML SDK Version: 1.7.2
ClearML Server Version: WebApp: 1.6.0-213 • Server: 1.6.0-213 • API: 2.20
Python Version: 3.8.10
OS: Linux

Related Discussion

Slack thread: https://clearml.slack.com/archives/CTK20V944/p1666879476962449


jkhenning · 2022-11-05T20:04:58Z

Hi @freddessert,

Thanks for this report, we're working on it 🙂


pollfly · 2022-11-14T05:09:58Z

Hey @freddessert! v1.8.0 is now out. It limits the number of _serialize requests made when adding a list of links with the add_external_files() method.

freddessert · 2022-11-14T19:06:55Z

Hi @pollfly, I'm not sure exactly how to use it. I couldn't find a reference to it in the source code to try it out.

jkhenning · 2022-11-14T21:28:57Z

Hi @freddessert,

Just upgrade to ClearML SDK v1.8.0 and try again; I think it should work.
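After upgrading (for example with `pip install -U clearml`), a quick way to confirm that the environment picked up the new SDK is to compare the installed version against 1.8.0. The sketch below uses only the standard library. The helper names `parse_release`, `installed_clearml`, and `has_fix` are made up for this example, and the comparison is deliberately simple: it only understands plain dotted numeric versions.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(text: str) -> Tuple[int, ...]:
    """Turn '1.8.0' into (1, 8, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_clearml() -> Optional[str]:
    """Return the installed clearml version string, or None if it is not installed."""
    try:
        return version("clearml")
    except PackageNotFoundError:
        return None


def has_fix(installed: Optional[str], minimum: str = "1.8.0") -> bool:
    """True if the installed version is at least the release mentioned in this thread."""
    return installed is not None and parse_release(installed) >= parse_release(minimum)


current = installed_clearml()
print(current, has_fix(current))
```

For the SDK version listed in the report (1.7.2), `has_fix` returns False.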

freddessert · 2022-12-06T00:16:02Z

Hi @jkhenning, I think the fix might've introduced a bug 😬 #845

jkhenning · 2023-03-15T13:10:37Z

Closing this issue as v1.9.3 is already out (solving both issues, I believe). Please reopen if it's still relevant.

freddessert added the bug label Nov 1, 2022

clearml-bot pushed a commit that referenced this issue Nov 9, 2022

Limit number of _serialize requests when adding list of links with `add_external_files()` (#813) · b793f2d

jkhenning closed this as completed Mar 15, 2023
