Calling Dataset.add_external_files(source_url) with a list of files will eventually fail due to too many requests to external storage. #813

Closed
freddessert opened this issue Nov 1, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@freddessert
Contributor

freddessert commented Nov 1, 2022

Describe the bug

Our current setup for adding external files includes a lot of S3 links, and we use GCS for our dataset storage. If we pass more than ~40 files as a list to add_external_files, the call eventually fails due to too many requests to update state.json in GCS. As with the existing implementation for adding a folder, adding a list should only need to update state.json once.
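
To illustrate the difference being requested, here is a minimal sketch that counts state writes under the two strategies. It is not ClearML's actual implementation: CountingStateStore, add_links_per_item, and add_links_batched are hypothetical names, and the in-memory store only stands in for the remote state.json object.

```python
class CountingStateStore:
    """Hypothetical in-memory stand-in for the remote object holding state.json."""

    def __init__(self):
        self.writes = 0
        self.state = None

    def write_state(self, links):
        # Each call models one update request to external storage.
        self.writes += 1
        self.state = list(links)


def add_links_per_item(store, urls):
    """Persist the state after every URL: one storage request per link."""
    links = []
    for url in urls:
        links.append(url)
        store.write_state(links)
    return links


def add_links_batched(store, urls):
    """Collect all URLs first, then persist the state once."""
    links = list(urls)
    store.write_state(links)
    return links


if __name__ == "__main__":
    example_urls = [f"s3://example-bucket/img_{i}.jpg" for i in range(100)]
    per_item, batched = CountingStateStore(), CountingStateStore()
    add_links_per_item(per_item, example_urls)
    add_links_batched(batched, example_urls)
    print(per_item.writes, batched.writes)
```

With the per-item strategy the number of storage requests grows with the list, which is the pattern that would run into a rate limit on a single object. The batched strategy issues one request regardless of list size.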

To reproduce

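from clearml import Dataset

dataset = Dataset.get(dataset_id="<dataset_id>")
dataset_local_path = "/bulk_images/"
urls = read_urls_from_csv()  # User-defined helper; returns a list of over 10k S3 URLs.

# Passing the whole list in a single call fails after roughly the first 40 files.
dataset.add_external_files(source_url=urls, dataset_path=dataset_local_path)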

Expected behaviour

The whole list of URLs should have been added to the dataset, not just the first ~40.

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.7.2
  • ClearML Server Version: WebApp: 1.6.0-213 • Server: 1.6.0-213 • API: 2.20
  • Python Version: 3.8.10
  • OS: Linux

Related Discussion

Slack thread: https://clearml.slack.com/archives/CTK20V944/p1666879476962449

@freddessert freddessert added the bug Something isn't working label Nov 1, 2022
@jkhenning
Member

Hi @freddessert,

Thanks for this report, we're working on it 🙂

clearml-bot pushed a commit that referenced this issue Nov 9, 2022
@pollfly
Contributor

pollfly commented Nov 14, 2022

Hey @freddessert! v1.8.0 is now out. It limits the number of _serialize requests made when adding a list of links with the add_external_files() method.

@freddessert
Contributor Author

Hi @pollfly, I am not sure exactly how to use it. I couldn't find a reference to it in the source code to try it out.

@jkhenning
Member

Hi @freddessert,

Just upgrade to ClearML SDK v1.8.0 and try again. It should work, I think.
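
A quick way to confirm that the installed SDK is new enough is sketched below. It assumes the package is installed under the distribution name clearml and that the version is a plain dotted number such as 1.8.0 (pre-release suffixes are not handled). The helper names parse_version, is_at_least, and clearml_has_fix are hypothetical and not part of ClearML.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Parse a plain dotted version such as '1.8.0' into a tuple of ints.

    Assumes purely numeric components; shorter versions are padded with zeros.
    """
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Return True if the installed version is greater than or equal to minimum."""
    return parse_version(installed) >= parse_version(minimum)


def clearml_has_fix(minimum="1.8.0"):
    """Check the installed clearml distribution against the minimum version."""
    try:
        return is_at_least(version("clearml"), minimum)
    except PackageNotFoundError:
        # clearml is not installed in this environment.
        return False
```

Calling clearml_has_fix() before running the reproduction script shows whether the environment still needs upgrading.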

@freddessert
Contributor Author

Hi @jkhenning, I think the fix might've introduced a bug 😬 #845

@jkhenning
Member

Closing this issue as v1.9.3 is already out (solving both issues, I believe). Please reopen if it's still relevant.
