Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support max_concurrency in upload_blob and download_blob operations #420

Merged
merged 2 commits into from
Aug 5, 2023

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Aug 3, 2023

  • Exposes max_concurrency=None kwarg in async fs methods that use the azure SDK upload_blob and download_blob (which azure then uses to parallelize async chunk uploads/downloads)
  • Adds fs level AzureBlobFileSystem.max_concurrency attribute which is used whenever method-level max_concurrency is not set
    • Defaults to fsspec.asyn._get_batch_size()
  • For fs methods that support multiple file batching, max_concurrency defaults to 1 for the individual file upload/download.
    • So by default batch_size=... is used to parallelize uploads/downloads at the file level (in something like fs._get() or fs._put()), and no additional parallelization is done for chunks within each file.
    • To get both file level and chunk level parallelization the user can set both batch_size and max_concurrency. i.e. fs.get(path, batch_size=4, max_concurrency=2) would download up to 4 files at a time, and up to 2 chunks at a time within each file, giving an overall concurrency of up to 8 async download coroutines being run in the loop at a time.

Closes #268 (and supercedes the changes in the concurrent_io branch)

This PR is incompatible with #329 (but there was discussion in that PR regarding changing the name of the parameter used to something other than max_concurrency since it conflicts with the the azure SDK parameter)

account_name=storage.account_name, connection_string=CONN_STR
account_name=storage.account_name,
connection_string=CONN_STR,
max_concurrency=1,
Copy link
Contributor Author

@pmrowla pmrowla Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concurrency cannot be used for this test, otherwise storage.insert_time will be the timestamp of the first completed chunk and creation_time will be the timestamp of the finished operation (after all chunks are uploaded).

@efiop
Copy link
Member

efiop commented Aug 4, 2023

@hayesgb @TomAugspurger Hey folks. Could you take a look, please? Just want to make sure you are fine with this.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just one question but feel free to merge.

adlfs/spec.py Show resolved Hide resolved
@efiop
Copy link
Member

efiop commented Aug 4, 2023

Btw, for the record, some comparisons to show how much this patch speeds things up for us iterative/dvc-azure#54 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parallel download support
3 participants