-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(job_attachments): use TransferManager for upload and download #191
Conversation
ce38fae
to
1f8f547
Compare
1f8f547
to
4d718b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just a few suggestions for improvement.
# and complete_multipart_upload throws an error if parts is empty. | ||
num_parts = max(1, int(math.ceil(file_size / float(chunk_size)))) | ||
transfer_kwargs = { | ||
"preferred_transfer_client": "auto", # "auto" enables CRT-based client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this supports a parameter called max_bandwidth
, that will be super-useful when wanting to share this connection on an interactive workstation with other things. Would be great to support this as an option in the deadline cloud config file too! (Maybe as a follow-up)
f26ec3a
to
579d131
Compare
# if we thread too aggressively on slower internet connections. So for now let's set it to 5, | ||
# which would the number of threads with one processor. | ||
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor: | ||
with concurrent.futures.ThreadPoolExecutor(max_workers=num_download_workers) as executor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this entirely with the transfer manager implementation? Does it provide a parallelism parameter?
@@ -515,8 +529,7 @@ def _download_files_parallel( | |||
(file_bytes, local_file_name) = future.result() | |||
if local_file_name: | |||
downloaded_file_names.append(str(local_file_name.resolve())) | |||
if file_bytes == 0 and progress_tracker: | |||
# If the file size is 0, the download progress should be tracked by the number of files. | |||
if progress_tracker: | |||
progress_tracker.increase_processed(1, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should that 0 be file_bytes
?
Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
579d131
to
ca49b0a
Compare
What was the problem/requirement? (What/Why)
create_multipart_upload
,upload_part
,complete_multipart_upload
, ...) This way of using low-level APIs was primarily to meet the requirement to be able to cancel the upload of a single, potentially large, file in the middle.However, this approach introduced unnecessary complexity by directly handling multipart upload operations. It was suggested that using a
TransferManager
could offer a better solution while still supporting the cancellation of uploads midway.urllib3.connectionpool:Connection pool is full
warnings messages when uploading job attachments.What was the solution? (How)
Refactored the upload implementation to leverage boto3's
TransferManager
. This change allows us to maintain the capability to cancel file uploads midway while simplifying the upload process by replacing the low-level multipart upload management.TransferManager
-based approachTo address the issue of "connection pool is full" warning messages showing up during job submissions, I increased the default
max_pool_connections
for the S3 client from 10 to 50. This adjustment helps prevent exceeding the max pool size limit when uploading or downloading files concurrently. Also, I made it configurable through the client'sconfig
file. Users can now configure this max_pool_connection value by setting:in the config file.
The number of thread workers (
max_workers
forconcurrent.futures.ThreadPoolExecutor()
) for upload and download are now determined as follows:max_pool_connections
/ min(k
,S3_UPLOAD_MAX_CONCURRENCY
) where k issmall_file_threshold_multiplier
max_pool_connections
/S3_DOWNLOAD_MAX_CONCURRENCY
What is the impact of this change?
This refactoring introduces several improvements:
TransferManager
. Maintains or improves upload performance, especially in scenarios with many small files.How was this change tested?
hatch run test
hatch run integ:test
Was this change documented?
No.
Is this a breaking change?
No.