
fix(job_attachments): use TransferManager for upload and download #191

Merged: 1 commit into mainline on Feb 29, 2024

Conversation

@gahyusuh (Contributor) commented Feb 27, 2024

What was the problem/requirement? (What/Why)

  • Originally, Job Attachments' upload was implemented by manually managing multipart uploads (calling create_multipart_upload, upload_part, complete_multipart_upload, and so on). These low-level APIs were used primarily to meet the requirement of being able to cancel the upload of a single, potentially large, file partway through.
    However, this approach introduced unnecessary complexity by directly handling multipart upload operations. It was suggested that using a TransferManager could offer a better solution while still supporting the cancellation of uploads midway.
  • We were seeing urllib3.connectionpool "Connection pool is full" warning messages when uploading job attachments.
  • The interruption/cancellation of downloads (input syncing on workers) was not very responsive. Cancellation took far too long to complete, especially when the number of files was large (over 1,000).

What was the solution? (How)

  • Refactored the upload implementation to leverage boto3's TransferManager. This change allows us to maintain the capability to cancel file uploads midway while simplifying the upload process by replacing the low-level multipart upload management.

    • Performance tests compared the old approach with the new TransferManager-based approach:
      • Upload speeds showed little difference across various file sizes and quantities.
      • There was one notable improvement, though: when a job bundle consists of a large number of small files (under 8 MB, the default chunk size for multipart upload), the new approach performed notably better.
  • To address the "connection pool is full" warning messages showing up during job submissions, I increased the S3 client's default max_pool_connections from 10 to 50. This adjustment helps prevent exceeding the pool size limit when uploading or downloading files concurrently. I also made it configurable through the client's config file; users can now set this max_pool_connections value with:

    [settings]
    s3_max_pool_connections = 50
    

    in the config file.
    The number of thread workers (max_workers for concurrent.futures.ThreadPoolExecutor()) for upload and download is now determined as follows:

    • the number of thread workers for upload = max_pool_connections / min(k, S3_UPLOAD_MAX_CONCURRENCY) where k is small_file_threshold_multiplier
    • the number of thread workers for download = max_pool_connections / S3_DOWNLOAD_MAX_CONCURRENCY
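The config lookup and worker-count derivation described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the values of S3_UPLOAD_MAX_CONCURRENCY, S3_DOWNLOAD_MAX_CONCURRENCY, and the multiplier k are placeholders, since the PR description does not state them.

```python
import configparser

# Illustrative placeholders; the real constants live in the job_attachments
# package and are not spelled out in this PR description.
S3_UPLOAD_MAX_CONCURRENCY = 10      # assumed per-file upload concurrency cap
S3_DOWNLOAD_MAX_CONCURRENCY = 8     # assumed per-file download concurrency cap
SMALL_FILE_THRESHOLD_MULTIPLIER = 2  # the "k" from the description (assumed)


def load_max_pool_connections(config_text: str, default: int = 50) -> int:
    """Read s3_max_pool_connections from an INI-style [settings] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getint("settings", "s3_max_pool_connections", fallback=default)


def num_upload_workers(max_pool_connections: int) -> int:
    # thread workers for upload = max_pool_connections / min(k, S3_UPLOAD_MAX_CONCURRENCY)
    k = SMALL_FILE_THRESHOLD_MULTIPLIER
    return max(1, max_pool_connections // min(k, S3_UPLOAD_MAX_CONCURRENCY))


def num_download_workers(max_pool_connections: int) -> int:
    # thread workers for download = max_pool_connections / S3_DOWNLOAD_MAX_CONCURRENCY
    return max(1, max_pool_connections // S3_DOWNLOAD_MAX_CONCURRENCY)
```

The floor at 1 worker is a defensive choice for small max_pool_connections values; the actual implementation may handle that differently.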

What is the impact of this change?

This refactoring introduces several improvements:

  • Simplifies the upload implementation by using TransferManager. Maintains or improves upload performance, especially in scenarios with many small files.
  • Reduces the time to cancel downloads, particularly for a large number of files, improving the user experience during long-running downloads.
  • Resolves the issue of "connection pool is full" warnings by adjusting the S3 client's max pool connections size.
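To illustrate the small-file point above: with the default 8 MiB multipart chunk size, the number of parts per file works out as below. This mirrors the part-count arithmetic visible in the review diff later in this thread; it is a standalone sketch, not the package's code.

```python
import math

DEFAULT_CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, the default multipart chunk size


def num_parts(file_size: int, chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    # complete_multipart_upload rejects an empty parts list, so floor at 1
    return max(1, int(math.ceil(file_size / float(chunk_size))))
```

Every file under 8 MiB is a single part, so a bundle with thousands of small files is dominated by per-request overhead rather than chunking, which is where delegating to TransferManager helps most.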

How was this change tested?

  • Made sure that all unit tests passed by running hatch run test
  • Made sure that the Job Attachment integ tests passed by running hatch run integ:test
  • The change was tested by performing job submissions with various combinations of file sizes and amounts. I compared upload speeds between the old approach and the new TransferManager-based approach, revealing minimal performance differences with a significant improvement in handling a large number of small files.
  • Confirmed that "connection pool is full" warnings are not showing up during concurrent file uploads and downloads.
  • Download cancellation was tested by running scripted_tests/download_cancel_test.py. With a job bundle containing a large number of files, it showed a drastic reduction in cancellation times: from approximately 8-10 minutes down to 1-2 minutes for 5,000 files.
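The responsiveness gain comes from tasks observing a cancellation signal instead of running to completion. Below is a minimal stdlib sketch of that pattern; the actual implementation uses boto3's TransferManager and its own cancellation handling, and all names here are illustrative.

```python
import concurrent.futures
import threading


def fake_download(path: str, cancel_event: threading.Event):
    """Stand-in for a single-file download; a real task would stream bytes."""
    if cancel_event.is_set():
        return None  # skip the work as soon as cancellation is requested
    return path


def download_all(paths, cancel_event, max_workers=4):
    """Run downloads in a thread pool, collecting only completed files."""
    downloaded = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_download, p, cancel_event) for p in paths]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            if result is not None:
                downloaded.append(result)
    return downloaded
```

Because each queued task checks the event before doing any work, a cancellation request drains the remaining queue almost immediately instead of waiting for every in-flight file, which matches the behavior the scripted test measures.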

Was this change documented?

No.

Is this a breaking change?

No.

@gahyusuh gahyusuh marked this pull request as ready for review February 27, 2024 21:49
@gahyusuh gahyusuh requested a review from a team as a code owner February 27, 2024 21:49
mwiebe previously approved these changes Feb 27, 2024

@mwiebe (Contributor) left a comment:

Looks good! Just a few suggestions for improvement.

src/deadline/job_attachments/_aws/aws_config.py (review thread, resolved)
# and complete_multipart_upload throws an error if parts is empty.
num_parts = max(1, int(math.ceil(file_size / float(chunk_size))))
transfer_kwargs = {
"preferred_transfer_client": "auto", # "auto" enables CRT-based client
Reviewer comment (Contributor):

I see that this supports a parameter called max_bandwidth, that will be super-useful when wanting to share this connection on an interactive workstation with other things. Would be great to support this as an option in the deadline cloud config file too! (Maybe as a follow-up)

src/deadline/job_attachments/_aws/aws_config.py (review thread, resolved)
src/deadline/job_attachments/download.py (3 review threads, resolved)
src/deadline/job_attachments/upload.py (2 review threads, 1 resolved)
# if we thread too aggressively on slower internet connections. So for now let's set it to 5,
# which would be the number of threads with one processor.
- with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
+ with concurrent.futures.ThreadPoolExecutor(max_workers=num_download_workers) as executor:
Reviewer comment (Contributor):

Can we remove this entirely with the transfer manager implementation? Does it provide a parallelism parameter?

@@ -515,8 +529,7 @@ def _download_files_parallel(
     (file_bytes, local_file_name) = future.result()
     if local_file_name:
         downloaded_file_names.append(str(local_file_name.resolve()))
-    if file_bytes == 0 and progress_tracker:
-        # If the file size is 0, the download progress should be tracked by the number of files.
+    if progress_tracker:
         progress_tracker.increase_processed(1, 0)
Reviewer comment (Contributor):

Should that 0 be file_bytes?

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
@gahyusuh gahyusuh merged commit 41b5964 into mainline Feb 29, 2024
18 checks passed
@gahyusuh gahyusuh deleted the gahyusuh/crt_upload branch February 29, 2024 20:01