feat(JobAttachments)!: Add 'last seen on S3' cache #172

Merged: 1 commit merged into mainline from marofke/upload-cache on Feb 8, 2024

Conversation

@marofke (Contributor) commented Feb 5, 2024

What was the problem/requirement? (What/Why)

We currently do a list_objects_v2 S3 API call to check the existence of files in the Job Attachments content-addressed storage. We originally added this as the paginated call ended up being faster than calling HeadObject for large submissions (we arbitrarily picked 100 files as the threshold to switch over to list objects at the time). However, this leads to terrible performance once the bucket has a large number of files, since list_objects must iterate through the entire bucket to find the files we're looking for (although we can sometimes get lucky and find our files before iterating the whole bucket, if the submission doesn't contain any new files).
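
For context, here's a minimal sketch of the two existence-check strategies, using boto3. Bucket, prefix, and key names are placeholders, not the actual Job Attachments layout:

```python
# Illustrative only: per-file HeadObject vs. paginated ListObjectsV2.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def exists_via_head_object(bucket: str, key: str) -> bool:
    """One HeadObject request per file; cost scales with the number of files submitted."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "404":
            return False
        raise


def keys_under_prefix(bucket: str, prefix: str) -> set:
    """Paginated ListObjectsV2; cost scales with the number of objects under the prefix."""
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    return keys
```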

Also, we don't have a clear error message when a customer adds their own customer-managed KMS key to the bucket's encryption but forgets to grant KMS permissions to the Queue role.

What was the solution? (How)

  1. Removed the list_objects_v2 API call, since it's a poor fit for our use case
  2. Similar to our Hash Cache, added a client-side SQLite DB that caches when we've found objects in the S3 content-addressed storage
    a. This caused me to restructure our code a bit: I made a caches module, moved all the hash cache code there, and added a CacheDB base class to group the common code
    b. I then added the S3CheckCache (I couldn't come up with a better name), which simply stores the S3 object key (including the bucket) and the time we last saw it in S3
    c. When we go to upload files, we check if the S3 key exists in the S3 check cache; if the entry hasn't expired (I set the expiry to 30 days for now), we assume the file still exists in S3 (see the sketch after this list)
    d. I also updated our 404 S3 errors, since customers could get in a bit of a pickle if their local S3 cache says a file is in S3 but someone has gone in and deleted it. For now, customers would just have to delete their local S3 cache and resubmit, which rebuilds the cache.
  3. When we get 403 errors from S3, check whether they're KMS-related and print a more specific error message (a sketch of this check also follows the list)
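
Here's a minimal sketch of the check described in 2b/2c. The class name, table layout, and helper API are illustrative only, not the actual S3CheckCache implementation:

```python
# Sketch of an "S3 check cache": an SQLite table keyed by the full S3 location
# (bucket + key) that records when we last saw the object in S3.
# Names and the exact schema are illustrative, not the PR's real code.
import sqlite3
import time

S3_CHECK_CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # entries older than ~30 days count as expired


class S3CheckCacheSketch:
    def __init__(self, db_path: str) -> None:
        self._con = sqlite3.connect(db_path)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS s3_check_cache "
            "(s3_key TEXT PRIMARY KEY, last_seen_time REAL)"
        )

    def seen_recently(self, s3_key: str) -> bool:
        """Return True if we've confirmed this object exists in S3 within the TTL."""
        row = self._con.execute(
            "SELECT last_seen_time FROM s3_check_cache WHERE s3_key = ?", (s3_key,)
        ).fetchone()
        return row is not None and (time.time() - row[0]) < S3_CHECK_CACHE_TTL_SECONDS

    def mark_seen(self, s3_key: str) -> None:
        """Record that the object was just confirmed to exist in S3."""
        self._con.execute(
            "INSERT OR REPLACE INTO s3_check_cache (s3_key, last_seen_time) VALUES (?, ?)",
            (s3_key, time.time()),
        )
        self._con.commit()


# Upload-time usage (paths and helpers are hypothetical):
# cache = S3CheckCacheSketch("/path/to/s3_check_cache.db")
# if not cache.seen_recently(f"{bucket}/{key}"):
#     upload_file(bucket, key)  # hypothetical upload helper
#     cache.mark_seen(f"{bucket}/{key}")
```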

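And a sketch of the 403 handling in item 3: inspecting a botocore ClientError to guess whether an AccessDenied is caused by missing KMS permissions. The exact condition and message wording used in the PR may differ.

```python
# Sketch only: checking whether a 403 from S3 looks KMS-related before surfacing
# a more specific error. The detection heuristic here is an assumption.
from botocore.exceptions import ClientError


def raise_with_kms_hint(exc: ClientError, bucket: str) -> None:
    error = exc.response.get("Error", {})
    if error.get("Code") == "AccessDenied" and "KMS" in error.get("Message", "").upper():
        raise RuntimeError(
            f"Access denied while writing to bucket '{bucket}'. The bucket appears to use a "
            "customer-managed KMS key; make sure the Queue role has kms:GenerateDataKey and "
            "kms:Decrypt permissions on that key."
        ) from exc
    raise exc
```
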
What is the impact of this change?

We should have really speedy resubmissions for files that exist in S3 already. This should also save customers money on S3 API calls.

How was this change tested?

  • wrote new unit tests, ran the integration tests
  • did multiple submissions from the CLI, confirmed that submission time was sped up in subsequent submissions
  • added KMS encryption to my bucket, tried uploading without permissions to the key, verified I got the new error message
  • did a job submission and confirmed it uploaded; resubmitted and confirmed in the debug logs that it found an entry in the S3 cache; then changed the Job Attachments root prefix in the Queue settings and resubmitted, confirming that files were uploaded again (i.e. the cache only applies to each specific root prefix)

Was this change documented?

I plan on adding some 'troubleshooting 404 errors for Jobs' info to our docs to highlight that these errors can be caused by the S3 cache being out of date, along with steps to manually delete the cache.

Is this a breaking change?

Yes

  • mostly, this is a significant change in behaviour (i.e. the addition of the S3 check cache) and warrants a new version
  • the S3AssetUploader.upload_assets public function was changed to accommodate the S3 cache path. Customers that use custom Python scripts that use the S3AssetUploader may need to update their scripts to specify the cache path
  • the list_object_threshold was removed from the CLI config

@marofke force-pushed the marofke/upload-cache branch 6 times, most recently from eb3afb7 to 6dee14c, on February 6, 2024 18:09
@marofke marked this pull request as ready for review February 6, 2024 18:12
@marofke requested a review from a team as a code owner February 6, 2024 18:12
@gahyusuh (Contributor) left a comment


Thank you for your work! I made a few Nit comments.

Also, there is just one thing I would like to bring up: should we consider adding a mechanism to handle the case where the S3 cache has a mismatch with the actual S3 bucket?
We might be able to allow customers to skip the cache, or force-rebuild it, if they want. E.g. if they delete an object in their bucket and then submit a job, the job will fail due to the missing object.

Resolved review comments on:
  • test/unit/deadline_job_attachments/caches/test_caches.py
  • src/deadline/job_attachments/download.py
  • src/deadline/client/api/_submit_job_bundle.py
@marofke (Contributor, Author) commented Feb 7, 2024

Also, there is just one thing I would like to bring up: should we consider adding a mechanism to handle the case where the S3 cache has a mismatch with the actual S3 bucket? We might be able to allow customers to skip the cache, or force-rebuild it, if they want. E.g. if they delete an object in their bucket and then submit a job, the job will fail due to the missing object.

Yeah, currently I just added the error message telling customers to delete the cache manually, but you're right that we should expose some other option to clear / skip the cache so customers don't have to go digging through folders. I am a bit concerned about the additional scope that would add to this PR, so maybe we can add that as a followup?

Exactly what the UX would look like is a little unclear to me right now. For the CLI args, I think it's pretty straightforward: we could add a --rebuild-cache flag which deletes the cache files and makes new ones. For the GUI, I assume we'd put this in the Job Attachments tab, but that could be easily missed, and we don't want to preserve a checkbox setting that always rebuilds the cache...
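
For illustration only, here's a sketch of what that CLI flag could look like. This is hypothetical: the flag, the cache location, and the wiring are not part of this PR, and the real deadline CLI is built differently.

```python
# Hypothetical sketch of a --rebuild-cache flag; not part of this PR.
import argparse
from pathlib import Path


def clear_caches(cache_dir: Path) -> None:
    """Delete the local cache DBs (hash cache and S3 check cache) so they get rebuilt."""
    for db_file in cache_dir.glob("*.db"):
        db_file.unlink()


parser = argparse.ArgumentParser()
parser.add_argument(
    "--rebuild-cache",
    action="store_true",
    help="Delete the local hash/S3 check caches before submitting, forcing a rebuild",
)
args = parser.parse_args()
if args.rebuild_cache:
    clear_caches(Path.home() / ".deadline" / "cache")  # cache location is a placeholder
```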

@gahyusuh (Contributor) left a comment


Looks great to me, thanks!

Resolved review comments on:
  • src/deadline/job_attachments/caches/hash_cache.py
  • src/deadline/job_attachments/caches/s3_check_cache.py
  • src/deadline/job_attachments/download.py
  • test/unit/deadline_job_attachments/caches/test_caches.py
Breaking changes:

- mostly, this is a significant change in behaviour
  (i.e. the addition of the S3 check cache) and warrants
  a new version
- the S3AssetUploader.upload_assets public function was changed
  to accommodate the S3 cache path. Customers that use custom
  Python scripts that use the S3AssetUploader may need to update
  their scripts to specify the cache path
- list_object_threshold was removed from the CLI config

Signed-off-by: Caden Marofke <marofke@amazon.com>
@marofke merged commit 99ebaea into mainline Feb 8, 2024
18 checks passed
YutongLi291 pushed a commit that referenced this pull request Feb 14, 2024
@marofke deleted the marofke/upload-cache branch February 21, 2024 23:34