feat(JobAttachments)!: Add 'last seen on S3' cache #172

Merged: 1 commit merged into mainline from marofke/upload-cache on Feb 8, 2024

Conversation

@marofke (Contributor) commented Feb 5, 2024

What was the problem/requirement? (What/Why)

We currently do a list_objects_v2 S3 API call to check the existence of files in the Job Attachments content-addressed storage. We originally added this as the paginated call ended up being faster than calling HeadObject for large submissions (we arbitrarily picked 100 files as the threshold to switch over to list objects at the time). However, this leads to terrible performance once the bucket has a large number of files, since list_objects must iterate through the entire bucket to find the files we're looking for (although we can sometimes get lucky and find our files before iterating the whole bucket, if the submission doesn't contain any new files).
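
For context, here's a minimal sketch of the two existence-check strategies, using boto3. Bucket, prefix, and key names are placeholders, not the actual Job Attachments layout:

```python
# Illustrative only: per-file HeadObject vs. paginated ListObjectsV2.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def exists_via_head_object(bucket: str, key: str) -> bool:
    """One HeadObject request per file; cost scales with the number of files submitted."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "404":
            return False
        raise


def keys_under_prefix(bucket: str, prefix: str) -> set:
    """Paginated ListObjectsV2; cost scales with the number of objects under the prefix."""
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    return keys
```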

Also, we don't have a clear error message when a customer adds their own customer-managed KMS key to the bucket's encryption but forgets to grant KMS permissions to the Queue role.

What was the solution? (How)

  1. Removed the list_objects_v2 API call, since it's a poor fit for our use case
  2. Similar to our Hash Cache, added a client-side SQLite DB that caches when we've found objects in the S3 content-addressed storage
    a. This caused me to restructure our code a bit: I made a caches module, moved all the hash cache code there, and added a CacheDB base class to group the common code
    b. I then added the S3CheckCache (I couldn't come up with a better name), which simply stores the S3 object key (including the bucket) and the time we last saw it in S3
    c. When we go to upload files, we check if the S3 key exists in the S3 check cache; if the entry hasn't expired (I set the expiry to 30 days for now), we assume the file still exists in S3 (see the sketch after this list)
    d. I also updated our 404 S3 errors, since customers could get in a bit of a pickle if their local S3 cache says a file is in S3 but someone has gone in and deleted it. For now, customers would just have to delete their local S3 cache and resubmit, which rebuilds the cache.
  3. When we get 403 errors from S3, check whether they're KMS-related and print a more specific error message (a sketch of this check also follows the list)
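
Here's a minimal sketch of the check described in 2b/2c. The class name, table layout, and helper API are illustrative only, not the actual S3CheckCache implementation:

```python
# Sketch of an "S3 check cache": an SQLite table keyed by the full S3 location
# (bucket + key) that records when we last saw the object in S3.
# Names and the exact schema are illustrative, not the PR's real code.
import sqlite3
import time

S3_CHECK_CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # entries older than ~30 days count as expired


class S3CheckCacheSketch:
    def __init__(self, db_path: str) -> None:
        self._con = sqlite3.connect(db_path)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS s3_check_cache "
            "(s3_key TEXT PRIMARY KEY, last_seen_time REAL)"
        )

    def seen_recently(self, s3_key: str) -> bool:
        """Return True if we've confirmed this object exists in S3 within the TTL."""
        row = self._con.execute(
            "SELECT last_seen_time FROM s3_check_cache WHERE s3_key = ?", (s3_key,)
        ).fetchone()
        return row is not None and (time.time() - row[0]) < S3_CHECK_CACHE_TTL_SECONDS

    def mark_seen(self, s3_key: str) -> None:
        """Record that the object was just confirmed to exist in S3."""
        self._con.execute(
            "INSERT OR REPLACE INTO s3_check_cache (s3_key, last_seen_time) VALUES (?, ?)",
            (s3_key, time.time()),
        )
        self._con.commit()


# Upload-time usage (paths and helpers are hypothetical):
# cache = S3CheckCacheSketch("/path/to/s3_check_cache.db")
# if not cache.seen_recently(f"{bucket}/{key}"):
#     upload_file(bucket, key)  # hypothetical upload helper
#     cache.mark_seen(f"{bucket}/{key}")
```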

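And a sketch of the 403 handling in item 3: inspecting a botocore ClientError to guess whether an AccessDenied is caused by missing KMS permissions. The exact condition and message wording used in the PR may differ.

```python
# Sketch only: checking whether a 403 from S3 looks KMS-related before surfacing
# a more specific error. The detection heuristic here is an assumption.
from botocore.exceptions import ClientError


def raise_with_kms_hint(exc: ClientError, bucket: str) -> None:
    error = exc.response.get("Error", {})
    if error.get("Code") == "AccessDenied" and "KMS" in error.get("Message", "").upper():
        raise RuntimeError(
            f"Access denied while writing to bucket '{bucket}'. The bucket appears to use a "
            "customer-managed KMS key; make sure the Queue role has kms:GenerateDataKey and "
            "kms:Decrypt permissions on that key."
        ) from exc
    raise exc
```
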
What is the impact of this change?

We should have really speedy resubmissions for files that exist in S3 already. This should also save customers money on S3 API calls.

How was this change tested?

  • wrote new unit tests, ran the integration tests
  • did multiple submissions from the CLI, confirmed that submission time was sped up in subsequent submissions
  • added KMS encryption to my bucket, tried uploading without permissions to the key, verified I got the new error message
  • did a job submission and confirmed it uploaded; resubmitted and confirmed in the debug logs that it found an entry in the S3 cache; then changed the Job Attachments root prefix in the Queue settings and resubmitted, confirming that files were uploaded again (i.e. the cache only applies to each specific root prefix)

Was this change documented?

I plan on adding some 'troubleshooting 404 errors for Jobs' info to our docs to highlight that these errors can be caused by the S3 cache being out of date, along with steps to manually delete the cache.

Is this a breaking change?

Yes

  • mostly, this is a significant change in behaviour (i.e. the addition of the S3 check cache) and warrants a new version
  • the S3AssetUploader.upload_assets public function was changed to accommodate the S3 cache path. Customers that use custom Python scripts that use the S3AssetUploader may need to update their scripts to specify the cache path
  • the list_object_threshold was removed from the CLI config

@marofke force-pushed the marofke/upload-cache branch 6 times, most recently from eb3afb7 to 6dee14c, on February 6, 2024 18:09
@marofke marked this pull request as ready for review February 6, 2024 18:12
@marofke requested a review from a team as a code owner February 6, 2024 18:12
@gahyusuh (Contributor) left a comment


Thank you for your work! I made a few Nit comments.

Also, there is just one thing I would like to bring up: should we consider adding a mechanism to handle the case where the S3 cache has a mismatch with the actual S3 bucket?
We might be able to allow customers to skip the cache, or force-rebuild it, if they want. E.g. if they delete an object in their bucket and then submit a job, the job will fail due to the missing object.

Resolved review comments on:
  • test/unit/deadline_job_attachments/caches/test_caches.py
  • src/deadline/job_attachments/download.py
  • src/deadline/client/api/_submit_job_bundle.py
@marofke (Contributor, Author) commented Feb 7, 2024

Also, there is just one thing I would like to bring up: should we consider adding a mechanism to handle the case where the S3 cache has a mismatch with the actual S3 bucket? We might be able to allow customers to skip the cache, or force-rebuild it, if they want. E.g. if they delete an object in their bucket and then submit a job, the job will fail due to the missing object.

Yeah, currently I just added the error message telling customers to delete the cache manually, but you're right that we should expose some other option to clear / skip the cache so customers don't have to go digging through folders. I am a bit concerned about the additional scope that would add to this PR, so maybe we can add that as a followup?

Exactly what the UX would look like is a little unclear to me right now. For the CLI args, I think it's pretty straightforward: we could add a --rebuild-cache flag which deletes the cache files and makes new ones. For the GUI, I assume we'd put this in the Job Attachments tab, but that could be easily missed, and we don't want to preserve a checkbox setting that always rebuilds the cache...
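
For illustration only, here's a sketch of what that CLI flag could look like. This is hypothetical: the flag, the cache location, and the wiring are not part of this PR, and the real deadline CLI is built differently.

```python
# Hypothetical sketch of a --rebuild-cache flag; not part of this PR.
import argparse
from pathlib import Path


def clear_caches(cache_dir: Path) -> None:
    """Delete the local cache DBs (hash cache and S3 check cache) so they get rebuilt."""
    for db_file in cache_dir.glob("*.db"):
        db_file.unlink()


parser = argparse.ArgumentParser()
parser.add_argument(
    "--rebuild-cache",
    action="store_true",
    help="Delete the local hash/S3 check caches before submitting, forcing a rebuild",
)
args = parser.parse_args()
if args.rebuild_cache:
    clear_caches(Path.home() / ".deadline" / "cache")  # cache location is a placeholder
```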

@gahyusuh (Contributor) left a comment


Looks great to me, thanks!

Resolved review comments on:
  • src/deadline/job_attachments/caches/hash_cache.py
  • src/deadline/job_attachments/caches/s3_check_cache.py
  • src/deadline/job_attachments/download.py
  • test/unit/deadline_job_attachments/caches/test_caches.py
Breaking changes:

- mostly, this is a significant change in behaviour
  (i.e. the addition of the S3 check cache) and warrants
  a new version
- the S3AssetUploader.upload_assets public function was changed
  to accommodate the S3 cache path. Customers that use custom
  Python scripts that use the S3AssetUploader may need to update
  their scripts to specify the cache path
- list_object_threshold was removed from the CLI config

Signed-off-by: Caden Marofke <marofke@amazon.com>
@marofke merged commit 99ebaea into mainline Feb 8, 2024
18 checks passed
YutongLi291 pushed a commit that referenced this pull request Feb 14, 2024
@marofke deleted the marofke/upload-cache branch February 21, 2024 23:34