feat(JobAttachments)!: Add 'last seen on S3' cache #172
Conversation
Force-pushed eb3afb7 to 6dee14c
Force-pushed 6dee14c to 20a6760
Thank you for your work! I made a few nit comments.
Also, there is one thing I would like to bring up: should we consider adding a mechanism to handle the case where the S3 cache has a mismatch with the actual contents of the S3 bucket?
We might be able to allow customers to skip the cache, or force-rebuild it, if they want. E.g. if they delete an object in their bucket and then submit a job, the job will fail due to the missing object.
Force-pushed 20a6760 to 8ed5f69
Yeah, currently I just added an error message telling customers to delete the cache manually, but you're right that we should expose some other option to clear / skip the cache so customers don't have to go digging through folders. I am a bit concerned about the additional scope that would add to this PR, so maybe we can add that as a follow-up? Exactly what the UX would look like is a little unclear to me right now. For the CLI args, I think it's pretty straightforward; we could add a …
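As a rough illustration of the follow-up UX being discussed (not existing CLI options), a skip / force-rebuild switch might look something like this hypothetical click sketch; the flag names are made up for illustration:

```python
# Hypothetical follow-up UX -- these flags do not exist in the CLI today.
import click


@click.command(name="submit")
@click.option("--skip-s3-check-cache", is_flag=True,
              help="Ignore the local 'last seen on S3' cache and check S3 directly.")
@click.option("--rebuild-s3-check-cache", is_flag=True,
              help="Delete the local cache before submitting so it gets rebuilt.")
def submit(skip_s3_check_cache: bool, rebuild_s3_check_cache: bool) -> None:
    if rebuild_s3_check_cache:
        click.echo("Clearing the local S3 check cache before submission...")
    if skip_s3_check_cache:
        click.echo("Bypassing the local S3 check cache; verifying objects in S3.")
    # ... submission logic would go here ...
```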
Looks great to me, thanks!
Force-pushed 8ed5f69 to 3ca970a
Force-pushed 3ca970a to 9fe8dd0
Breaking changes:
- mostly, this is a significant change in behaviour (i.e. the addition of the S3 check cache) and warrants a new version
- the S3AssetUploader.upload_assets public function was changed to accommodate the S3 cache path. Customers that use custom Python scripts that use the S3AssetUploader may need to update their scripts to specify the cache path
- list_object_threshold was removed from the CLI config

Signed-off-by: Caden Marofke <marofke@amazon.com>
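For customers with custom submission scripts, the call-site change could look roughly like the sketch below. This is only an illustration: the exact S3AssetUploader.upload_assets signature and the cache-path parameter name used here are assumptions, not the confirmed API.

```python
# Illustrative sketch only -- the parameter name below is an assumption,
# not the confirmed S3AssetUploader.upload_assets signature.
from pathlib import Path

# Hypothetical location for the local 'last seen on S3' cache database.
s3_check_cache_dir = Path.home() / ".deadline" / "cache"

# Before (conceptually): uploader.upload_assets(<existing arguments>)
# After (conceptually):  uploader.upload_assets(<existing arguments>,
#                            s3_check_cache_dir=str(s3_check_cache_dir))
```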
Force-pushed 9fe8dd0 to 988a5a8
What was the problem/requirement? (What/Why)
We currently do a `list_objects_v2` S3 API call to check the existence of files in the Job Attachments content-addressed storage. We originally added this because the paginated call ended up being faster than calling HeadObject for large submissions (we arbitrarily picked 100 files as the threshold to switch over to list objects at the time). However, this leads to terrible performance once the bucket has a large number of files, since `list_objects_v2` must iterate through the entire bucket to find the files we're looking for (although we can sometimes get lucky and find our files before iterating the whole bucket, if the submission doesn't contain any new files).

Also, we don't have a clear error message if a customer chooses to add their own customer-managed KMS key to the bucket encryption and forgets to add KMS permissions to the Queue role.
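For context, the two existence-check strategies being compared look roughly like the sketch below (plain boto3 calls; the bucket, prefix, and helper names are placeholders, not the library's actual code):

```python
# Sketch of the two existence-check strategies discussed above (boto3).
# Bucket/prefix/helper names are placeholders, not the library's actual code.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def exists_via_head_object(bucket: str, key: str) -> bool:
    """One HeadObject call per file: constant work per file, but one API call each."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise


def existing_keys_via_list(bucket: str, prefix: str) -> set[str]:
    """Paginated ListObjectsV2: fewer calls for big submissions, but it has to
    walk every object under the prefix, which degrades as the CAS bucket grows."""
    keys: set[str] = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys
```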
What was the solution? (How)
a. This caused me to restructure our code a bit, making a `caches` module, and moving all the hash cache stuff there, along with making a `CacheDB` base class to group the common code
b. I then added the `S3CheckCache` (I couldn't come up with a better name), which simply stores the S3 object key (including the bucket) and the time we saw it in S3
c. When we go to upload files, we check if the S3 key exists in the S3 cache, and if the entry hasn't expired (I set it to 30 days for now), then we assume the file still exists in S3 (a minimal sketch of this lookup follows this list)
d. I also updated our 404 S3 errors, as customers could get in a bit of a pickle if their local S3 cache says a file is in S3, but someone went in there and deleted the file. For now, customers would just have to delete their local S3 cache and resubmit, which would rebuild the cache.
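A minimal sketch of the cache lookup described in (b) and (c), assuming an SQLite-backed table keyed on the full S3 object key with a 30-day TTL; the actual `S3CheckCache` / `CacheDB` implementation in the `caches` module may differ in naming and schema:

```python
# Illustrative sketch only -- table/column names and the helper API are
# assumptions, not the actual S3CheckCache implementation.
import sqlite3
from datetime import datetime, timedelta, timezone

S3_CHECK_CACHE_TTL = timedelta(days=30)  # entries older than this are ignored


def connect(db_path: str) -> sqlite3.Connection:
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS s3_check_cache "
        "(s3_key TEXT PRIMARY KEY, last_seen_time TEXT)"
    )
    return con


def seen_recently(con: sqlite3.Connection, s3_key: str) -> bool:
    """Return True if this object (bucket + key) was seen in S3 within the TTL."""
    row = con.execute(
        "SELECT last_seen_time FROM s3_check_cache WHERE s3_key = ?", (s3_key,)
    ).fetchone()
    if row is None:
        return False
    last_seen = datetime.fromisoformat(row[0])
    return datetime.now(timezone.utc) - last_seen < S3_CHECK_CACHE_TTL


def mark_seen(con: sqlite3.Connection, s3_key: str) -> None:
    """Record that the object was confirmed to exist in S3 just now."""
    con.execute(
        "INSERT OR REPLACE INTO s3_check_cache VALUES (?, ?)",
        (s3_key, datetime.now(timezone.utc).isoformat()),
    )
    con.commit()
```

In this sketch, a key that passes `seen_recently` is skipped during upload; otherwise the file is uploaded (or confirmed in S3) and `mark_seen` records the new timestamp.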
What is the impact of this change?
We should have really speedy resubmissions for files that exist in S3 already. This should also save customers money on S3 API calls.
How was this change tested?
Was this change documented?
I plan on adding some 'troubleshooting 404 errors for Jobs' info to our docs to highlight that it could be due to the S3 cache being out of date and steps to manually delete the cache.
Is this a breaking change?
Yes