-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid another extra getObjectMetadata request #983
Avoid another extra getObjectMetadata request #983
Conversation
If client already knows last modified time for S3 object, there is no reason to pull object metadata from the S3 service inside TransferManager just to retrieve this information again.
Will you consider merging it? If not, what concerns do you have? |
@@ -77,7 +74,7 @@ public synchronized ObjectMetadata getObjectMetadata() { | |||
if (s3Object != null) { | |||
return s3Object.getObjectMetadata(); | |||
} | |||
return objectMetadata; | |||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be a breaking change as customers who are getting object metadata before will get null now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they wont. Null will only be returned to the customers who will be using the new API where object metadata is not retrieved before object is downloaded. After that even while using new API, object metadata will be populated properly. It's definitely not a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the s3Object is only populated for serial downloads. For parallel downloads, the s3Object will always be null. That is the reason we store object metadata and return it for parallel downloads.
* @see GetObjectRequest#GetObjectRequest(String, String, String) | ||
* @see GetObjectRequest#GetObjectRequest(String, String, boolean) | ||
*/ | ||
public GetObjectRequest(String bucketName, String key, Long lastModifiedTime) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We model the request and response classes as per service team (Amazon S3 in this case) specifications. The members in GetObjectRequest are based on the S3 Get Object specification. We cannot add it to the request class if we are not sending it over wire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if you can't accept changes in GetObjectRequest, the other approach is to modify doDownload in TransferManager and pass this timestamp as a separate parameter instead of part of GetObjectRequest. Will you accept such approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use object metadata in 3 different places during Download operation.
- Get last modification time.
- Get content length
- In DownloadImpl task, return metadata when requested by customer. See the other comment.
It is highly unlikely that customers will provide 1 and 2 values. Even then, we need object metadata to use in (3) for parallel downloads. The only solution I see to avoid this call is providing the object metadata as a parameter to download method. This is a rare case as most customers want to download an object from bucket/key values and won't have object metadata. So if we expect them to provide it, they have to make an additional call to get object metadata before calling download method which is not great user experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use case might looks like not being very likely, but it's legit use case and even AWS recommends to use external lookup tables for storing file data attributes outside of S3 (such as key, size, timestamp). Based on this and various other use cases, I can't see a reason why metadata needs to be pulled 100% of the time ahead of object itself. Yes, it might be needed sometimes, but it's not always needed and there is no way to avoid this extra roundtrip now (which btw was avoidable using earlier versions of AWS SDK, so it should be considered as a regression). Refer (for example to 1.9.25) to
aws-sdk-java/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/transfer/TransferManager.java
Line 747 in 5907106
private Download doDownload(final GetObjectRequest getObjectRequest, |
First, object metadata is available as soon as object download process is started as it's delivered as part of HEAD. For the purpose of pause/resume, there is no reason to pull the metadata beforehand as last modification time could be obtained from the object itself when pause action will be triggered because it's part of S3Object which is about to be put on pause (which might not even happened). If download not even started, it's not needed, right?
Content length could be already known to the client as part of listAvailableObjects() or by other measures and should be used instead of another round trip to the S3 (if available).
As for the point 3, it wasn't the case before (
Line 54 in 5907106
public ObjectMetadata getObjectMetadata() { |
Yes, probably the current solution offered by me is a bit straightforward and nicer one is required, but let's decide how to fix it. I'm open to suggestions. It doesn't make sense to sequentially pull the metadata for each object just to know information we already know...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand your concerns. I agree that making HEAD request to get metadata for each call might be excessive. I am worried that if we remove object metadata from DownloadImpl, it might be a breaking change. I have to think some more on how to fix it in a clean way. I don't have bandwidth this week to work on this. I will look at this issue next week and get back to you. Thanks.
I agree that it doesn't make sense to add lastModifiedTime to GetObjectRequest. However, what about adding a new overloaded version of TransferManager.download() that takes an ObjectMetadata class? If specified and non-null, it can be used in place of having doDownload() make its own GetObjectMetadataRequest. A similar approach can be used in TransferManager.copy() in order to resolve #988. |
The suggestion about passing ObjectMetadata to download() is not very appealing as it's not that the real object metadata is present (with all headers, etc) - it's only range & timestamp that is relevant for TransferManager here. So, aside from range & last modified date there is nothing to fill into it and it seems wrong to partially initialize the object when the whole purpose now for pulling it is to obtain range & timestamp. Range could be in GetObjectRequest and it leaves only timestamp as a potential initialized object inside this ObjectMetadata instance. Could it be better to just pass this timestamp to the overloaded version of download() and tweak doDownload to only do the pull of metadata if both range & timestamp are not present? If yes, I could create a new pull request for it. |
My suggestion was coming more from the perspective of fixing it this way in TransferManager.copy() (where you very well might already have a full ObjectMetadata instance that you could then pass to TransferManager.copy()) and then also fixing it similarly here in TransferManager.download(). You could conceivably already have a full ObjectMetadata object in this case too, right? I guess that's not quite what you had in mind though, which is why you are suggesting passing just the lastModifiedTime. Could you please describe a little more about how you might already have this lastModifiedTime but not a full ObjectMetadata object? (I'm not saying there's no case but am just curious what your exact use case is.) |
Oops, nvm, I just realized that you already explained this above. |
Sorry about the long delay on this one. Did we still want to iterate on this? We're mostly code frozen for transfer manager on 1.11.x at this point, other than for fixing bugs. We'll make a note to integrate this feedback when we implement 2.0.x's transfer manager: aws/aws-sdk-java-v2#37. Feel free to add any other transfer manager-specific feature requests on that issue as well. |
I think we can close this one and implement this as part of v2 instead. What is the estimated ETA for v2 to be in production status? |
Thanks! We haven't published a date we'll GA for V2. Once we figure it out, we'll give an announcement on a blog post. |
If client already knows last modified time for S3 object,
there is no reason to pull object metadata from the S3 service
inside TransferManager just to retrieve this information again.